CN114420197A - Connector design method of protein degradation targeting chimera based on self-encoder - Google Patents

Linker design method for a protein degradation targeting chimera based on an autoencoder

Info

Publication number
CN114420197A
CN114420197A (application CN202210002416.1A)
Authority
CN
China
Prior art keywords
node
vector
atom
feature vector
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210002416.1A
Other languages
Chinese (zh)
Inventor
邓岳 (Deng Yue)
李柯岑 (Li Kecen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210002416.1A priority Critical patent/CN114420197A/en
Publication of CN114420197A publication Critical patent/CN114420197A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C - COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 - Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 - Molecular design, e.g. of drugs
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C - COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 - Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 - Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder, which comprises the following steps: replace the linker of existing protein degradation targeting chimeras with various small molecules from a molecule library and remove the replacements that violate chemical rules, yielding a large number of pseudo-chimera molecules that are split into a training set and a validation set at a fixed ratio; design a linker design network and train it on the training set so that it reconstructs the linker of every input molecule well; fine-tune the model trained on the pseudo-chimera set on the set of real chimera molecules; validate the final trained network on the test set and test how well the decoder generates linkers from random samples drawn from a given distribution.

Description

Linker design method for a protein degradation targeting chimera based on an autoencoder
Technical Field
The invention relates to the technical field of protein degradation targeting chimeras, and in particular to a linker design method for protein degradation targeting chimeras based on an autoencoder.
Background
At present, traditional machine learning usually relies on a dataset containing a large amount of data and learns a prediction or regression model that fits this dataset, so that the resulting model generalizes well to both the training and test data. The variational autoencoder is a deep generative model based on variational inference and consists of an encoder and a decoder: the encoder maps the input to a specific distribution, a sample drawn from this distribution is fed to the decoder, and the decoder produces an output that approximately follows the same distribution as the input data. A protein degradation targeting chimera (PROTAC) is a heterobifunctional small-molecule compound consisting of three parts: a target-protein ligand, a linker and an E3 ligase ligand. It works by shortening the distance between the target protein and an intracellular E3 ubiquitin ligase so that the target protein is specifically degraded through the ubiquitin-proteasome pathway; the target-protein ligand of the chimera binds the target protein, the E3 ligase ligand binds the E3 ubiquitin ligase, and the linker connects the two ligands and has a certain influence on the binding strength of both parts.
However, amino acids remain by far the most common linking moieties in degradation-targeting chimeras, including aminocaproic acid, glycine and serine, and polyethylene glycol is also commonly used to increase the hydrophilicity of the chimera. Compared with the vast small-molecule space, the number of existing linkers is very small, so a method for rapidly discovering novel linkers in small-molecule space is very important. Existing machine-learning approaches to linker design fall mainly into two categories. The first represents each molecule directly as a character sequence with SMILES and predicts the linker structure with sequence-prediction algorithms; this approach exploits neither the chemical information of the molecule nor its topological structure. The second uses a graph neural network to extract molecular features and a conditional variational autoencoder to generate new chimeras; however, this approach requires the 3D structural information of the linker, fixes the maximum number of atoms before generation, does not start from the linker topology, and has poor interpretability.
Therefore, designing a linker design network that can be applied to the design of linkers of protein degradation targeting chimeras is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a linker design method for protein degradation targeting chimeras based on an autoencoder. Given a targeting chimera molecule for which a new linker is to be designed, the linker design network extracts the relevant information from the input molecule to predict the topology of the linker to be designed; this topology is a tree representation of the linker molecule, and on the basis of the predicted tree the nodes and edges are predicted one by one through message passing, thereby obtaining novel linker molecules.
In order to achieve the purpose, the invention adopts the following technical scheme:
A linker design method for a protein degradation targeting chimera based on an autoencoder comprises the following steps:
S1, obtaining a set of small molecules and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, processing the molecule set with a molecular cutting algorithm to obtain a training set, wherein each training sample comprises ligand 1, ligand 2 and the chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain feature vectors for ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain latent vector 1, latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, outputting the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the most probable linker topology;
S7, feeding the two latent vectors into fully connected layer 2 and, taking the corresponding connection point as the root node in each case, initializing the feature of every node of the predicted linker topology by breadth-first weighted propagation;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with the graph neural network each time a node is generated, until all linker nodes have been generated, at which point training of the network is complete;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, outputting the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and having the decoder output a new chimera structure.
Preferably, step S2 specifically includes:
Mark all acyclic single bonds that are not bonded to a hydrogen atom, forming a set of cleavable single bonds of size n, and traverse all n(n-1)/2 possible pairwise combinations of them. Cleaving the two single bonds of a combination yields a ligand-linker-ligand triplet, i.e. one molecule yields n(n-1)/2 triplets. For each triplet, compute the average drug-likeness of the two ligands as the drug-likeness of the triplet and the average synthetic accessibility of the linker and the two ligands as the synthetic accessibility of the triplet, and compute the rationality score of the cut as:

Score = μ·S_d + (1 - μ)·S_A

where S_d denotes the drug-likeness score, S_A denotes the synthetic-accessibility score, and μ is a hyperparameter balancing the two scores, set to 0.6. After the rationality scores of the triplets produced by all cutting schemes of each molecule have been computed, the highest-scoring portion is taken and added to the training set.
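For illustration, a minimal Python sketch of this triplet scoring, assuming RDKit's QED as the drug-likeness score S_d; the sa_score function is a hypothetical placeholder for a synthetic-accessibility estimator, which the text does not name:

```python
from rdkit import Chem
from rdkit.Chem import QED

MU = 0.6  # hyperparameter balancing drug-likeness and synthesizability (value given in the text)

def sa_score(mol):
    """Hypothetical stand-in returning a constant; replace with a real synthetic-accessibility
    estimator (e.g. the sascorer script in RDKit's Contrib directory) rescaled to [0, 1]."""
    return 0.5

def triplet_score(ligand1_smiles, linker_smiles, ligand2_smiles):
    lig1 = Chem.MolFromSmiles(ligand1_smiles)
    link = Chem.MolFromSmiles(linker_smiles)
    lig2 = Chem.MolFromSmiles(ligand2_smiles)
    s_d = (QED.qed(lig1) + QED.qed(lig2)) / 2                      # average drug-likeness of the ligands
    s_a = (sa_score(link) + sa_score(lig1) + sa_score(lig2)) / 3   # average synthetic accessibility
    return MU * s_d + (1 - MU) * s_a                               # Score = mu*S_d + (1 - mu)*S_A
```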
Preferably, step S3 specifically includes:
Each atom and each bond is represented by a one-hot code. There are three bond types in total (single, double and triple), so the one-hot code of each edge (bond) is a three-dimensional vector; for a single bond the first element is 1 and the other two elements are 0. Starting from these initial feature vectors, the information of every node is updated iteratively. For the N atoms of ligand 1, the feature vector of each atom u is updated by combining it with a weighted sum of the feature vectors h_v of the atoms bonded to it, where the weight of each neighbor's feature vector is a number between 0 and 1 obtained by passing e_uv, the feature vector of the bond between atom u and atom v, through a neural network layer nn(·), and η is a hyperparameter of the update.

After each atom has been updated a fixed number of times in this way, the feature vectors of all atoms are fused into the atom to be connected by a weighted sum

h = Σ_i w_i·h_i

where h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, and each weight w_i is determined by ε_i, the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
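A minimal PyTorch sketch of one such message-passing step and the distance-weighted readout; the exact mixing role of η and the form of the readout weights are assumptions, since the original formulas appear only as images:

```python
import torch
import torch.nn as nn

class BondGate(nn.Module):
    """Maps a 3-d one-hot bond feature e_uv to a scalar weight in (0, 1)."""
    def __init__(self, bond_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bond_dim, 1), nn.Sigmoid())

    def forward(self, e):
        return self.net(e)                               # shape (num_edges, 1)

def message_passing_step(h, edge_index, e, gate, eta=0.5):
    """One update: mix each atom's feature with the gated sum of its neighbors' features.

    h: (N, d) atom features; edge_index: (2, E) directed bonded pairs (u, v); e: (E, 3) bond one-hots.
    Using eta as the self/neighbor mixing coefficient is an assumption.
    """
    u, v = edge_index
    agg = torch.zeros_like(h)
    agg.index_add_(0, u, gate(e) * h[v])                 # weighted sum over each atom's neighbors
    return eta * h + (1 - eta) * agg

def fuse_to_attachment(h, shortest_path_len):
    """Fuse all atom features into the attachment atom; 1/(1 + distance) weights are an assumption."""
    w = 1.0 / (1.0 + shortest_path_len.float())
    w = w / w.sum()
    return (w.unsqueeze(-1) * h).sum(dim=0)              # final feature vector of the attachment atom
```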
Preferably, step S4 specifically includes:
Set the dimensionality of the latent vector and the specific distribution that the latent space obeys. With the vector dimensionality fixed, each molecule has two distribution-parameter vectors of that dimensionality, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:

μ = nn_μ(h)
σ = nn_σ(h)

The corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector of the set dimensionality is sampled from a standard multivariate normal distribution of the same dimensionality, multiplied element-wise by the predicted variance vector and added element-wise to the mean vector, giving the latent vector of the molecule:

z = μ + σ·ε.
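A PyTorch sketch of this sampling step (the reparameterization trick); predicting the log-variance instead of σ directly is an assumption made here for numerical stability:

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.nn_mu = nn.Linear(feat_dim, latent_dim)      # mu = nn_mu(h)
        self.nn_logvar = nn.Linear(feat_dim, latent_dim)  # predicts log(sigma^2)

    def forward(self, h):
        mu = self.nn_mu(h)
        logvar = self.nn_logvar(h)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)      # sample from a standard multivariate normal
        z = mu + sigma * eps               # z = mu + sigma * eps
        return z, mu, logvar
```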
preferably, the step S6 specifically includes:
and counting the number of topological structures of all chimeras appearing in the training set, and recording the number as N, so that the full-connection layer takes two hidden space vectors as input, outputs an N-dimensional vector normalized by softmax, and the element value of each dimension represents the prediction probability of one topological structure.
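A sketch of this topology classification head; applying the same shared layer to each latent vector separately follows the detailed description later in the text, and averaging the two predictions is an assumption:

```python
import torch
import torch.nn as nn

class TopologyHead(nn.Module):
    """Shared fully connected layer scoring the N linker topologies seen in the training set."""
    def __init__(self, latent_dim, num_topologies):
        super().__init__()
        self.fc = nn.Linear(latent_dim, num_topologies)

    def forward(self, z1, z2):
        p1 = torch.softmax(self.fc(z1), dim=-1)   # prediction from latent vector 1
        p2 = torch.softmax(self.fc(z2), dim=-1)   # prediction from latent vector 2
        return 0.5 * (p1 + p2)                    # assumed way of combining the two predictions

# usage sketch:
# topo = TopologyHead(latent_dim=64, num_topologies=N)
# best_topology = int(topo(z1, z2).argmax(-1))    # initialize the new chimera with this topology
```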
Preferably, step S7 specifically includes:
Pass the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1^(0) and h_2^(0) of the two atoms to be connected. Based on these two initial feature vectors, the feature vector of each node i of the linker molecule to be generated is initialized as a weighted combination of h_1^(0) and h_2^(0), with weights determined by ε_i1, the shortest distance from node i to connection point 1, and ε_i2, the shortest distance from node i to connection point 2.
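A sketch of this node initialization; weighting each endpoint feature by the node's distance to the opposite endpoint is an assumption, since the exact formula appears only as an image in the original:

```python
import torch

def init_linker_node_features(h1_0, h2_0, dist_to_1, dist_to_2):
    """Initialize linker node features from the two attachment-point features.

    h1_0, h2_0: (d,) initial features of connection points 1 and 2;
    dist_to_1, dist_to_2: (n,) shortest-path distances of each linker node to each connection point.
    """
    d1, d2 = dist_to_1.float(), dist_to_2.float()
    w1 = d2 / (d1 + d2)          # nodes closer to point 1 lean more on h1_0 (assumed scheme)
    w2 = d1 / (d1 + d2)
    return w1.unsqueeze(-1) * h1_0 + w2.unsqueeze(-1) * h2_0     # (n, d) initial node features
```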
Preferably, step S8 specifically includes:
The specific type of each node is predicted in breadth-first order. Given the feature vector h_i of a node, a fully connected layer outputs an m-dimensional vector that is normalized by softmax into a probability distribution over the node types. After the specific type of a node has been predicted, its feature vector is self-updated once by linearly mixing h_i^(old), the feature vector of the node before the prediction, with x_i, the one-hot code of the predicted node type, using a hyperparameter μ between 0 and 1.

After a new node type has been predicted, the network predicts which two atoms of the two nodes should be bonded. When predicting which atom of node 1 should be connected, the same fully connected layer is applied to every atom of node 1, taking the feature vector of that atom (the feature vector of every atom is set in advance) together with the feature vector of node 2 as input and outputting the probability that this atom of node 1 is the one being connected; the atom of node 2 is predicted symmetrically.

Each time a new node type is predicted, the feature vector of every adjacent node that has not yet been predicted is updated once by linearly mixing the feature vector of the newly connected node (which has already been self-updated) with the original feature vector of the neighbor node, using a hyperparameter λ between 0 and 1.
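A condensed sketch of the breadth-first decoding loop described above; the learned type embedding standing in for the one-hot type code and the mixing directions of μ and λ are assumptions:

```python
import torch
import torch.nn as nn
from collections import deque

class NodeDecoder(nn.Module):
    """Breadth-first node-type decoder; an inference-time sketch (in-place updates skip autograd)."""
    def __init__(self, d, num_node_types, mu=0.5, lam=0.5):
        super().__init__()
        self.type_fc = nn.Linear(d, num_node_types)        # m-way node-type head, softmax-normalized
        self.type_embed = nn.Embedding(num_node_types, d)  # learned stand-in for the one-hot type code
        self.mu, self.lam = mu, lam

    @torch.no_grad()
    def decode(self, node_feats, adjacency, root):
        """node_feats: (n, d) initialized node features; adjacency: list of neighbor lists; root: index."""
        queue, visited, types = deque([root]), {root}, {}
        while queue:
            i = queue.popleft()
            probs = torch.softmax(self.type_fc(node_feats[i]), dim=-1)
            types[i] = int(probs.argmax())
            # self-update: mix the pre-prediction feature with the code of the predicted type
            node_feats[i] = self.mu * node_feats[i] + (1 - self.mu) * self.type_embed.weight[types[i]]
            for j in adjacency[i]:
                if j not in visited:
                    # propagate the newly predicted node's information to unpredicted neighbors
                    node_feats[j] = self.lam * node_feats[j] + (1 - self.lam) * node_feats[i]
                    visited.add(j)
                    queue.append(j)
        return types
```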
According to the above technical scheme, compared with the prior art, the invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder. Given a targeting chimera molecule for which a new linker is to be designed, the linker design network extracts the relevant information from the input molecule to predict the topology of the linker to be designed; this topology is a tree representation of the linker molecule, and on the basis of the predicted tree the nodes and edges are predicted one by one through message passing, thereby obtaining novel linker molecules.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flow diagram of the method provided by the present invention.
FIG. 2 is a schematic flow diagram of the dataset preparation provided by the present invention.
FIG. 3 is a schematic diagram of the molecular cutting and the resulting triplets provided by the present invention.
FIG. 4 is a schematic diagram of the network structure provided by the present invention.
FIG. 5 is a schematic diagram of the generation of a new linker provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder, which comprises the following steps:
S1, obtaining a set of small molecules and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, processing the molecule set with a molecular cutting algorithm to obtain a training set, wherein each training sample comprises ligand 1, ligand 2 and the chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain feature vectors for ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain latent vector 1, latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, outputting the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the most probable linker topology;
S7, feeding the two latent vectors into fully connected layer 2 and, taking the corresponding connection point as the root node in each case, initializing the feature of every node of the predicted linker topology by breadth-first weighted propagation;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with the graph neural network each time a node is generated, until all linker nodes have been generated, at which point training of the network is complete;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, outputting the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and having the decoder output a new chimera structure.
To further optimize the above technical solution, step S2 specifically includes:
Mark all acyclic single bonds that are not bonded to a hydrogen atom, forming a set of cleavable single bonds of size n, and traverse all n(n-1)/2 possible pairwise combinations of them. Cleaving the two single bonds of a combination yields a ligand-linker-ligand triplet, i.e. one molecule yields n(n-1)/2 triplets. For each triplet, compute the average drug-likeness of the two ligands as the drug-likeness of the triplet and the average synthetic accessibility of the linker and the two ligands as the synthetic accessibility of the triplet, and compute the rationality score of the cut as:

Score = μ·S_d + (1 - μ)·S_A

where S_d denotes the drug-likeness score, S_A denotes the synthetic-accessibility score, and μ is a hyperparameter balancing the two scores, set to 0.6. After the rationality scores of the triplets produced by all cutting schemes of each molecule have been computed, the highest-scoring portion is taken and added to the training set.
To further optimize the above technical solution, step S3 specifically includes:
Each atom and each bond is represented by a one-hot code. There are three bond types in total (single, double and triple), so the one-hot code of each edge (bond) is a three-dimensional vector; for a single bond the first element is 1 and the other two elements are 0. Starting from these initial feature vectors, the information of every node is updated iteratively. For the N atoms of ligand 1, the feature vector of each atom u is updated by combining it with a weighted sum of the feature vectors h_v of the atoms bonded to it, where the weight of each neighbor's feature vector is a number between 0 and 1 obtained by passing e_uv, the feature vector of the bond between atom u and atom v, through a neural network layer nn(·), and η is a hyperparameter of the update.

After each atom has been updated a fixed number of times in this way, the feature vectors of all atoms are fused into the atom to be connected by a weighted sum

h = Σ_i w_i·h_i

where h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, and each weight w_i is determined by ε_i, the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
To further optimize the above technical solution, step S4 specifically includes:
Set the dimensionality of the latent vector and the specific distribution that the latent space obeys. With the vector dimensionality fixed, each molecule has two distribution-parameter vectors of that dimensionality, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:

μ = nn_μ(h)
σ = nn_σ(h)

The corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector of the set dimensionality is sampled from a standard multivariate normal distribution of the same dimensionality, multiplied element-wise by the predicted variance vector and added element-wise to the mean vector, giving the latent vector of the molecule:

z = μ + σ·ε.
To further optimize the above technical solution, step S6 specifically includes:
Count the number of distinct topologies of all chimeras appearing in the training set and denote it N; the fully connected layer then takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each dimension is the predicted probability of one topology.
To further optimize the above technical solution, step S7 specifically includes:
Pass the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1^(0) and h_2^(0) of the two atoms to be connected. Based on these two initial feature vectors, the feature vector of each node i of the linker molecule to be generated is initialized as a weighted combination of h_1^(0) and h_2^(0), with weights determined by ε_i1, the shortest distance from node i to connection point 1, and ε_i2, the shortest distance from node i to connection point 2.
To further optimize the above technical solution, step S8 specifically includes:
The specific type of each node is predicted in breadth-first order. Given the feature vector h_i of a node, a fully connected layer outputs an m-dimensional vector that is normalized by softmax into a probability distribution over the node types. After the specific type of a node has been predicted, its feature vector is self-updated once by linearly mixing h_i^(old), the feature vector of the node before the prediction, with x_i, the one-hot code of the predicted node type, using a hyperparameter μ between 0 and 1.

After a new node type has been predicted, the network predicts which two atoms of the two nodes should be bonded. When predicting which atom of node 1 should be connected, the same fully connected layer is applied to every atom of node 1, taking the feature vector of that atom (the feature vector of every atom is set in advance) together with the feature vector of node 2 as input and outputting the probability that this atom of node 1 is the one being connected; the atom of node 2 is predicted symmetrically.

Each time a new node type is predicted, the feature vector of every adjacent node that has not yet been predicted is updated once by linearly mixing the feature vector of the newly connected node (which has already been self-updated) with the original feature vector of the neighbor node, using a hyperparameter λ between 0 and 1.
Download a subset of small molecules from a small-molecule dataset website (ChEMBL, ZINC);
compute the Lipinski rule-of-five properties, toxic groups and synthetic accessibility of each molecule, and perform a primary screen to obtain a subset with strong drug-likeness;
generate the training set with a molecular cutting algorithm, where the input of each training sample is the two ligands and the label is the whole original molecule, i.e. the two ligands plus the linker; the cutting uses the molecular cutting algorithm proposed by Hussain, so that the cut triplets retain as many of the effective groups of the original molecule as possible;
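A simplified RDKit sketch of the double-cut enumeration; it does not reproduce Hussain's algorithm exactly (for example, it does not identify which fragment is the linker or track attachment points):

```python
from itertools import combinations
from rdkit import Chem

def enumerate_triplets(smiles):
    """Yield three-fragment SMILES splits obtained from every pair of cleavable single bonds."""
    mol = Chem.MolFromSmiles(smiles)
    cuttable = [b.GetIdx() for b in mol.GetBonds()
                if b.GetBondType() == Chem.BondType.SINGLE and not b.IsInRing()]
    for b1, b2 in combinations(cuttable, 2):
        frag_mol = Chem.FragmentOnBonds(mol, [b1, b2], addDummies=True)
        frags = Chem.GetMolFrags(frag_mol, asMols=True)
        if len(frags) == 3:                       # a ligand-linker-ligand style split
            yield tuple(Chem.MolToSmiles(f) for f in frags)
```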
the whole network mainly consists of an encoder and a decoder.
(1) Encoder for encoding a video signal
Respectively transmitting information of the ligand 1, the ligand 2 and the original molecule by using a graph neural network to obtain a characteristic vector of each node;
the characteristics of the connection point 1 and the connection point 2 are characteristic vectors of nodes needing to be connected in the two ligands, and the characteristics of the chimera are the average value of the characteristic vectors of all the nodes of the chimera;
the three characteristics are respectively processed by a full connection layer to obtain parameters of given distribution, Gaussian distribution is adopted to obtain a mean value and a variance, and sampling is carried out on the obtained corresponding Gaussian distribution to obtain a hidden space vector 1, a vector 2 and a conditional hidden space vector;
(2) decoder
Splicing the conditional implicit space vectors obtained by the decoder with the implicit space vectors of the two connection points respectively to obtain two decoding-wearing implicit space vectors;
respectively transmitting the two hidden space vectors into the same full-connection layer, respectively outputting the prediction probability of a topological structure of a connection body, and initializing a new chimera by taking the topological structure with the highest probability;
respectively transmitting the two hidden space vectors into the same full-connection layer (different from the full-connection layer), respectively outputting the characteristics of each node by taking the corresponding connection point as a root node and weighting and initializing the predicted connection structure by breadth first;
randomly selecting a connection point as a root node, predicting the node type of each node with breadth first, updating node information once by using a graph neural network when a node is newly generated until all nodes of a connector are generated;
inputting two ligands with connection (a connection point needs to be given) and a pseudo-chimera, wherein the structure of a connecting body in the pseudo-chimera is considered to be given;
the encoder outputs the implicit space vectors and the conditional implicit space vectors of the two ligand connection points as the input of the decoder;
the decoder outputs a new mosaic structure;
loss function
Loss function of network is based on reconstruction loss LreconAnd distribution loss LKLThe two parts are as follows:
L=LreconKLLKL
wherein λKLIs a manually set hyper-parameter.
Reconstruction loss
The reconstruction loss consists of three parts: the linker topology prediction loss, the node prediction loss and the edge prediction loss:

L_recon = -Σ_{i=1}^{k} y_topo,i·log(ŷ_topo,i) - Σ_{j=1}^{n} Σ_{i=1}^{m} y_node,j,i·log(ŷ_node,j,i) - Σ_{j=1}^{n+1} Σ_{i=1}^{3} y_edge,j,i·log(ŷ_edge,j,i)

The first term is the linker topology prediction loss, where k is the total number of topologies; y_topo,i is the ground-truth probability that the linker has the i-th topology, i.e. 1 if the linker belongs to the i-th class and 0 otherwise, and ŷ_topo,i is the probability predicted by the network that the linker has the i-th topology.

The middle term is the node type prediction loss, where m is the total number of node types in the training set and n is the total number of nodes in the predicted topology; y_node,j,i is the ground-truth probability that the j-th node of the linker is of the i-th node type, i.e. 1 if the node belongs to the i-th class and 0 otherwise, and ŷ_node,j,i is the probability predicted by the network that the j-th node belongs to the i-th class.

The last term is the edge type prediction loss; the "3" indicates that there are three edge types in total: single, double and triple bonds. After each node is generated, the type of the edge connecting the existing structure to the new node must be predicted, and after all nodes have been predicted, the type of the edge connecting to the other ligand must also be predicted, so n + 1 edges are predicted in total. Likewise, y_edge,j,i is the ground-truth probability that the j-th edge is of the i-th edge type, and ŷ_edge,j,i is the probability predicted by the network that the j-th edge belongs to the i-th class.
Distribution loss
The distribution loss uses the classical KL-divergence loss:

L_KL = -(1/2)·Σ_{j=1}^{J} (1 + log σ_j² - μ_j² - σ_j²)

where J is a manually chosen hyperparameter giving the dimensionality of the latent space, each dimension corresponding to a Gaussian distribution, and μ_j and σ_j represent the mean and the standard deviation (σ_j² the variance) of the Gaussian distribution predicted by the network in dimension j. This loss term encourages the encoder to map the input to a standard normal distribution, allowing the decoder to generate more novel linkers by sampling from that Gaussian distribution.
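A PyTorch sketch of the combined loss, assuming the network outputs raw logits and the targets are given as class indices so that cross_entropy implements the one-hot sums above; the default λ_KL value here is arbitrary:

```python
import torch
import torch.nn.functional as F

def chimera_loss(topo_logits, topo_target,
                 node_logits, node_targets,
                 edge_logits, edge_targets,
                 mu, logvar, lambda_kl=0.1):
    """topo_logits: (k,); node_logits: (n, m); edge_logits: (n+1, 3); targets are class indices."""
    l_topo = F.cross_entropy(topo_logits.unsqueeze(0), topo_target.view(1))
    l_node = F.cross_entropy(node_logits, node_targets)
    l_edge = F.cross_entropy(edge_logits, edge_targets)
    l_recon = l_topo + l_node + l_edge
    # KL divergence between N(mu, sigma^2) and the standard normal, summed over latent dimensions
    l_kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return l_recon + lambda_kl * l_kl
```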
The advantages of the application are:
New linkers are generated starting from the topology of the linker. The influence of the linker on the two ligands is mainly the indirect influence of its topology, so generating new linkers from this starting point is more instructive for concrete application scenarios, yet neither of the two existing linker design approaches considers this point;
the method does not require the 3D structural information of a large number of molecules in the training set, and therefore has broader applicability;
although the background of the method is the design of chimera linkers, it can easily be generalized to applications such as scaffold hopping and lead optimization, and is therefore rich in functionality.
the embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A linker design method for a protein degradation targeting chimera based on an autoencoder, characterized by comprising the following steps:
S1, obtaining a set of small molecules and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, processing the molecule set with a molecular cutting algorithm to obtain a training set, wherein each training sample comprises ligand 1, ligand 2 and the chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain feature vectors for ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain latent vector 1, latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, outputting the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the most probable linker topology;
S7, feeding the two latent vectors into fully connected layer 2 and, taking the corresponding connection point as the root node in each case, initializing the feature of every node of the predicted linker topology by breadth-first weighted propagation;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with the graph neural network each time a node is generated, until all linker nodes have been generated, at which point training of the network is complete;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, outputting the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and having the decoder output a new chimera structure.
2. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S2 specifically includes:
marking all acyclic single bonds that are not bonded to a hydrogen atom, forming a set of cleavable single bonds of size n, and traversing all n(n-1)/2 possible pairwise combinations of them; cleaving the two single bonds of a combination yields a ligand-linker-ligand triplet, i.e. one molecule yields n(n-1)/2 triplets; for each triplet, computing the average drug-likeness of the two ligands as the drug-likeness of the triplet and the average synthetic accessibility of the linker and the two ligands as the synthetic accessibility of the triplet, and computing the rationality score of the cut as:

Score = μ·S_d + (1 - μ)·S_A

where S_d denotes the drug-likeness score, S_A denotes the synthetic-accessibility score, and μ is a hyperparameter balancing the two scores, set to 0.6; after the rationality scores of the triplets produced by all cutting schemes of each molecule have been computed, the highest-scoring portion is taken and added to the training set.
3. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S3 specifically includes:
representing each atom and each bond by a one-hot code, there being three bond types in total (single, double and triple), so that the one-hot code of each edge (bond) is a three-dimensional vector, with the first element 1 and the other two elements 0 for a single bond; iteratively updating the information of every node starting from these initial feature vectors; for the N atoms of ligand 1, updating the feature vector of each atom u by combining it with a weighted sum of the feature vectors h_v of the atoms bonded to it, where the weight of each neighbor's feature vector is a number between 0 and 1 obtained by passing e_uv, the feature vector of the bond between atom u and atom v, through a neural network layer nn(·), and η is a hyperparameter of the update;

after each atom has been updated a fixed number of times in this way, fusing the feature vectors of all atoms into the atom to be connected by a weighted sum

h = Σ_i w_i·h_i

where h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, and each weight w_i is determined by ε_i, the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
4. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S4 specifically includes:
setting the dimensionality of the latent vector and the specific distribution that the latent space obeys, so that each molecule has two distribution-parameter vectors of the set dimensionality, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:

μ = nn_μ(h)
σ = nn_σ(h)

obtaining the corresponding latent feature vector by sampling from the Gaussian distribution determined by these parameters: a vector of the set dimensionality is sampled from a standard multivariate normal distribution of the same dimensionality, multiplied element-wise by the predicted variance vector and added element-wise to the mean vector, giving the latent vector of the molecule:

z = μ + σ·ε.
5. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S6 specifically includes:
counting the number of distinct topologies of all chimeras appearing in the training set and denoting it N, so that the fully connected layer takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each dimension is the predicted probability of one topology.
6. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S7 specifically includes:
passing the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1^(0) and h_2^(0) of the two atoms to be connected; based on these two initial feature vectors, initializing the feature vector of each node i of the linker molecule to be generated as a weighted combination of h_1^(0) and h_2^(0), with weights determined by ε_i1, the shortest distance from node i to connection point 1, and ε_i2, the shortest distance from node i to connection point 2.
7. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S8 specifically includes:
predicting the specific type of each node in breadth-first order: given the feature vector h_i of a node, a fully connected layer outputs an m-dimensional vector that is normalized by softmax into a probability distribution over the node types; after the specific type of a node has been predicted, its feature vector is self-updated once by linearly mixing h_i^(old), the feature vector of the node before the prediction, with x_i, the one-hot code of the predicted node type, using a hyperparameter μ between 0 and 1;

after a new node type has been predicted, predicting which two atoms of the two nodes should be bonded: when predicting which atom of node 1 should be connected, the same fully connected layer is applied to every atom of node 1, taking the feature vector of that atom (the feature vector of every atom is set in advance) together with the feature vector of node 2 as input and outputting the probability that this atom of node 1 is the one being connected, and the atom of node 2 is predicted symmetrically;

each time a new node type is predicted, updating the feature vector of every adjacent node that has not yet been predicted once by linearly mixing the feature vector of the newly connected node (which has already been self-updated) with the original feature vector of the neighbor node, using a hyperparameter λ between 0 and 1.
CN202210002416.1A 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder Pending CN114420197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210002416.1A CN114420197A (en) 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210002416.1A CN114420197A (en) 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder

Publications (1)

Publication Number Publication Date
CN114420197A true CN114420197A (en) 2022-04-29

Family

ID=81271181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210002416.1A Pending CN114420197A (en) 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder

Country Status (1)

Country Link
CN (1) CN114420197A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312855A (en) * 2023-02-28 2023-06-23 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound
CN116312855B (en) * 2023-02-28 2023-09-08 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound
CN116597892A (en) * 2023-05-15 2023-08-15 之江实验室 Model training method and molecular structure information recommending method and device
CN116597892B (en) * 2023-05-15 2024-03-19 之江实验室 Model training method and molecular structure information recommending method and device

Similar Documents

Publication Publication Date Title
US11544530B2 (en) Self-attentive attributed network embedding
US11537898B2 (en) Generative structure-property inverse computational co-design of materials
CN103049792B (en) Deep-neural-network distinguish pre-training
CN107103331B (en) Image fusion method based on deep learning
CN114420197A (en) Connector design method of protein degradation targeting chimera based on self-encoder
CN111949865A (en) Interest point recommendation method based on graph neural network and user long-term and short-term preference
US20060212279A1 (en) Methods for efficient solution set optimization
Yu et al. Evolutionary fuzzy neural networks for hybrid financial prediction
EP1818862A1 (en) Combining model-based and genetics-based offspring generation for multi-objective optimization using a convergence criterion
CN113780002A (en) Knowledge reasoning method and device based on graph representation learning and deep reinforcement learning
CN113792110A (en) Equipment trust value evaluation method based on social networking services
CN115659807A (en) Method for predicting talent performance based on Bayesian optimization model fusion algorithm
CN114999565A (en) Drug target affinity prediction method based on representation learning and graph neural network
Li et al. Foldingzero: Protein folding from scratch in hydrophobic-polar model
CN113205181A (en) Graph combination optimization problem solving method based on deep graph learning
Zhou Explainable ai in request-for-quote
CN111126607B (en) Data processing method, device and system for model training
Lin et al. An efficient hybrid Taguchi-genetic algorithm for protein folding simulation
CN106355000A (en) Scaffolding method based on statistical characteristic of double-end insert size
CN116564555A (en) Drug interaction prediction model construction method based on deep memory interaction
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
EP4170556A1 (en) Control device, method, and program
CN110119779B (en) Cross-network data arbitrary dimension fusion method and device based on self-encoder
CN109858520B (en) Multi-layer semi-supervised classification method
Wong et al. Rainfall prediction using neural fuzzy technique

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination