CN114420197A - Connector design method of protein degradation targeting chimera based on self-encoder - Google Patents

Linker design method for a protein degradation targeting chimera based on an autoencoder

Info

Publication number
CN114420197A
CN114420197A (application CN202210002416.1A)
Authority
CN
China
Prior art keywords
node
vector
atom
feature vector
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210002416.1A
Other languages
Chinese (zh)
Inventor
邓岳 (Deng Yue)
李柯岑 (Li Kecen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210002416.1A priority Critical patent/CN114420197A/en
Publication of CN114420197A publication Critical patent/CN114420197A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C - COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 - Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 - Molecular design, e.g. of drugs
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C - COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 - Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 - Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder, which comprises the following steps: replace the linker of existing protein degradation targeting chimeras with various small molecules from a molecule library and remove the replacements that violate chemical rules, yielding a large number of pseudo-chimera molecules that are split into a training set and a validation set at a fixed ratio; design a linker design network and train it on the training set so that it reconstructs the linker of every input molecule well; fine-tune the model trained on the pseudo-chimera set on the set of real chimera molecules; validate the final trained network on the test set and test how well the decoder generates linkers from random samples drawn from a given distribution.

Description

Linker design method for a protein degradation targeting chimera based on an autoencoder
Technical Field
The invention relates to the technical field of protein degradation targeting chimeras, and in particular to a linker design method for protein degradation targeting chimeras based on an autoencoder.
Background
At present, traditional machine learning usually relies on a dataset containing a large amount of data and learns a prediction or regression model that fits this dataset, so that the resulting model generalizes well to both the training and test data. The variational autoencoder is a deep generative model based on variational inference and consists of an encoder and a decoder: the encoder maps the input to a specific distribution, a sample drawn from this distribution is fed to the decoder, and the decoder produces an output that approximately follows the same distribution as the input data. A protein degradation targeting chimera (PROTAC) is a heterobifunctional small-molecule compound consisting of three parts: a target-protein ligand, a linker and an E3 ligase ligand. It works by shortening the distance between the target protein and an intracellular E3 ubiquitin ligase so that the target protein is specifically degraded through the ubiquitin-proteasome pathway; the target-protein ligand of the chimera binds the target protein, the E3 ligase ligand binds the E3 ubiquitin ligase, and the linker connects the two ligands and has a certain influence on the binding strength of both parts.
However, amino acids remain by far the most common linking moieties in degradation-targeting chimeras, including aminocaproic acid, glycine and serine, and polyethylene glycol is also commonly used to increase the hydrophilicity of the chimera. Compared with the vast small-molecule space, the number of existing linkers is very small, so a method for rapidly discovering novel linkers in small-molecule space is very important. Existing machine-learning approaches to linker design fall mainly into two categories. The first represents each molecule directly as a character sequence with SMILES and predicts the linker structure with sequence-prediction algorithms; this approach exploits neither the chemical information of the molecule nor its topological structure. The second uses a graph neural network to extract molecular features and a conditional variational autoencoder to generate new chimeras; however, this approach requires the 3D structural information of the linker, fixes the maximum number of atoms before generation, does not start from the linker topology, and has poor interpretability.
Therefore, designing a linker design network that can be applied to the design of linkers of protein degradation targeting chimeras is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a linker design method for protein degradation targeting chimeras based on an autoencoder. Given a targeting chimera molecule for which a new linker is to be designed, the linker design network extracts the relevant information from the input molecule to predict the topology of the linker to be designed; this topology is a tree representation of the linker molecule, and on the basis of the predicted tree the nodes and edges are predicted one by one through message passing, thereby obtaining novel linker molecules.
In order to achieve the purpose, the invention adopts the following technical scheme:
A linker design method for a protein degradation targeting chimera based on an autoencoder comprises the following steps:
S1, obtaining a set of small molecules and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, processing the molecule set with a molecular cutting algorithm to obtain a training set, wherein each training sample comprises ligand 1, ligand 2 and the chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain feature vectors for ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain latent vector 1, latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, outputting the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the most probable linker topology;
S7, feeding the two latent vectors into fully connected layer 2 and, taking the corresponding connection point as the root node in each case, initializing the feature of every node of the predicted linker topology by breadth-first weighted propagation;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with the graph neural network each time a node is generated, until all linker nodes have been generated, at which point training of the network is complete;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, outputting the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and having the decoder output a new chimera structure.
Preferably, step S2 specifically includes:
Mark all acyclic single bonds that are not bonded to a hydrogen atom, forming a set of cleavable single bonds of size n, and traverse all n(n-1)/2 possible pairwise combinations of them. Cleaving the two single bonds of a combination yields a ligand-linker-ligand triplet, i.e. one molecule yields n(n-1)/2 triplets. For each triplet, compute the average drug-likeness of the two ligands as the drug-likeness of the triplet and the average synthetic accessibility of the linker and the two ligands as the synthetic accessibility of the triplet, and compute the rationality score of the cut as:

Score = μ·S_d + (1 - μ)·S_A

where S_d denotes the drug-likeness score, S_A denotes the synthetic-accessibility score, and μ is a hyperparameter balancing the two scores, set to 0.6. After the rationality scores of the triplets produced by all cutting schemes of each molecule have been computed, the highest-scoring portion is taken and added to the training set.
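For illustration, a minimal Python sketch of this triplet scoring, assuming RDKit's QED as the drug-likeness score S_d; the sa_score function is a hypothetical placeholder for a synthetic-accessibility estimator, which the text does not name:

```python
from rdkit import Chem
from rdkit.Chem import QED

MU = 0.6  # hyperparameter balancing drug-likeness and synthesizability (value given in the text)

def sa_score(mol):
    """Hypothetical stand-in returning a constant; replace with a real synthetic-accessibility
    estimator (e.g. the sascorer script in RDKit's Contrib directory) rescaled to [0, 1]."""
    return 0.5

def triplet_score(ligand1_smiles, linker_smiles, ligand2_smiles):
    lig1 = Chem.MolFromSmiles(ligand1_smiles)
    link = Chem.MolFromSmiles(linker_smiles)
    lig2 = Chem.MolFromSmiles(ligand2_smiles)
    s_d = (QED.qed(lig1) + QED.qed(lig2)) / 2                      # average drug-likeness of the ligands
    s_a = (sa_score(link) + sa_score(lig1) + sa_score(lig2)) / 3   # average synthetic accessibility
    return MU * s_d + (1 - MU) * s_a                               # Score = mu*S_d + (1 - mu)*S_A
```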
Preferably, step S3 specifically includes:
Each atom and each bond is represented by a one-hot code. There are three bond types in total (single, double and triple), so the one-hot code of each edge (bond) is a three-dimensional vector; for a single bond the first element is 1 and the other two elements are 0. Starting from these initial feature vectors, the information of every node is updated iteratively. For the N atoms of ligand 1, the feature vector of each atom u is updated by combining it with a weighted sum of the feature vectors h_v of the atoms bonded to it, where the weight of each neighbor's feature vector is a number between 0 and 1 obtained by passing e_uv, the feature vector of the bond between atom u and atom v, through a neural network layer nn(·), and η is a hyperparameter of the update.

After each atom has been updated a fixed number of times in this way, the feature vectors of all atoms are fused into the atom to be connected by a weighted sum

h = Σ_i w_i·h_i

where h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, and each weight w_i is determined by ε_i, the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
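A minimal PyTorch sketch of one such message-passing step and the distance-weighted readout; the exact mixing role of η and the form of the readout weights are assumptions, since the original formulas appear only as images:

```python
import torch
import torch.nn as nn

class BondGate(nn.Module):
    """Maps a 3-d one-hot bond feature e_uv to a scalar weight in (0, 1)."""
    def __init__(self, bond_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bond_dim, 1), nn.Sigmoid())

    def forward(self, e):
        return self.net(e)                               # shape (num_edges, 1)

def message_passing_step(h, edge_index, e, gate, eta=0.5):
    """One update: mix each atom's feature with the gated sum of its neighbors' features.

    h: (N, d) atom features; edge_index: (2, E) directed bonded pairs (u, v); e: (E, 3) bond one-hots.
    Using eta as the self/neighbor mixing coefficient is an assumption.
    """
    u, v = edge_index
    agg = torch.zeros_like(h)
    agg.index_add_(0, u, gate(e) * h[v])                 # weighted sum over each atom's neighbors
    return eta * h + (1 - eta) * agg

def fuse_to_attachment(h, shortest_path_len):
    """Fuse all atom features into the attachment atom; 1/(1 + distance) weights are an assumption."""
    w = 1.0 / (1.0 + shortest_path_len.float())
    w = w / w.sum()
    return (w.unsqueeze(-1) * h).sum(dim=0)              # final feature vector of the attachment atom
```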
Preferably, step S4 specifically includes:
Set the dimensionality of the latent vector and the specific distribution that the latent space obeys. With the vector dimensionality fixed, each molecule has two distribution-parameter vectors of that dimensionality, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:

μ = nn_μ(h)
σ = nn_σ(h)

The corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector of the set dimensionality is sampled from a standard multivariate normal distribution of the same dimensionality, multiplied element-wise by the predicted variance vector and added element-wise to the mean vector, giving the latent vector of the molecule:

z = μ + σ·ε.
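A PyTorch sketch of this sampling step (the reparameterization trick); predicting the log-variance instead of σ directly is an assumption made here for numerical stability:

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.nn_mu = nn.Linear(feat_dim, latent_dim)      # mu = nn_mu(h)
        self.nn_logvar = nn.Linear(feat_dim, latent_dim)  # predicts log(sigma^2)

    def forward(self, h):
        mu = self.nn_mu(h)
        logvar = self.nn_logvar(h)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)      # sample from a standard multivariate normal
        z = mu + sigma * eps               # z = mu + sigma * eps
        return z, mu, logvar
```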
preferably, the step S6 specifically includes:
and counting the number of topological structures of all chimeras appearing in the training set, and recording the number as N, so that the full-connection layer takes two hidden space vectors as input, outputs an N-dimensional vector normalized by softmax, and the element value of each dimension represents the prediction probability of one topological structure.
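A sketch of this topology classification head; applying the same shared layer to each latent vector separately follows the detailed description later in the text, and averaging the two predictions is an assumption:

```python
import torch
import torch.nn as nn

class TopologyHead(nn.Module):
    """Shared fully connected layer scoring the N linker topologies seen in the training set."""
    def __init__(self, latent_dim, num_topologies):
        super().__init__()
        self.fc = nn.Linear(latent_dim, num_topologies)

    def forward(self, z1, z2):
        p1 = torch.softmax(self.fc(z1), dim=-1)   # prediction from latent vector 1
        p2 = torch.softmax(self.fc(z2), dim=-1)   # prediction from latent vector 2
        return 0.5 * (p1 + p2)                    # assumed way of combining the two predictions

# usage sketch:
# topo = TopologyHead(latent_dim=64, num_topologies=N)
# best_topology = int(topo(z1, z2).argmax(-1))    # initialize the new chimera with this topology
```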
Preferably, step S7 specifically includes:
Pass the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1^(0) and h_2^(0) of the two atoms to be connected. Based on these two initial feature vectors, the feature vector of each node i of the linker molecule to be generated is initialized as a weighted combination of h_1^(0) and h_2^(0), with weights determined by ε_i1, the shortest distance from node i to connection point 1, and ε_i2, the shortest distance from node i to connection point 2.
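A sketch of this node initialization; weighting each endpoint feature by the node's distance to the opposite endpoint is an assumption, since the exact formula appears only as an image in the original:

```python
import torch

def init_linker_node_features(h1_0, h2_0, dist_to_1, dist_to_2):
    """Initialize linker node features from the two attachment-point features.

    h1_0, h2_0: (d,) initial features of connection points 1 and 2;
    dist_to_1, dist_to_2: (n,) shortest-path distances of each linker node to each connection point.
    """
    d1, d2 = dist_to_1.float(), dist_to_2.float()
    w1 = d2 / (d1 + d2)          # nodes closer to point 1 lean more on h1_0 (assumed scheme)
    w2 = d1 / (d1 + d2)
    return w1.unsqueeze(-1) * h1_0 + w2.unsqueeze(-1) * h2_0     # (n, d) initial node features
```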
Preferably, step S8 specifically includes:
The specific type of each node is predicted in breadth-first order. Given the feature vector h_i of a node, a fully connected layer outputs an m-dimensional vector that is normalized by softmax into a probability distribution over the node types. After the specific type of a node has been predicted, its feature vector is self-updated once by linearly mixing h_i^(old), the feature vector of the node before the prediction, with x_i, the one-hot code of the predicted node type, using a hyperparameter μ between 0 and 1.

After a new node type has been predicted, the network predicts which two atoms of the two nodes should be bonded. When predicting which atom of node 1 should be connected, the same fully connected layer is applied to every atom of node 1, taking the feature vector of that atom (the feature vector of every atom is set in advance) together with the feature vector of node 2 as input and outputting the probability that this atom of node 1 is the one being connected; the atom of node 2 is predicted symmetrically.

Each time a new node type is predicted, the feature vector of every adjacent node that has not yet been predicted is updated once by linearly mixing the feature vector of the newly connected node (which has already been self-updated) with the original feature vector of the neighbor node, using a hyperparameter λ between 0 and 1.
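A condensed sketch of the breadth-first decoding loop described above; the learned type embedding standing in for the one-hot type code and the mixing directions of μ and λ are assumptions:

```python
import torch
import torch.nn as nn
from collections import deque

class NodeDecoder(nn.Module):
    """Breadth-first node-type decoder; an inference-time sketch (in-place updates skip autograd)."""
    def __init__(self, d, num_node_types, mu=0.5, lam=0.5):
        super().__init__()
        self.type_fc = nn.Linear(d, num_node_types)        # m-way node-type head, softmax-normalized
        self.type_embed = nn.Embedding(num_node_types, d)  # learned stand-in for the one-hot type code
        self.mu, self.lam = mu, lam

    @torch.no_grad()
    def decode(self, node_feats, adjacency, root):
        """node_feats: (n, d) initialized node features; adjacency: list of neighbor lists; root: index."""
        queue, visited, types = deque([root]), {root}, {}
        while queue:
            i = queue.popleft()
            probs = torch.softmax(self.type_fc(node_feats[i]), dim=-1)
            types[i] = int(probs.argmax())
            # self-update: mix the pre-prediction feature with the code of the predicted type
            node_feats[i] = self.mu * node_feats[i] + (1 - self.mu) * self.type_embed.weight[types[i]]
            for j in adjacency[i]:
                if j not in visited:
                    # propagate the newly predicted node's information to unpredicted neighbors
                    node_feats[j] = self.lam * node_feats[j] + (1 - self.lam) * node_feats[i]
                    visited.add(j)
                    queue.append(j)
        return types
```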
According to the above technical scheme, compared with the prior art, the invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder. Given a targeting chimera molecule for which a new linker is to be designed, the linker design network extracts the relevant information from the input molecule to predict the topology of the linker to be designed; this topology is a tree representation of the linker molecule, and on the basis of the predicted tree the nodes and edges are predicted one by one through message passing, thereby obtaining novel linker molecules.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flow diagram of the method provided by the present invention.
FIG. 2 is a schematic flow diagram of the dataset preparation provided by the present invention.
FIG. 3 is a schematic diagram of the molecular cutting and the resulting triplets provided by the present invention.
FIG. 4 is a schematic diagram of the network structure provided by the present invention.
FIG. 5 is a schematic diagram of the generation of a new linker provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder, which comprises the following steps:
S1, obtaining a set of small molecules and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, processing the molecule set with a molecular cutting algorithm to obtain a training set, wherein each training sample comprises ligand 1, ligand 2 and the chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain feature vectors for ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain latent vector 1, latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, outputting the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the most probable linker topology;
S7, feeding the two latent vectors into fully connected layer 2 and, taking the corresponding connection point as the root node in each case, initializing the feature of every node of the predicted linker topology by breadth-first weighted propagation;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with the graph neural network each time a node is generated, until all linker nodes have been generated, at which point training of the network is complete;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, outputting the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and having the decoder output a new chimera structure.
To further optimize the above technical solution, step S2 specifically includes:
Mark all acyclic single bonds that are not bonded to a hydrogen atom, forming a set of cleavable single bonds of size n, and traverse all n(n-1)/2 possible pairwise combinations of them. Cleaving the two single bonds of a combination yields a ligand-linker-ligand triplet, i.e. one molecule yields n(n-1)/2 triplets. For each triplet, compute the average drug-likeness of the two ligands as the drug-likeness of the triplet and the average synthetic accessibility of the linker and the two ligands as the synthetic accessibility of the triplet, and compute the rationality score of the cut as:

Score = μ·S_d + (1 - μ)·S_A

where S_d denotes the drug-likeness score, S_A denotes the synthetic-accessibility score, and μ is a hyperparameter balancing the two scores, set to 0.6. After the rationality scores of the triplets produced by all cutting schemes of each molecule have been computed, the highest-scoring portion is taken and added to the training set.
To further optimize the above technical solution, step S3 specifically includes:
Each atom and each bond is represented by a one-hot code. There are three bond types in total (single, double and triple), so the one-hot code of each edge (bond) is a three-dimensional vector; for a single bond the first element is 1 and the other two elements are 0. Starting from these initial feature vectors, the information of every node is updated iteratively. For the N atoms of ligand 1, the feature vector of each atom u is updated by combining it with a weighted sum of the feature vectors h_v of the atoms bonded to it, where the weight of each neighbor's feature vector is a number between 0 and 1 obtained by passing e_uv, the feature vector of the bond between atom u and atom v, through a neural network layer nn(·), and η is a hyperparameter of the update.

After each atom has been updated a fixed number of times in this way, the feature vectors of all atoms are fused into the atom to be connected by a weighted sum

h = Σ_i w_i·h_i

where h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, and each weight w_i is determined by ε_i, the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
To further optimize the above technical solution, step S4 specifically includes:
Set the dimensionality of the latent vector and the specific distribution that the latent space obeys. With the vector dimensionality fixed, each molecule has two distribution-parameter vectors of that dimensionality, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:

μ = nn_μ(h)
σ = nn_σ(h)

The corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector of the set dimensionality is sampled from a standard multivariate normal distribution of the same dimensionality, multiplied element-wise by the predicted variance vector and added element-wise to the mean vector, giving the latent vector of the molecule:

z = μ + σ·ε.
To further optimize the above technical solution, step S6 specifically includes:
Count the number of distinct topologies of all chimeras appearing in the training set and denote it N; the fully connected layer then takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each dimension is the predicted probability of one topology.
To further optimize the above technical solution, step S7 specifically includes:
Pass the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1^(0) and h_2^(0) of the two atoms to be connected. Based on these two initial feature vectors, the feature vector of each node i of the linker molecule to be generated is initialized as a weighted combination of h_1^(0) and h_2^(0), with weights determined by ε_i1, the shortest distance from node i to connection point 1, and ε_i2, the shortest distance from node i to connection point 2.
To further optimize the above technical solution, step S8 specifically includes:
The specific type of each node is predicted in breadth-first order. Given the feature vector h_i of a node, a fully connected layer outputs an m-dimensional vector that is normalized by softmax into a probability distribution over the node types. After the specific type of a node has been predicted, its feature vector is self-updated once by linearly mixing h_i^(old), the feature vector of the node before the prediction, with x_i, the one-hot code of the predicted node type, using a hyperparameter μ between 0 and 1.

After a new node type has been predicted, the network predicts which two atoms of the two nodes should be bonded. When predicting which atom of node 1 should be connected, the same fully connected layer is applied to every atom of node 1, taking the feature vector of that atom (the feature vector of every atom is set in advance) together with the feature vector of node 2 as input and outputting the probability that this atom of node 1 is the one being connected; the atom of node 2 is predicted symmetrically.

Each time a new node type is predicted, the feature vector of every adjacent node that has not yet been predicted is updated once by linearly mixing the feature vector of the newly connected node (which has already been self-updated) with the original feature vector of the neighbor node, using a hyperparameter λ between 0 and 1.
Download a subset of small molecules from a small-molecule dataset website (ChEMBL, ZINC);
compute the Lipinski rule-of-five properties, toxic groups and synthetic accessibility of each molecule, and perform a primary screen to obtain a subset with strong drug-likeness;
generate the training set with a molecular cutting algorithm, where the input of each training sample is the two ligands and the label is the whole original molecule, i.e. the two ligands plus the linker; the cutting uses the molecular cutting algorithm proposed by Hussain, so that the cut triplets retain as many of the effective groups of the original molecule as possible;
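A simplified RDKit sketch of the double-cut enumeration; it does not reproduce Hussain's algorithm exactly (for example, it does not identify which fragment is the linker or track attachment points):

```python
from itertools import combinations
from rdkit import Chem

def enumerate_triplets(smiles):
    """Yield three-fragment SMILES splits obtained from every pair of cleavable single bonds."""
    mol = Chem.MolFromSmiles(smiles)
    cuttable = [b.GetIdx() for b in mol.GetBonds()
                if b.GetBondType() == Chem.BondType.SINGLE and not b.IsInRing()]
    for b1, b2 in combinations(cuttable, 2):
        frag_mol = Chem.FragmentOnBonds(mol, [b1, b2], addDummies=True)
        frags = Chem.GetMolFrags(frag_mol, asMols=True)
        if len(frags) == 3:                       # a ligand-linker-ligand style split
            yield tuple(Chem.MolToSmiles(f) for f in frags)
```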
the whole network mainly consists of an encoder and a decoder.
(1) Encoder for encoding a video signal
Respectively transmitting information of the ligand 1, the ligand 2 and the original molecule by using a graph neural network to obtain a characteristic vector of each node;
the characteristics of the connection point 1 and the connection point 2 are characteristic vectors of nodes needing to be connected in the two ligands, and the characteristics of the chimera are the average value of the characteristic vectors of all the nodes of the chimera;
the three characteristics are respectively processed by a full connection layer to obtain parameters of given distribution, Gaussian distribution is adopted to obtain a mean value and a variance, and sampling is carried out on the obtained corresponding Gaussian distribution to obtain a hidden space vector 1, a vector 2 and a conditional hidden space vector;
(2) decoder
Splicing the conditional implicit space vectors obtained by the decoder with the implicit space vectors of the two connection points respectively to obtain two decoding-wearing implicit space vectors;
respectively transmitting the two hidden space vectors into the same full-connection layer, respectively outputting the prediction probability of a topological structure of a connection body, and initializing a new chimera by taking the topological structure with the highest probability;
respectively transmitting the two hidden space vectors into the same full-connection layer (different from the full-connection layer), respectively outputting the characteristics of each node by taking the corresponding connection point as a root node and weighting and initializing the predicted connection structure by breadth first;
randomly selecting a connection point as a root node, predicting the node type of each node with breadth first, updating node information once by using a graph neural network when a node is newly generated until all nodes of a connector are generated;
inputting two ligands with connection (a connection point needs to be given) and a pseudo-chimera, wherein the structure of a connecting body in the pseudo-chimera is considered to be given;
the encoder outputs the implicit space vectors and the conditional implicit space vectors of the two ligand connection points as the input of the decoder;
the decoder outputs a new mosaic structure;
loss function
Loss function of network is based on reconstruction loss LreconAnd distribution loss LKLThe two parts are as follows:
L=LreconKLLKL
wherein λKLIs a manually set hyper-parameter.
Reconstruction loss
The reconstruction loss consists of three parts: the linker topology prediction loss, the node prediction loss and the edge prediction loss:

L_recon = -Σ_{i=1}^{k} y_topo,i·log(ŷ_topo,i) - Σ_{j=1}^{n} Σ_{i=1}^{m} y_node,j,i·log(ŷ_node,j,i) - Σ_{j=1}^{n+1} Σ_{i=1}^{3} y_edge,j,i·log(ŷ_edge,j,i)

The first term is the linker topology prediction loss, where k is the total number of topologies; y_topo,i is the ground-truth probability that the linker has the i-th topology, i.e. 1 if the linker belongs to the i-th class and 0 otherwise, and ŷ_topo,i is the probability predicted by the network that the linker has the i-th topology.

The middle term is the node type prediction loss, where m is the total number of node types in the training set and n is the total number of nodes in the predicted topology; y_node,j,i is the ground-truth probability that the j-th node of the linker is of the i-th node type, i.e. 1 if the node belongs to the i-th class and 0 otherwise, and ŷ_node,j,i is the probability predicted by the network that the j-th node belongs to the i-th class.

The last term is the edge type prediction loss; the "3" indicates that there are three edge types in total: single, double and triple bonds. After each node is generated, the type of the edge connecting the existing structure to the new node must be predicted, and after all nodes have been predicted, the type of the edge connecting to the other ligand must also be predicted, so n + 1 edges are predicted in total. Likewise, y_edge,j,i is the ground-truth probability that the j-th edge is of the i-th edge type, and ŷ_edge,j,i is the probability predicted by the network that the j-th edge belongs to the i-th class.
Distribution loss
The distribution loss uses the classical KL-divergence loss:

L_KL = -(1/2)·Σ_{j=1}^{J} (1 + log σ_j² - μ_j² - σ_j²)

where J is a manually chosen hyperparameter giving the dimensionality of the latent space, each dimension corresponding to a Gaussian distribution, and μ_j and σ_j represent the mean and the standard deviation (σ_j² the variance) of the Gaussian distribution predicted by the network in dimension j. This loss term encourages the encoder to map the input to a standard normal distribution, allowing the decoder to generate more novel linkers by sampling from that Gaussian distribution.
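A PyTorch sketch of the combined loss, assuming the network outputs raw logits and the targets are given as class indices so that cross_entropy implements the one-hot sums above; the default λ_KL value here is arbitrary:

```python
import torch
import torch.nn.functional as F

def chimera_loss(topo_logits, topo_target,
                 node_logits, node_targets,
                 edge_logits, edge_targets,
                 mu, logvar, lambda_kl=0.1):
    """topo_logits: (k,); node_logits: (n, m); edge_logits: (n+1, 3); targets are class indices."""
    l_topo = F.cross_entropy(topo_logits.unsqueeze(0), topo_target.view(1))
    l_node = F.cross_entropy(node_logits, node_targets)
    l_edge = F.cross_entropy(edge_logits, edge_targets)
    l_recon = l_topo + l_node + l_edge
    # KL divergence between N(mu, sigma^2) and the standard normal, summed over latent dimensions
    l_kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return l_recon + lambda_kl * l_kl
```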
The advantages of the application are:
New linkers are generated starting from the topology of the linker. The influence of the linker on the two ligands is mainly the indirect influence of its topology, so generating new linkers from this starting point is more instructive for concrete application scenarios, yet neither of the two existing linker design approaches considers this point;
the method does not require the 3D structural information of a large number of molecules in the training set, and therefore has broader applicability;
although the background of the method is the design of chimera linkers, it can easily be generalized to applications such as scaffold hopping and lead optimization, and is therefore rich in functionality.
the embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A linker design method for a protein degradation targeting chimera based on an autoencoder, characterized by comprising the following steps:
S1, obtaining a set of small molecules and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, processing the molecule set with a molecular cutting algorithm to obtain a training set, wherein each training sample comprises ligand 1, ligand 2 and the chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain feature vectors for ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain latent vector 1, latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, outputting the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the most probable linker topology;
S7, feeding the two latent vectors into fully connected layer 2 and, taking the corresponding connection point as the root node in each case, initializing the feature of every node of the predicted linker topology by breadth-first weighted propagation;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with the graph neural network each time a node is generated, until all linker nodes have been generated, at which point training of the network is complete;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, outputting the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and having the decoder output a new chimera structure.
2. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S2 specifically includes:
marking all acyclic single bonds that are not bonded to a hydrogen atom, forming a set of cleavable single bonds of size n, and traversing all n(n-1)/2 possible pairwise combinations of them; cleaving the two single bonds of a combination yields a ligand-linker-ligand triplet, i.e. one molecule yields n(n-1)/2 triplets; for each triplet, computing the average drug-likeness of the two ligands as the drug-likeness of the triplet and the average synthetic accessibility of the linker and the two ligands as the synthetic accessibility of the triplet, and computing the rationality score of the cut as:

Score = μ·S_d + (1 - μ)·S_A

where S_d denotes the drug-likeness score, S_A denotes the synthetic-accessibility score, and μ is a hyperparameter balancing the two scores, set to 0.6; after the rationality scores of the triplets produced by all cutting schemes of each molecule have been computed, the highest-scoring portion is taken and added to the training set.
3. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S3 specifically includes:
representing each atom and each bond by a one-hot code, there being three bond types in total (single, double and triple), so that the one-hot code of each edge (bond) is a three-dimensional vector, with the first element 1 and the other two elements 0 for a single bond; iteratively updating the information of every node starting from these initial feature vectors; for the N atoms of ligand 1, updating the feature vector of each atom u by combining it with a weighted sum of the feature vectors h_v of the atoms bonded to it, where the weight of each neighbor's feature vector is a number between 0 and 1 obtained by passing e_uv, the feature vector of the bond between atom u and atom v, through a neural network layer nn(·), and η is a hyperparameter of the update;

after each atom has been updated a fixed number of times in this way, fusing the feature vectors of all atoms into the atom to be connected by a weighted sum

h = Σ_i w_i·h_i

where h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, and each weight w_i is determined by ε_i, the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
4. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S4 specifically includes:
setting the dimensionality of the latent vector and the specific distribution that the latent space obeys, so that each molecule has two distribution-parameter vectors of the set dimensionality, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:

μ = nn_μ(h)
σ = nn_σ(h)

obtaining the corresponding latent feature vector by sampling from the Gaussian distribution determined by these parameters: a vector of the set dimensionality is sampled from a standard multivariate normal distribution of the same dimensionality, multiplied element-wise by the predicted variance vector and added element-wise to the mean vector, giving the latent vector of the molecule:

z = μ + σ·ε.
5. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S6 specifically includes:
counting the number of distinct topologies of all chimeras appearing in the training set and denoting it N, so that the fully connected layer takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each dimension is the predicted probability of one topology.
6. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S7 specifically includes:
passing the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1^(0) and h_2^(0) of the two atoms to be connected; based on these two initial feature vectors, initializing the feature vector of each node i of the linker molecule to be generated as a weighted combination of h_1^(0) and h_2^(0), with weights determined by ε_i1, the shortest distance from node i to connection point 1, and ε_i2, the shortest distance from node i to connection point 2.
7. The method for designing the linker of the autoencoder-based protein degradation targeting chimera according to claim 1, wherein step S8 specifically includes:
predicting the specific type of each node in breadth-first order: given the feature vector h_i of a node, a fully connected layer outputs an m-dimensional vector that is normalized by softmax into a probability distribution over the node types; after the specific type of a node has been predicted, its feature vector is self-updated once by linearly mixing h_i^(old), the feature vector of the node before the prediction, with x_i, the one-hot code of the predicted node type, using a hyperparameter μ between 0 and 1;

after a new node type has been predicted, predicting which two atoms of the two nodes should be bonded: when predicting which atom of node 1 should be connected, the same fully connected layer is applied to every atom of node 1, taking the feature vector of that atom (the feature vector of every atom is set in advance) together with the feature vector of node 2 as input and outputting the probability that this atom of node 1 is the one being connected, and the atom of node 2 is predicted symmetrically;

each time a new node type is predicted, updating the feature vector of every adjacent node that has not yet been predicted once by linearly mixing the feature vector of the newly connected node (which has already been self-updated) with the original feature vector of the neighbor node, using a hyperparameter λ between 0 and 1.
CN202210002416.1A 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder Pending CN114420197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210002416.1A CN114420197A (en) 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210002416.1A CN114420197A (en) 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder

Publications (1)

Publication Number Publication Date
CN114420197A true CN114420197A (en) 2022-04-29

Family

ID=81271181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210002416.1A Pending CN114420197A (en) 2022-01-04 2022-01-04 Connector design method of protein degradation targeting chimera based on self-encoder

Country Status (1)

Country Link
CN (1) CN114420197A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312855A (en) * 2023-02-28 2023-06-23 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound
CN116312855B (en) * 2023-02-28 2023-09-08 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound
CN116597892A (en) * 2023-05-15 2023-08-15 之江实验室 Model training method and molecular structure information recommending method and device
CN116597892B (en) * 2023-05-15 2024-03-19 之江实验室 Model training method and molecular structure information recommending method and device

Similar Documents

Publication Publication Date Title
US11544530B2 (en) Self-attentive attributed network embedding
US11537898B2 (en) Generative structure-property inverse computational co-design of materials
CN103049792B (en) Deep-neural-network distinguish pre-training
CN107103331B (en) Image fusion method based on deep learning
CN114420197A (en) Connector design method of protein degradation targeting chimera based on self-encoder
CN111949865A (en) Interest point recommendation method based on graph neural network and user long-term and short-term preference
US20060212279A1 (en) Methods for efficient solution set optimization
Yu et al. Evolutionary fuzzy neural networks for hybrid financial prediction
EP1818862A1 (en) Combining model-based and genetics-based offspring generation for multi-objective optimization using a convergence criterion
CN113780002A (en) Knowledge reasoning method and device based on graph representation learning and deep reinforcement learning
CN113792110A (en) Equipment trust value evaluation method based on social networking services
CN115659807A (en) Method for predicting talent performance based on Bayesian optimization model fusion algorithm
CN114999565A (en) Drug target affinity prediction method based on representation learning and graph neural network
Li et al. Foldingzero: Protein folding from scratch in hydrophobic-polar model
CN113205181A (en) Graph combination optimization problem solving method based on deep graph learning
Zhou Explainable ai in request-for-quote
CN111126607B (en) Data processing method, device and system for model training
Lin et al. An efficient hybrid Taguchi-genetic algorithm for protein folding simulation
CN106355000A (en) Scaffolding method based on statistical characteristic of double-end insert size
CN116564555A (en) Drug interaction prediction model construction method based on deep memory interaction
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
EP4170556A1 (en) Control device, method, and program
CN110119779B (en) Cross-network data arbitrary dimension fusion method and device based on self-encoder
CN109858520B (en) Multi-layer semi-supervised classification method
Wong et al. Rainfall prediction using neural fuzzy technique

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination