WO2023016621A1 - Ternary complex determination for plausible targeted protein degradation using deep learning and design of degrader molecules using deep learning - Google Patents
Ternary complex determination for plausible targeted protein degradation using deep learning and design of degrader molecules using deep learning
- Publication number
- WO2023016621A1 (application PCT/EP2021/025372)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- protein
- computer implemented
- deep
- implemented method
- degrader
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- The core of the present invention is a new method for the determination of a degrader molecule and the associated ternary complex by the use of machine learning modules tackling the various requirements of ternary complex determination.
- the method according to the invention also allows the determination of the ternary complex formed by a pre-designed, e.g., human-designed degrader molecule, thus serving as an in-silico tool to validate manually designed degraders.
- the method comprises the following four major steps:
- Step 1 3D structure determination of relevant proteins (E3 ligase and the protein of interest).
- Step 2 Determination of the interactions between each fragment of the degrader and the corresponding proteins as well as identification of the corresponding interaction sites using module "Deep Interaction Prediction" DIP.
- Step 3 Protein-Protein complex prediction using modules "Bayesian Optimization" BO, "Deep Linker Generation", "Deep Molecular Conformation Generation", and "Deep Graph Representation Learning".
- Step 4 Refinement of the ternary complex, with the designed linker.
- Deep Interaction Prediction DIP is used for converting the geometry of the protein molecule and degrader fragments into a graph and applying deep learning techniques to this graph to determine properties such as the protein-fragment and protein-protein interactions (used in Steps 2 and 3 above).
- Deep Molecular Conformation Generation DMCG is used for generating large ensembles of energetically stable (low energy) 3D conformations of the degrader molecule (used in Step 3 above).
- the methodology for determining a ternary complex includes the following steps, which are briefly described in the subsections below. For more details regarding the Bayesian Optimization loop and the three deep learning modules, see the section on our modules.
- the value chain for designing a degrader molecule starts with an amino acid sequence or protein structure that acts as a potential target for a degrader molecule.
- the method according to the invention starts from such information.
- the 3D structure is determined via in-house models that are inspired by open-source frameworks such as AlphaFold and RosettaFold for proteins or RDKit in the case of fragments.
- homology modeling can be used.
- the direct use of experimentally determined 3D structures as an input to the pipeline is possible. This step outputs 3D structures not only of the proteins of interest but also of the E3 ligases.
- the computation of the protein-protein interactions and the resulting complex formed is the deciding factor in solving the problem of ternary complex determination. This is because the protein-protein interaction is the primary interaction stabilizing the ternary complex.
- an iterative optimization process with active learning and Bayesian Optimization is applied, which uses the constraints imposed by the linker design to determine the structure of the protein-protein complex.
- a fitness function for each candidate protein-protein structure is acquired, which is computed with the help of the following modules.
- Module Deep Linker Generation: Generative models are used to predict whether a valid linker can be generated to connect the fragments as bound in this protein-protein complex.
- the model takes into account the relative position and orientations of the degrader fragments as well as pharmacological constraints to design a valid linker. This makes it possible to discard protein-protein complexes for which the bound degrader fragments cannot be linked by a valid linker structure.
- Module Deep Molecular Conformation Generation: this method allows a potentially large dataset of conformations (> 100000) to be generated efficiently.
- This conformation generation is used to score the linkers generated by Deep Linker Generation above. Additionally, when dealing with a pre-designed degrader, by analyzing a large dataset of generated conformations, the probability of a valid degrader conformation within a particular protein-protein complex candidate can be determined. This gives an additional score that allows non-viable protein-protein complex candidates to be filtered out.
- the use of the deep-learning modules for protein-protein interactions, linker generation and molecular conformation generation means that the space of interactions in the ternary complex can be screened while avoiding expensive docking and molecular dynamics simulations.
- a Monte Carlo based method is used to pack the designed linker in the complexes and perform energy minimization.
- Candidates for this include AMBER and MERCK force fields for the degrader molecule and PyRosetta for the proteins and ternary complexes. Then clustering techniques are used to choose the complexes with the best energy and consensus from possible ternary complexes.
- the goal of the pipeline is the determination of ternary complex structures consisting of the proteins of interest, the degrader and the E3 ligase. This in turn involves modeling the interactions between proteins, i.e., the proteins of interest and the E3 ligase, as well as between proteins and the degrader. Typical methods to achieve this apply particularly expensive docking operations.
- a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related".
- the objects correspond to mathematical abstractions called vertices (also called nodes or points) and each of the related pairs of vertices is called an edge (also called link or line).
- a graph is depicted in diagrammatic form as a set of dots or circles for the vertices, joined by lines or curves for the edges.
- molecules are represented as graphs through their point clouds and chemo-geometric features, and this representation is processed using deep graph representation learning DGRL network architectures.
- the final deep learning architecture leverages the fact that all the nodes in a certain neighborhood of a node share common properties with that node (in the real world but also in their graph representation). These properties, which are expressed with edges, can be "summarized" with the help of weight sharing. That is the reason why the main layer components of the neural network used are convolutional layers.
- Cluster-GCN (Chiang, et al., 2019) is used.
- This convolutional layer architecture not only demonstrates superior performance on similar molecular datasets, but also reduces the memory and time complexity by a large margin. This fact is of considerable importance because the network has to be fast during runtime.
- the subsequent layer is GraphConv (Morris, et al., 2019).
- This convolutional layer architecture proved useful not only because of its self-supervised representation-learning capabilities, which allow it to exploit atom-level complexities, their geometries and all of the interactions between the atoms, but also because of its efficiency in computing the graph convolutions, which is again important during runtime.
- the code that encompasses these two main layers is PyTorch-Geometric code (Fey & Lenssen, 2019) that performs the standard batching and pooling and glues these layers together, so that a progressively lower-dimensional representation is reached until the final prediction of the score function is made.
- the first step in the deep graph representation learning DGRL pipeline is to map the initial 3D structures of proteins and fragments to a suitable representation that respects the chemo-geometric properties of the biomolecules involved. Subsequently, deep graph representation learning DGRL methods are applied to model the respective interactions.
- a graph consists of nodes and edges, i.e., atoms and their connections.
- the graph that describes a degrader fragment is constructed using either k-nearest-neighbor or ball queries. In a k-nearest-neighbor graph, each node (e.g., an atom) is connected to its k nearest neighboring constituents. Ball query graphs are constructed by specifying a cutoff distance: if the distance between two constituents lies below this threshold value, the algorithm is allowed to place an edge.
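The two graph construction schemes can be illustrated with a minimal NumPy/SciPy sketch; the toy coordinates, the value of k and the cutoff radius are placeholders rather than values prescribed here:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_edges(coords: np.ndarray, k: int = 4) -> np.ndarray:
    """Connect every constituent (e.g., atom) to its k nearest neighbours."""
    tree = cKDTree(coords)
    # query k+1 neighbours because the nearest neighbour of a point is the point itself
    _, idx = tree.query(coords, k=k + 1)
    edges = [(i, j) for i, row in enumerate(idx) for j in row[1:]]
    return np.array(edges)            # shape (num_edges, 2)

def ball_query_edges(coords: np.ndarray, cutoff: float = 3.0) -> np.ndarray:
    """Place an edge whenever two constituents are closer than the cutoff distance."""
    tree = cKDTree(coords)
    pairs = tree.query_pairs(r=cutoff)
    return np.array(sorted(pairs))    # shape (num_edges, 2), undirected pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    atoms = rng.uniform(0.0, 10.0, size=(20, 3))   # toy 3D point cloud of 20 atoms
    print(knn_edges(atoms, k=3).shape, ball_query_edges(atoms, cutoff=3.0).shape)
```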
- the model computes a representation for the surface of the proteins on which the DGRL models operate. Suitable surface representations are given by surface meshes, which are computed by triangulation of the (virtual) protein surface, or surface point clouds. It is these points on this virtual surface that are connected to their neighboring points to form the relevant graph.
- 3D coordinates of the estimated protein surface, 3D atomic coordinates and their respective atom types and, lastly, the normal vectors which are estimated based on the local coordinate features are used as input for the estimation of the chemo-geometric features.
- the pipeline proceeds to generate embeddings of chemical and geometrical properties of the molecules. This assumes that a complete description of chemo-geometric features is needed to model protein-protein and protein-fragment interactions accurately.
- the procedure is straightforward. Due to the graph structure of small molecules, well-known deep graph representation learning DGRL strategies can be employed to learn embeddings of chemical information on the nodes of the graphs. To describe the 3D structures free of any bias from the center of mass and global rotations, the deep graph representation learning DGRL models depend only on inter-atomic distances and angles between constituents.
- the graph representation is of points on the surface mesh or surface point cloud, which do not correspond directly to the constituents of the protein.
- a graph is created where each surface point is connected to the k atoms of the protein that are closest (by Euclidean distance) to it.
- the chemical information associated with the atoms is processed and, by the use of deep graph representation learning DGRL methods, representations of this information are embedded onto the surface points. More concretely, different convolutional and attention layers are leveraged to learn a low dimensional representation of the chemical information. This learning is not only based upon the 3D coordinates of atoms and the atom types; some chemical information is also generated explicitly and fed into the deep graph representation learning DGRL module. More concretely, this information consists of angles between atoms, interatomic distances, hydrophobicity and hydrogen bond potential. It has been observed that providing some explicit information lets the network learn the hidden features that are not provided explicitly.
- Figure 3 shows the main deep graph representation learning DGRL pipeline: In this block, all the pre-processing steps are combined in one final model with various convolutional layers. These layers mainly consist of GraphConv (Morris, et al., 2019) and ClusterGCNConv (Chiang, et al., 2019) and were constructed with a manual hyperparameter search to minimize the loss and achieve the best ROC-AUC score possible for the classification.
- the necessary pre-processing of the 3D structures as well as the necessary chemical and geometrical representations of the protein surfaces is already accomplished.
- it is then possible to proceed to learn, with the help of geometric deep learning, which surface regions are the interaction sites.
- the process of achieving the interaction site classification can be divided into two parts. The first one, as noted above, is done with suitable chemo-geometric features, where the best low-dimensional representation has been learned.
- the subsequent step is applying the main deep graph representation learning DGRL pipeline on these features so that the classification can eventually be performed.
- Figure 4 shows a Deep Interaction Prediction module: Taking in inputs in the form of atomic 3D coordinates and atom types, this information is used to estimate the protein surface. For the calculation of protein surfaces, standard algorithms for the conversion of point cloud representations into meshes, e.g., Points2Surf (Erler, et al., 2020) and Delaunay triangulation, are used. After calculating the protein surface and selecting patches, the patches are forwarded together with the atomic coordinates and the atoms into a pipeline to generate geometric, chemical, and local coordinate features.
- this information is forwarded in form of graph representations into a deep learning pipeline with multiple convolutional layers that ought to learn deep relationships and rotational invariance of the protein surfaces in question.
- the main components are GraphConv (Morris, et al., 2019) and ClusterGCNConv (Chiang, et al., 2019) layers, which are combined to perform the binary classification indicating if the surface in question is a potential interaction site
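A rough PyTorch-Geometric sketch of such a stack of ClusterGCNConv and GraphConv layers with a per-surface-point binary read-out is given below; the layer widths, depth and toy inputs are illustrative assumptions, not the exact architecture described here:

```python
import torch
from torch import nn
from torch_geometric.nn import ClusterGCNConv, GraphConv

class InteractionSiteClassifier(nn.Module):
    """Per-node binary classification: is this surface point a potential interaction site?"""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.conv1 = ClusterGCNConv(in_dim, hidden)
        self.conv2 = GraphConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        return torch.sigmoid(self.head(x)).squeeze(-1)   # interaction probability per node

# toy usage: 100 surface points with 16 chemo-geometric features each
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))              # random surface-graph connectivity
model = InteractionSiteClassifier(in_dim=16)
site_probability = model(x, edge_index)                   # shape (100,)
```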
- Figure 5 shows a Fragment-protein interaction module: Estimation and pre-processing for the protein component in this architecture is the same as for the interaction site prediction presented in Figure 4.
- the other constituent of the input pair i.e., the fragment, needs a different representation.
- the starting point is similar for the fragment, where atom coordinates and atom types are taken as input.
- the 3D structure of the fragment is mapped to a graph representation, which is capable of modelling interatomic relationships. This is achieved by using a combination of DimeNet [(Klicpera, et al., 2020a) and (Klicpera, et al., 2020b)] and explicit features that model interatomic relationships.
- the aim is the prediction of whether the protein and a fragment will interact. Again, the procedure is similar to the previous section, where the necessary pre-processing has been performed for the protein and the fragment in the ternary complex, i.e., representing them as the respective graphs and embedding in them the geometric and chemical features.
- This resultant graph embedding is processed by the main deep graph representation learning DGRL pipeline, where a binary label is predicted indicating whether the protein interacts with the ligand or not.
- a dataset of proteins and ligands interacting is used.
- the ground truth of whether the pair does in fact interact is used in order to train the deep graph representation learning DGRL pipeline to recognize what constitutes an interaction and what does not.
- the elaborated procedure may be considered as "fuzzy" docking, where no Root-Mean-Square-Deviation (RMSD) values are predicted as part of the inference, but rather a simple binary classification indicating if two proteins would interact or not.
- Figure 6 shows a protein-protein interaction module. This interaction is modelled similarly to the case of interaction site identification and fragment-protein interaction. To be precise, the pipeline that was used to determine the interaction site on a single protein (as shown in Figure 4) is applied to both proteins in parallel. To achieve the desired effect, the loss function is adjusted to make sure that the pipeline learns to model the interactions between proteins.
- the learned interaction is not quantified in terms of a continuous value like the RMSD, but rather by a binary classification indicating the interaction between the respective pair of proteins.
- a surrogate model (a Gaussian Process; see the surrogate model explanation below) is calculated to predict the combined-fitness using the scores obtained in the loop from step 2.
- An important fact here is that the surrogate function can report the uncertainty in its prediction.
- a new set of conformations/orientations is selected for which the surrogate model lacks knowledge, i.e., expresses high uncertainty, or predicts a high score. This tradeoff between exploitation and exploration is managed by an acquisition strategy as shown below.
- the surrogate model is a model that takes as input the representation (i.e., RRT + NMA coordinates) of a particular protein-protein complex candidate and predicts the associated combined-fitness. It is trained using the actual combined-fitness as data points.
- a Gaussian Process model is used, that can predict not only an estimate of the combined-fitness, but also give a reliable measure of the uncertainty in its estimate.
- the Kernel function used for the Gaussian Process is the well-known Matern Kernel that is modified to handle the relative translations and rotations. This specific kernel function is not essential to the advantage proposed by this patent and can be substituted with any valid alternative in the representation space.
- the acquisition strategy is a key aspect of a Bayesian Optimization BO loop and determines in what manner and to what extent exploration and exploitation are traded off.
- the fact that the surrogate model reports the uncertainty of its estimate is crucial here and allows to make principled decisions regarding this tradeoff.
- Several standard acquisition strategies may be used, for instance, noisy Expected Improvement, Upper Confidence Bound, and Knowledge Gradient. These strategies are implemented by use of the openly available BoTorch framework (Balandat, et al., 2020).
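A compact sketch of one such Bayesian Optimization step with the openly available BoTorch framework is given below; the seven-dimensional candidate encoding, the synthetic fitness values and the Upper Confidence Bound settings are illustrative assumptions rather than the exact configuration described here (note that BoTorch's SingleTaskGP uses a Matern kernel by default):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll            # recent BoTorch versions
from botorch.acquisition import UpperConfidenceBound
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# toy data: candidate complexes encoded as 7D vectors (3D translation + 4D quaternion),
# each paired with a stand-in combined-fitness value
train_X = torch.rand(20, 7, dtype=torch.double)
train_Y = torch.rand(20, 1, dtype=torch.double)

# surrogate model: Gaussian Process regression on the candidate representation
gp = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

# acquisition strategy: Upper Confidence Bound trades off exploration and exploitation
ucb = UpperConfidenceBound(gp, beta=2.0)
bounds = torch.stack([torch.zeros(7), torch.ones(7)]).double()
candidate, acq_value = optimize_acqf(ucb, bounds=bounds, q=1,
                                     num_restarts=5, raw_samples=64)
print("next complex candidate to evaluate:", candidate)
```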
- each complex candidate is represented by the relative rotation and translation (RRT) between the two constituent proteins, as well as a vector representation of the conformations of these proteins.
- the relative translation is represented by a 3D vector between the center of masses of the two proteins.
- the relative rotation is represented by a 4D normalized quaternion.
- Each candidate complex is sampled by picking a random RRT and conformation using an even distribution over the above representation space.
- a uniformly random direction is picked for translation with the distance exponentially distributed.
- the rotations are selected evenly at random.
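A minimal sketch of this sampling scheme (uniformly random direction, exponentially distributed distance, uniformly random rotation as a normalized quaternion) is given below; the scale of the exponential distribution is an illustrative assumption:

```python
import numpy as np

def sample_rrt(rng: np.random.Generator, distance_scale: float = 10.0):
    """Sample one relative rotation and translation (RRT) candidate."""
    # uniformly random direction: normalize an isotropic Gaussian sample
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    # exponentially distributed distance between the two centres of mass
    distance = rng.exponential(scale=distance_scale)
    translation = distance * direction                 # 3D relative translation
    # uniformly random rotation: normalize a 4D Gaussian sample to a unit quaternion
    quaternion = rng.normal(size=4)
    quaternion /= np.linalg.norm(quaternion)           # 4D normalized quaternion
    return translation, quaternion

rng = np.random.default_rng(42)
candidates = [sample_rrt(rng) for _ in range(5)]
```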
- the conformations of the constituent proteins are represented by Normal Mode Analysis (NMA) coordinates.
- Convolutional neural networks are applied, which operate on graph representations of the protein molecules to predict the score. These representations account for the geometric and chemical properties in order to predict features that are subsequently processed to eventually yield a measure of the interaction strength.
- a Deep Linker Generation model is used, which takes as input the coordinates of the fragments, as bound to the respective proteins in their respective positions and orientations, and thereby the fragments' relative orientation (RRT). The model then generates a linker that joins the two fragments. This linker is then scored on the basis of any number of pharmacological constraints such as toxicity and drug-likeness. Additionally, through the use of the Deep Molecular Conformation Generation module, the geometric viability of the linker is determined. Together, this provides the constraint-fitness.
- a deep learning-based approach (Deep Molecular Conformation Generation) is used to generate a large dataset (> 100000 datapoints) of energetically stable (low energy) degrader conformations, including the two fragments and the linker.
- Each generated degrader conformation is characterized by the relative rotation and translation (RRT) between its two fragments, and the distribution of valid conformations over the RRT space is learned. For instance, one may fit a mixture of Gaussians using expectation maximization. Hence, given the RRT of the two proteins, since the binding pocket for each of the degrader fragments is known, the RRT between the degrader fragments can be computed. The learned distribution function can be used to compute the constraint score.
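The suggested mixture-of-Gaussians fit over the RRT space could be sketched as follows, using scikit-learn's expectation-maximization implementation; the 7-dimensional RRT encoding (3D translation plus 4D quaternion), the random stand-in data and the number of components are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# stand-in for the RRTs of the generated low-energy degrader conformations,
# each encoded as a 7D vector (3D relative translation + 4D quaternion)
rng = np.random.default_rng(0)
conformation_rrts = rng.normal(size=(5000, 7))

# fit the distribution of valid conformations over the RRT space via EM
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(conformation_rrts)

def constraint_score(rrt_between_fragments: np.ndarray) -> float:
    """Log-likelihood of a fragment-fragment RRT under the learned distribution."""
    return float(gmm.score_samples(rrt_between_fragments.reshape(1, -1))[0])

print(constraint_score(rng.normal(size=7)))
```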
- the combined-fitness can be any function of the PPI-fitness and the constraint-fitness that mimics a logical AND operation. This means that if either of the fitness scores indicates a particularly unfit protein-protein complex candidate, the combined fitness must be low. For instance, if the PPI-fitness and the constraint-fitness are normalized to lie between 0 and 1, the product of these fitness scores would be a valid combined-fitness.
- One of the key considerations in ternary complex determination is the stability and validity of the degrader molecule itself. In the Bayesian Optimization BO protocol, this is specified through the constraint-fitness. As previously described, one of the methods to achieve it is to analyze the dataset of stable (low energy) conformations of the degrader molecule. A method that can generate a large number (> 100000) of conformations of a large molecule such as a degrader, which can have more than 60 atoms is needed.
- the problem of molecular conformation generation i.e., predicting an ensemble of low energy 3D conformations based on a molecular graph, is traditionally treated with either stochastic or systematic approaches.
- the former is based on molecular dynamics (MD) simulations or Markov Chain Monte Carlo (MCMC) techniques.
- Stochastic algorithms can be accurate but are difficult to scale to larger molecules (e.g., proteins) as the runtime becomes prohibitive.
- systematic (rule-based) methods are based on careful selection and design of torsion templates (torsion rules), and knowledge of rigid 3D fragments. These methods can be fast and generate conformations in the order of seconds. However, their prediction might become inaccurate for larger molecules, or molecules that are not subject to any of these rules (torsion rules). Therefore, systematic methods are fast, but they may not generalize.
- an end-to-end trainable machine learning model that can handle and generate conformations is preferred.
- it models conformations in a SE(3) invariant manner, which means that the likelihood of a particular conformation is unaffected by rigid translation and rotation operations.
- This is a desirable inductive bias for molecular generation tasks, as molecules do not change if the entire molecule is translated or rotated.
- This model is based on a recently proposed machine learning technique, i.e., score-based generative models. The score is the gradient of the log density of the data distribution with respect to the data.
- the score of the data distribution can be considered as a vector (gradient field) that guides the molecule towards stable (low energy) conformations as shown in Figure 8.
- annealed Langevin dynamics can be leveraged to create an ensemble of stable conformations within a short amount of time. It is also possible to fix some parts of the molecules (two fragments) and apply the gradient (score) on other parts of the molecule (e.g., linker) to generate constrained conformations. Using the ensembles of generated conformations, a function can be learned, that predicts the likelihood of an energetically stable linker for a particular relative position and orientation of the fragments.
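A minimal sketch of such constrained, annealed Langevin sampling is shown below; `score_net` stands for a trained score network (here replaced by a dummy), and the noise levels, step sizes and atom counts are illustrative assumptions:

```python
import torch

def annealed_langevin(score_net, coords, linker_mask, sigmas, steps_per_sigma=20, eps=1e-4):
    """Annealed Langevin dynamics over atom coordinates.

    coords:      (N, 3) initial random coordinates of the degrader
    linker_mask: (N, 1) 1.0 for linker atoms that may move, 0.0 for fixed fragment atoms
    sigmas:      decreasing sequence of noise levels
    """
    x = coords.clone()
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_sigma):
            noise = torch.randn_like(x)
            grad = score_net(x, sigma)                        # estimated score, shape (N, 3)
            update = 0.5 * step * grad + torch.sqrt(torch.tensor(step)) * noise
            x = x + linker_mask * update                      # only linker atoms move
    return x

# toy usage with a dummy score "network" pulling atoms towards the origin
dummy_score = lambda x, sigma: -x
coords = torch.randn(60, 3)
mask = torch.cat([torch.zeros(20, 1), torch.ones(20, 1), torch.zeros(20, 1)])  # fragments fixed
stable_conformation = annealed_langevin(dummy_score, coords, mask, sigmas=[1.0, 0.5, 0.1])
```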
- Figure 8 shows the Deep Molecular Conformation Generation from the 2D graph:
- the input is the graph, and the goal is to generate an ensemble of stable (low energy) 3D conformations. The process is initialized with random 3D coordinates for the molecule in 3D space, and in each iteration these coordinates change a little towards a more stable conformation.
- the coordinate change is guided by a pseudo-force, which comes from the estimation of the score.
- the score is the gradient of the log density of the data distribution, and it is learned based on the training data. After that, this score is used to guide the atoms to a specific conformation through stochastic Langevin dynamics.
- a machine learning model has been leveraged for generating conformations from input molecular graphs, so data has been used for training the model.
- the data that has been used for training is the GEOM-QM9 and GEOM-DRUGS data (Axelrod & Gomez-Bombarelli, 2020), which consists of molecular graphs and corresponding ground truth conformations.
- QM9 contains smaller molecules (up to 9 heavy atoms), whereas DRUGS contains larger, drug-like molecules. More information about the training dataset can be found in Table 1.
- Tab. 1 shows the statistics of the GEOM data, which contains the QM9 and DRUGS datasets.
- the method that is used in the present example is based on score matching generative models that have been used recently in the machine vision domain for generating realistic images (Song & Ermon, 2019).
- the goal of a score-based generative model is to estimate the score (gradient of the data distribution with respect to data) by minimizing the following loss.
- This gradient can be considered as some pseudo force that guides the evolution of molecules towards stable (low energy) conformations.
- a noise conditional score-based generative model (Song & Ermon, 2021) is used.
- the goal is to estimate the noisy version of the data score:
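For reference, the standard noise-conditional (denoising) score matching objective introduced by Song & Ermon has the following form; this is the generic formulation of the named technique, and the exact loss used in the described model may differ in its details:

```latex
\ell(\theta;\sigma) \;=\; \tfrac{1}{2}\,
  \mathbb{E}_{p_{\mathrm{data}}(\mathbf{r})}\,
  \mathbb{E}_{\tilde{\mathbf{r}}\sim\mathcal{N}(\mathbf{r},\,\sigma^{2}I)}
  \Big[\,\big\|\,\mathbf{s}_{\theta}(\tilde{\mathbf{r}},\sigma)
  + \tfrac{\tilde{\mathbf{r}}-\mathbf{r}}{\sigma^{2}}\,\big\|_{2}^{2}\,\Big],
\qquad
\mathcal{L}(\theta) \;=\; \frac{1}{L}\sum_{i=1}^{L}\sigma_{i}^{2}\,\ell(\theta;\sigma_{i})
```

Here r denotes the atom coordinates, σ_1 > ... > σ_L the noise levels, and s_θ the score network, whose output approximates the score of the noise-perturbed data distribution.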
- the score network s(r; θ) can be anything that maps the input molecule to the gradient with respect to the input coordinates (the output is 3N-dimensional, where N is the number of atoms in the molecule).
- a message passing neural network (MPNN) is used as the score network.
- the input to the MPNN is a molecule (graph) with nodes (atoms) and edges (bonds).
- Figure 9 shows a message passing neural network.
- the update functions φ_e, φ_v, φ_u update the edge (e), node (v), and global (u) features, respectively; the aggregation function ρ_e→v reduces edge features to nodes, and ρ_v→u reduces node features to the global features.
- the score network, as shown in Figure 10, is an MPNN that updates the edge and node features at each step.
- the output consists of three coordinates for each node, which represent the pseudo-force (gradient) that changes the position of that node.
- An MPNN layer updates the edge features e_ij and the node features v_i, and computes a global feature u at each step.
- edge features can be updated by using a learned function of the current edge feature as well as the node features of the connected nodes. Then, for each node, the edge features of the connected edges can be aggregated and the node features updated using a learned function of this aggregation.
- global features belong to the whole graph, in this case the molecule.
- ρ_e→v denotes a differentiable, permutation invariant aggregation function, e.g., sum, mean or max, and φ_e, φ_v, φ_u denote differentiable functions the parameters of which can be learned, such as MLPs (Multi-Layer Perceptrons).
- element-wise summation has been used for the aggregation function and MLPs for the differentiable functions.
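A minimal sketch of one such MPNN update step, assuming MLPs as the learned functions φ and element-wise sum as the aggregation ρ, is shown below; the feature dimensions and the final 3D pseudo-force head are illustrative assumptions, not the exact score network described here:

```python
import torch
from torch import nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class MPNNLayer(nn.Module):
    def __init__(self, node_dim, edge_dim, global_dim):
        super().__init__()
        self.phi_e = mlp(edge_dim + 2 * node_dim, edge_dim)     # edge update
        self.phi_v = mlp(node_dim + edge_dim, node_dim)         # node update
        self.phi_u = mlp(global_dim + node_dim, global_dim)     # global update

    def forward(self, v, e, u, edge_index):
        src, dst = edge_index                                    # (2, num_edges)
        # update each edge from its current feature and the features of its two nodes
        e = self.phi_e(torch.cat([e, v[src], v[dst]], dim=-1))
        # rho_{e->v}: element-wise sum of incoming edge features per node
        agg = torch.zeros(v.size(0), e.size(-1)).index_add_(0, dst, e)
        v = self.phi_v(torch.cat([v, agg], dim=-1))
        # rho_{v->u}: sum of node features into the global feature
        u = self.phi_u(torch.cat([u, v.sum(dim=0, keepdim=True)], dim=-1))
        return v, e, u

# toy molecule: 12 atoms, 22 directed bonds
v, e, u = torch.randn(12, 16), torch.randn(22, 8), torch.zeros(1, 4)
edge_index = torch.randint(0, 12, (2, 22))
layer = MPNNLayer(node_dim=16, edge_dim=8, global_dim=4)
v, e, u = layer(v, e, u, edge_index)
pseudo_force = nn.Linear(16, 3)(v)                               # per-atom 3D gradient (score)
```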
- the generation is initiated from random coordinates in 3D; the coordinates are updated sequentially based on the learned score to arrive at an ensemble of low energy conformations.
- each generated linker graph corresponds to a complete degrader graph when considered with the two fragments.
- based on graph/pharmaceutical metrics such as uniqueness, chemical validity, quantitative estimate of drug-likeness, synthetic accessibility, toxicity, solubility, ring aromaticity, and/or pan-assay interference compounds, a fitness score can be reported to the surrogate model.
- the energy, as determined either by classical methods such as force fields or by dedicated machine learning algorithms and normalized per degree of freedom of the molecule, presents itself as a viable measure of the validity of the degrader, since it reports on the molecule's strain.
- by removing the relative orientations from its architecture, the model can generate linker graphs without any structural information as input. In that case, however, the quality of the generated linkers is expected to be lower.
- the model is inspired by DeLinker (Imrie, et al., 2020), with most fundamental differences listed at the bottom of this section.
- the model is a Variational Autoencoder (VAE), whereby both the encoder as well as the decoder are implemented via standard Gated Graph Neural Networks (GGNN).
- the decoder takes as input a set of latent variables and generates a linker to connect the input fragments.
- the encoder on the other hand, imposes a distribution over the latent variables that is conditioned on the graph and structure of the unlinked input fragments.
- the fragment graph X is processed using the encoder GGNN, yielding the set of latent variables z_v, one for each node (atom) in the graph.
- the decoder is fed a low-dimensional latent vector z derived via a learned mapping from the node embeddings of the label (ground truth) degrader (i.e., the target degrader supposed to be generated). Loosely speaking, this allows the decoder to learn to generate different "types" of linkers conditioned on z (i.e., via a conditioned multi-modal distribution).
- the model can be augmented to learn a prediction of constraints such as toxicity and the like. Then, during runtime, by optimizing over z, z_v, the decoder can improve the quality of the generated linkers with respect to these constraints.
- both z and z_v are regularized to approximate the standard normal distribution.
- a set of candidate atoms are added to the graph and initialized with random node features. Using these features, the atom types are initialized.
- the features z_v, z, the atom types l_v, and the features and types of the candidate atoms are initialized.
- at each step, one bond, which can be of any type, is chosen connecting an unconnected candidate node to an already connected node in the graph.
- the valency of the already connected node also affects the choice of the bond. Bonds continue to be chosen for this node until a bond to a special "STOP" atom is picked, at which point the next connected atom in the queue is chosen. This queue is created and traversed in a breadth-first manner. Note that every bond that is selected changes the graph. This means that the features z_v, l_v are recomputed in each iteration.
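The breadth-first generation loop described above can be sketched schematically as follows; the learned bond selection is replaced by a random stand-in (`pick_bond`), so the snippet only illustrates the control flow (candidate atoms, STOP token, breadth-first queue), not the actual trained decoder:

```python
import random
from collections import deque

STOP = "STOP"

def pick_bond(focus, unconnected, rng):
    """Stand-in for the learned bond selection; the real model scores all candidate bonds."""
    return rng.choice(list(unconnected) + [STOP])

def generate_linker(fragment_exit_atoms, num_candidate_atoms=6, seed=0):
    rng = random.Random(seed)
    unconnected = set(range(num_candidate_atoms))        # candidate linker atoms
    bonds, queue = [], deque(fragment_exit_atoms)        # breadth-first queue of connected atoms
    while queue:
        focus = queue.popleft()
        while True:
            target = pick_bond(focus, unconnected, rng)
            if target == STOP:                           # move on to the next connected atom
                break
            bonds.append((focus, f"L{target}"))          # add a bond (type omitted in this sketch)
            unconnected.discard(target)
            queue.append(f"L{target}")                   # newly connected atom joins the queue
            # in the real model, z_v and the atom-type features are recomputed here
    return bonds

print(generate_linker(["E1", "E2"]))
```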
- During generation, one can draw z from a standard normal distribution and add noise to the encoding of X to calculate z_v. Note that, if during training one learns to predict the properties mentioned below as a function of z, z_v, then during generation one can optimize over z, z_v to condition the model to generate degraders of better quality. Properties such as a quantitative estimate of drug-likeness, synthetic accessibility, toxicity, solubility, ring aromaticity, and/or pan-assay interference compounds are considered in this context.
- Figure 12 illustrates the structural information provided, i.e., the fragments' relative orientation. This allows direct interfacing with the RRT coordinates used in the Bayesian Optimization pipeline. (The relative orientation coordinates fed to the Deep Linker Generation model: the two rings represent the fragments of a degrader. The distance from atom L_1 to atom L_2, the angles between the vectors E_1-L_1 and L_1-L_2 (α_1) as well as between the vectors L_1-L_2 and L_2-E_2 (α_2), and the dihedral angle φ (stemming from all three mentioned vectors) are processed by the model as structural information.)
- E_1-L_1 and E_2-L_2 constitute rotatable bonds by design of the graph generation model.
- the following bond-angle-torsion coordinates completely specify the relative orientation of the fragments: the lengths E_1-L_1, L_1-L_2, L_2-E_2, the bond angles α_1 and α_2 and the dihedral angle φ.
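Computing such bond-angle-torsion coordinates from the 3D positions of the four anchor atoms E_1, L_1, L_2, E_2 could look roughly as follows (a plain NumPy helper with arbitrary example coordinates; the angle convention used here, i.e., the bond angle at the central atom, is an assumption):

```python
import numpy as np

def bond_angle_torsion(e1, l1, l2, e2):
    """Return the lengths E1-L1, L1-L2, L2-E2, the angles alpha1, alpha2 and the dihedral phi."""
    e1, l1, l2, e2 = map(np.asarray, (e1, l1, l2, e2))
    b1, b2, b3 = l1 - e1, l2 - l1, e2 - l2

    def angle(u, v):
        cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

    # dihedral angle phi around the L1-L2 axis (standard atan2 formulation)
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    phi = np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))

    lengths = [np.linalg.norm(b) for b in (b1, b2, b3)]
    # bond angle at L1 (between L1->E1 and L1->L2) and at L2 (between L2->L1 and L2->E2)
    return lengths, angle(-b1, b2), angle(-b2, b3), phi

lengths, alpha1, alpha2, phi = bond_angle_torsion(
    [0, 0, 0], [1.5, 0, 0], [2.5, 1.0, 0], [4.0, 1.0, 0.5])
print(lengths, alpha1, alpha2, phi)
```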
- the physical bond lengths hardly vary.
- the atom types of L_1 and L_2 are not available prior to the graph generation process but are modeled as placeholder atoms. Thus, the model is not fed with the bond lengths L_1-E_1 and L_2-E_2.
Abstract
The invention relates to a computer implemented, machine learning based method for determining ternary complexes in targeted protein degradation by representing biomolecules as graphs and then feeding these graphs as inputs into a machine learning system, comprising the steps of: determination of the 3D structure of the relevant proteins (1); determination of the interactions between each fragment of the degrader and the corresponding proteins as well as identification of the corresponding interaction sites (2); protein-protein complex prediction (3); and refinement of the ternary complex with the designed linker (4).
Description
Title of the invention
Ternary complex determination for plausible targeted protein degradation using Deep Learning and design of degrader molecules using Deep Learning.
Field of the invention
The present invention relates to a computer implemented, machine learning based method for determining ternary complexes in targeted protein degradation.
Background of the invention
Proteins play critical roles in maintaining the life of organisms. Correct protein folding controls cell health and survival. However, most proteins are inherently prone to aggregation in their misfolded or partially misfolded state. In addition, misfolding or misregulation of proteins leads to the development of many diseases, including neurodegenerative diseases, cancers and type 2 diabetes mellitus. Therefore, cells must constantly adjust their protein composition to maintain a homeostasis of their proteomes. Misfolded proteins are refolded or degraded by quality control systems, and elimination of misfolded proteins is critical for maintaining protein homeostasis and cell viability.
Under physiological conditions a complex network that includes folding enzymes, chaperones, lectins and ATP-driven motors controls the elimination of misfolded proteins. The ubiquitin-proteasome system (UPS) and autophagy are the two major intracellular pathways for protein degradation. The UPS and autophagy have long been considered as independent degradation pathways with little or no interaction points. In spite of growing evidence of close coordination and complementarity between the two systems, they are actually different mechanisms: UPS is responsible for the degradation of short-lived proteins and soluble misfolded proteins, whereas autophagy eliminates long-lived proteins, insoluble protein aggregates and even whole organelles (such as mitochondria, peroxisomes), macromolecular compounds, and intracellular parasites (e.g., certain bacteria).
In addition, small interfering RNA (siRNA) and clustered regularly interspaced short palindromic repeats/associated protein nuclease (CRISPR-Cas9) technologies can also down-regulate or eliminate proteins. However, these two technologies also have limitations: for example, CRISPR-Cas9 technology has undesired off-target effects and low efficiency, which limit its application in vivo. Inefficient delivery to target cells in vivo and non-specific immune responses following systemic or local administration are barriers for the clinical application of siRNA. Researchers are still developing various technology platforms to improve in vivo delivery of therapeutic siRNA.
In addition, heat shock proteins (HSPs) also play important roles in protein kinase degradation. For example, the levels of many oncogenic kinases, such as ERBB2, BRAF-V600E, FGFR-G719S and BCR-ABL, are reported to be tightly coupled to heat shock protein 90 (HSP90).
The methods mentioned above for controlling protein degradation are mostly achieved via biomacromolecules. In order to target a broader range of proteins with sufficiently high efficiency for clinical application, in recent years pharmaceutical researchers have developed a series of new strategies for protein degradation using small molecules. One representative strategy is mono- and heterobifunctional degraders that degrade proteins by hijacking the UPS. These degraders are small molecules that bind both E3 ubiquitin (U) ligase and target proteins, thereby leading to the exposed lysine on the target protein being ubiquitinated by the E3 ubiquitin ligase complex, followed by UPS-mediated protein degradation. Theoretically, degraders not only provide binding activity, but also have great potential to eliminate protein targets that are "undruggable" by traditional inhibitors or are non-enzymatic proteins, e.g., transcription factors. In addition, the degrader technique is "event-driven", which does not require direct inhibition of the functional activity of the target protein. These characteristics make degrader technologies an attractive strategy for targeted protein degradation (TPD).
Therefore, targeted protein degradation using the mono- and heterobifunctional degrader technologies is emerging as a novel therapeutic method to address diseases, such as cancer, driven by the aberrant expression of a disease-causing protein.
The binding of a degrader molecule to a target protein (protein of interest) as well as to an E3 ligase at the same time results in the formation of a ternary complex. This ternary complex can induce the targeted degradation of the pathogenic protein, as the E3 ligase triggers protein degradation via proteasomes by ubiquitination. In order to induce TPD, positive cooperativity between the molecules forming the ternary complex is necessary.
Ternary complex formation in degrader function has been known for several years, as degraders that are weaker binders can also induce the degradation of proteins under the condition of ternary complex formation between a protein of interest, a degrader molecule, and a recruited E3 ligase. The significance of such ternary complexes was shown with the first ternary complex crystal structures, which displayed positive cooperativity and newly formed protein-protein interactions.
According to the state of the art, the determination of ternary complexes is performed by traditional computer-based methods such as molecular dynamics simulations and docking. AutoDock, AutoDock Vina, DOCK, FlexX, GLIDE, GOLD, and similar software are used for fragments and, e.g., Zdock as well as RosettaDock for proteins.
Two recent publications, "In Silico Modeling of PROTAC-Mediated Ternary Complexes: Validation and Application" (Michael L. Drummond and Christopher I. Williams; J. Chem. Inf. Model. 2019, 59, 4, 1634-1644) and "PRosettaC: Rosetta Based Modeling of PROTAC Mediated Ternary Complexes" (Daniel Zaidman, Jaime Prilusky, and Nir London; J. Chem. Inf. Model. 2020, 60, 10, 4894-4903), employ these methods. Both articles treat the field of targeted protein degradation. The ternary complex predictions of the presented frameworks are validated via the reconstruction of already known ternary complex crystal structures.
CN109785902A provides a method to predict the degradation of target proteins by means of state-of-the-art techniques in the field of homology modeling, molecular dynamics simulations and docking, or by means of Convolutional Neural Networks. However, the problem of predicting ternary complexes involves resolving a significantly larger set of interactions.
For an accurate determination of ternary complexes, not only are fragment-protein interactions and protein-protein interactions crucial, but importantly, the effects that the linker imposes on these interactions need to be considered as well.
It is an object of the invention to enable the process of designing a degrader molecule that results in a stable ternary complex.
It is also an object of the invention to determine the structure (conformation, orientation in a 3D space) of the ternary complex that results from a particular degrader.
SUMMARY OF THE INVENTION
The object of the invention is achieved by the independent claims. The dependent claims describe advantageous developments and modifications of the invention.
According to the invention, a framework for ternary complex formation is provided, which enables the treatment of this cluster of interactions via the use of machine learning models.
Brief description of the drawings
Figure 1 shows a summary of the protocol for degrader design and ternary complex prediction.
Figure 2 shows the method of estimation of chemo-geometric features.
Figure 3 shows the main DGRL pipeline.
Figure 4 shows the method for estimation and pre-processing for the protein component.
Figure 5 shows a fragment-protein interaction module.
Figure 6 shows a protein-protein interaction module.
Figure 7 shows the Bayesian Optimization Loop.
Figure 8 shows Deep Molecular Conformation Generation from the 2D graph.
Figure 9 shows Message Passing Neural Networks.
Figure 10 shows an example of a score network.
Figure 11 shows the Deep Linker Generation.
Figure 12 shows the relative orientation coordinates fed to the Deep Linker Generation model.
Tab. 1 shows the statistics of the GEOM data, which contains the QM9 and DRUGS datasets.
The illustration in the drawings is in schematic form. It is noted that in different figures, similar or identical elements may be provided with the same reference signs.
Description of the drawings
Figure 1 shows a summary of the method for degrader design and ternary complex prediction. The method consists of four serial steps: the 3D structure determination of proteins 1, the interaction determination between protein and ligand 2, the protein-protein complex generation 3, and the refinement of the ternary complex structure 4.
The core of the present invention is a new method for determining a degrader molecule and the associated ternary complex, using machine learning modules to tackle the various requirements of ternary complex determination.
Additionally, a specialized optimization method using Bayesian Optimization (BO) is presented, which allows the efficient inclusion of the effect of the linker on the ternary complex, while simultaneously informing the linker generation.
The method according to the invention also allows the determination of the ternary complex formed by a pre-designed, e.g., human-designed degrader molecule, thus serving as an in-silico tool to validate manually designed degraders.
As shown in Figure 1, the method comprises the following four major steps:
Step 1. 3D structure determination of relevant proteins (E3 ligase and the protein of interest).
Step 2. Determination of the interactions between each fragment of the degrader and the corresponding proteins as well as identification of the corresponding interaction sites using module "Deep Interaction Prediction" DIP.
Step 3. Protein-Protein complex prediction using the modules "Bayesian Optimization" BO, "Deep Linker Generation", "Deep Molecular Conformation Generation", and "Deep Graph Representation Learning".
Step 4. Refinement of the ternary complex, with the designed linker.
As mentioned above the following functional modules are used in the method:
Module "Deep Interaction Prediction" DIP is used for converting the geometry of the protein molecule and degrader fragments into a graph and applying deep learning techniques to this graph to determine properties such as the protein-fragment and protein-protein interactions (used in Steps 2 and 3 above).
With Module "Deep Linker Generation "DLG a valid linker sub-structure is generated that connects the two molecules on basis of the coordinates of the two fragments of a degrader molecule. The validity of this generated linker is scored on metrics such as drug-likeness, synthetic accessibility, toxicity, and/or solubility. In conjunction with the modul "Deep Molecular Conformation Generation" DMCG e, this model plays a key role in providing a fitness function for the Bayesian Optimization loop in Step 3.
Module "Deep Molecular Conformation Generation" DMCG is used to efficiently generate a large number (> 100 000) of conformations of a designed degrader. When given a pre-designed degrader (either man-
ually or via Deep Linker Generation), this set of conformations enables the determination, which conformations of the linker and consequently the degrader are valid within the ternary complex structure. This information provides a key fitness function for the Bayesian Optimization loop in Step 3.
The methodology for determining a ternary complex includes the following steps, which are briefly described in the subsections below. For more details regarding the Bayesian Optimization loop and the three deep learning modules, see the section on our modules.
The value chain for designing a degrader molecule starts with an amino acid sequence or protein structure that acts as a potential target for a degrader molecule. With significant progress in predicting the 3D structure from an amino acid sequence, referred to as the 'protein folding problem', the method according to the invention starts from such information.
Starting from an amino acid sequence associated with proteins, or SMILES strings associated with fragments, the 3D structure is determined via in-house models that are inspired by open-source frameworks such as AlphaFold and RoseTTAFold for proteins or RDKit in the case of fragments. Homology modeling can also be used here. In addition, the direct use of experimentally determined 3D structures as an input to the pipeline is possible. This step outputs 3D structures not only of the proteins of interest but also of the E3 ligases.
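For illustration, a minimal sketch of generating a 3D fragment structure from a SMILES string with RDKit is given below. The SMILES string, the random seed and the force field choice are placeholders, not parameters prescribed by the invention.

```python
# Minimal sketch: 3D structure generation for a degrader fragment from a SMILES
# string using RDKit (the SMILES below is an arbitrary placeholder).
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "c1ccc(CC(=O)N)cc1"            # placeholder fragment
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)                   # explicit hydrogens improve embedding

AllChem.EmbedMolecule(mol, randomSeed=42)   # generate an initial 3D conformation
AllChem.MMFFOptimizeMolecule(mol)           # relax it with the MMFF force field

coords = mol.GetConformer().GetPositions()  # (n_atoms, 3) numpy array
print(coords.shape)
```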
Fragment-protein interactions (either fragment-E3 ligase or fragment-pathogenic protein) are then determined using the module Deep Interaction Prediction. The output is essential to understand the extent and mode of the fragment binding to the corresponding protein. This interaction also plays a role in determining the way in which the constituent proteins interact within the ternary complex.
The computation of the protein-protein interactions and the resulting complex formed is the deciding factor in solving the problem of ternary complex determination. This is because the protein-protein interaction is the primary interaction stabilizing the ternary complex.
When calculating the structure of protein-protein complexes, one sees that the proteins can interact with each other in a large number of possible orientations and conformations, each resulting in a candidate structure for the protein-protein complex. However, not all protein-protein complexes are feasible when considering the presence of a degrader molecule. For many complexes, when one considers the positioning of the fragments bound to the respective proteins, it is seen that a valid linker molecule cannot be designed.
Using this fact, an iterative optimization process with active learning and Bayesian Optimization is applied, which uses the constraints imposed by the linker design to determine the structure of the protein-protein complex. In the Bayesian Optimization loop, a fitness function for each candidate protein-protein structure is computed with the help of the following modules.
Module Determination of Protein-Protein Interactions DIP. Using the graph representations of the E3 ligase and the proteins of interest as created by the Deep Interaction Prediction module, the protein-protein interactions between the E3 ligase and the proteins of interest are predicted using graph-based convolutional neural networks.
Module Deep Linker Generation. Generative models are used to predict whether a valid linker can be generated to connect the fragments as bound in this protein-protein complex. The model takes into account the relative positions and orientations of the degrader fragments as well as pharmacological constraints to design a valid linker. This makes it possible to discard protein-protein complexes for which the bound degrader fragments cannot be linked by a valid linker structure.
Once the degrader molecules have been generated, this method allows the efficient generation of a potentially large dataset of conformations (> 100 000). This conformation generation is used to score the linkers generated by Deep Linker Generation above. Additionally, when dealing with a pre-designed degrader, the probability of a valid degrader conformation within a particular protein-protein complex candidate can be determined by analyzing a large dataset of generated conformations. This gives an additional score that allows the candidate protein-protein complexes to be filtered for viability.
The efficiency of this approach is twofold. The use of the Bayesian Optimization loop along with the fitness function ensures that no computation is wasted on protein-protein complex candidates that are invalid in the context of the degrader molecule.
Additionally, the use of the deep-learning modules for protein-protein interactions, linker generation and molecular conformation generation means that the space of interactions in the ternary complex can be screened while avoiding expensive docking and molecular dynamics simulations.
To calculate a final stable ternary complex structure, a Monte Carlo based method is used to pack the designed linker into the complexes and perform energy minimization. Candidates for this include the AMBER and Merck force fields for the degrader molecule and PyRosetta for the proteins and ternary complexes. Clustering techniques are then used to choose the complexes with the best energy and consensus from the possible ternary complexes.
The optimization loop above means that this step, although more computationally expensive per computation, becomes cheap because it is applied only to a small number of candidate structures.
The goal of the pipeline is the determination of ternary complex structures consisting of the proteins of interest, the degrader and the E3 ligase. This in turn involves modeling the interactions between proteins, i.e., the proteins of interest and the E3 ligase, as well as between proteins and the degrader. Typical methods to achieve this apply particularly expensive docking operations.
With the present method, the use of deep learning methods that first process the structural information of the proteins and fragments into graph representations is proposed. These representations are then processed using graph-based convolutional neural networks, which consider the geometric and chemical properties of the biomolecules, to compute features that can be used to predict the interactions between the respective molecules.
In mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called vertices (also called nodes or points) and each of the related pairs of vertices is called an edge (also called link or line). Typically, a graph is depicted in diagrammatic form as a set of dots or circles for the vertices, joined by lines or curves for the edges.
According to the present invention, molecules are represented as graphs through their point clouds and chemo-geometric features, and this representation is processed using deep graph representation learning DGRL network architectures. The final deep learning architecture leverages the fact that all the nodes in a certain neighborhood of a node share common properties with that node (in the real world but also in their graph representation). These properties, which are expressed with edges, can be "summarized" with the help of weight sharing. That is the reason why the main layer components of the used neural network are convolutional layers.
A brief summary of the layers in the used architecture is as follows: At first, Cluster-GCN (Chiang, et al., 2019) is used. This convolutional layer architecture has not only demonstrated superior performance on similar molecular datasets, but it also reduces the memory and time complexity by a large margin. This fact is of considerable importance because the network has to be fast during runtime. The subsequent layer is GraphConv (Morris, et al., 2019). This convolutional layer architecture proved useful not only because of its self-supervised representation-learning capabilities, which allow it to exploit atom-level complexities, their geometries and all of the interactions between the atoms, but also because of its efficiency in computing the graph convolutions, which is again important during runtime. These two main layers are combined using PyTorch-Geometric code (Fey & Lenssen, 2019) that performs the standard batching and pooling and glues the layers together, so that a progressively lower-dimensional representation is reached until the final prediction of the score function is made.
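A minimal sketch of such a layer stack is given below. The feature sizes, layer count and readout are illustrative assumptions and do not reproduce the actual architecture; only the two named PyTorch-Geometric layer types are taken from the description above.

```python
# Minimal sketch: a PyTorch-Geometric stack combining ClusterGCNConv and
# GraphConv layers for per-node binary classification (e.g., interaction-site
# prediction on surface points). Sizes are illustrative only.
import torch
import torch.nn.functional as F
from torch_geometric.nn import ClusterGCNConv, GraphConv

class SiteClassifier(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.conv1 = ClusterGCNConv(in_dim, hidden)
        self.conv2 = GraphConv(hidden, hidden)
        self.out = torch.nn.Linear(hidden, 1)      # one logit per node

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return self.out(x).squeeze(-1)

# Toy usage: 100 surface points with 16 chemo-geometric features each.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))       # random graph for illustration
probs = torch.sigmoid(SiteClassifier(16)(x, edge_index))   # interaction-site probabilities
```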
The first step in the deep graph representation learning DGRL pipeline is to map the initial 3D structures of proteins and fragments to a suitable representation that respects the chemo-geometric properties of the biomolecules involved. Subsequently, deep graph representation learning DGRL methods are applied to model the respective interactions.
Once the 3D structures of proteins/fragments have been obtained, e.g., by means of a machine learning model or through access to experimental data, the coordinates of proteins and fragments are mapped to a graph representation. In its most general form, a graph consists of nodes and edges, i.e., atoms and their connections. In this way, the pipeline can model neighborhood relations between the constituents of proteins/fragments, which are used to embed chemical properties in abstract feature vectors that are used for making predictions, as explained below.
First, the procedure for fragments is sketched: The graph that describes a degrader fragment is constructed using either k-nearest-neighbor or ball queries. In the former approach, a node, e.g., an atom, is connected to k of its neighboring nodes. Ball query graphs, on the other hand, are constructed by specifying a cutoff distance. If the distance between two constituents lies below this threshold value, the algorithm is allowed to place an edge.
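A minimal sketch of the two graph constructions, using a k-d tree, is shown below. The coordinates, the value of k and the cutoff distance are placeholders for illustration.

```python
# Minimal sketch: k-nearest-neighbour vs. ball-query graph construction.
import numpy as np
from scipy.spatial import cKDTree

coords = np.random.rand(30, 3) * 10.0     # placeholder 3D atom coordinates
tree = cKDTree(coords)

# k-nearest-neighbour graph: connect each atom to its k closest atoms.
k = 4
_, idx = tree.query(coords, k=k + 1)       # first neighbour is the atom itself
knn_edges = [(i, j) for i, row in enumerate(idx) for j in row[1:]]

# Ball-query graph: place an edge whenever two atoms are closer than a cutoff.
cutoff = 2.5                               # illustrative threshold (Angstrom)
ball_edges = list(tree.query_pairs(r=cutoff))

print(len(knn_edges), "kNN edges,", len(ball_edges), "ball-query edges")
```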
In the case of proteins, it is not the atoms of the atomic point cloud that form the nodes of the graph. Instead, the model computes a representation for the surface of the proteins on which the DGRL models operate. Suitable surface representations are given by surface meshes, which are computed by triangulation of the (virtual) protein surface, or surface point clouds. It is these points on this virtual surface that are connected to their neighboring points to form the relevant graph.
As shown in Figure 2, 3D coordinates of the estimated protein surface, 3D atomic coordinates and their respective atom types and, lastly, the normal vectors which are estimated based on the local coordinate features are used as input for the estimation of the chemo-geometric features.
For the chemical features, structural information of atoms, e.g., interatomic distances and relative angles, and other features such as hydrophobicity are modelled explicitly. The full set of features is aggregated and fed into the deep graph representation learning DGRL pipeline, which is used to learn low dimensional representations of the inputs and the generated features. For the geometric features the following strategy is used: Starting from the 3D coordinates of the surface representation, curvature features, e.g., the curvedness and shape index (Koenderink & Doorn, 1992), are determined. Intermediate features are also learned from these inputs before again applying an aggregation layer and letting the deep graph representation learning DGRL pipeline learn a suitable low dimensional representation.
After the initial data, i.e., the raw atomic coordinates of proteins and fragments, are mapped to a suitable graph representation, the pipeline proceeds to generate embeddings of chemical and geometrical properties of the molecules. This assumes that a complete description of chemo-geometric features is needed to model protein-protein and protein-fragment interactions accurately.
For fragments the procedure is straightforward. Due to the graph structure of small molecules, well-known deep graph representation learning DGRL strategies can be employed to learn embeddings of chemical information on the nodes of the graphs. To describe the 3D structures free of any bias from the center of mass and global rotations, the deep graph representation learning DGRL models depend only on inter-atomic distances and angles between constituents.
The situation is different for proteins, where the graph representation consists of points on the surface mesh or surface point cloud, which do not correspond directly to the constituents of the protein. In order to embed the chemical features onto these points, a graph is created in which each surface point is connected to the k atoms of the protein that are closest (by Euclidean distance) to it. The chemical information associated with the atoms is processed, and representations of it are embedded onto the surface points by the use of deep graph representation learning DGRL methods. More concretely, different convolutional and attention layers are leveraged to learn a low dimensional representation of the chemical information. This learning is based not only on the 3D coordinates of the atoms and the atom types; some chemical information is also generated explicitly and fed into the deep graph representation learning DGRL module. More concretely, this information consists of angles between atoms, interatomic distances, hydrophobicity and hydrogen bond potential. It has been observed that providing some explicit information lets the network learn the remaining, hidden features more easily.
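As a simple stand-in for the learned convolutional/attention embedding, the sketch below connects each surface point to its k nearest atoms and aggregates their chemical features with inverse-distance weights. The shapes, k, and the weighting scheme are illustrative assumptions, not the learned embedding itself.

```python
# Minimal sketch: pushing atom-level chemical features onto surface points via
# a bipartite k-nearest-atom graph and distance-weighted averaging.
import numpy as np
from scipy.spatial import cKDTree

atom_xyz = np.random.rand(200, 3) * 30.0    # placeholder atomic coordinates
atom_feat = np.random.rand(200, 8)          # placeholder chemical features
surf_xyz = np.random.rand(500, 3) * 30.0    # placeholder surface point cloud

k = 6
dists, nbr = cKDTree(atom_xyz).query(surf_xyz, k=k)   # k nearest atoms per point

w = 1.0 / (dists + 1e-6)                    # inverse-distance weights
w = w / w.sum(axis=1, keepdims=True)        # normalised per surface point

surf_chem = (atom_feat[nbr] * w[..., None]).sum(axis=1)   # (500, 8) surface features
print(surf_chem.shape)
```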
An important geometric feature that becomes available when processing surfaces is curvature, which plays an important role in identifying interaction sites for both fragments and proteins on a protein surface. To model geometric information, the procedure is similar to the chemical approach. Along with the initial 3D coordinates of the point clouds and the local coordinate features that are fed into the deep graph representation learning DGRL, some explicit geometric features based on the curvature are generated in addition. Examples of these features are the shape index and the curvedness (Koenderink & Doorn, 1992); shallow MLPs (Multi-Layer Perceptrons) are used to learn intermediate geometric features. These features depend on the principal curvatures, which are defined as the maximum and the minimum normal curvatures at a given point. Existing approaches suggested in the literature, such as (Melzi, et al., 2019) and (Cao, et al., 2021), are used to calculate these curvatures. Again, the motivation for using explicit geometrical features is to take the burden off the main deep graph representation learning DGRL to learn these features. In this way, it can focus on condensing latent features into a low dimensional representation, which helps the algorithm to make correct predictions.
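The two explicit curvature descriptors can be computed from the principal curvatures as sketched below. The estimation of the principal curvatures themselves (e.g., via the cited approaches) is not shown; the curvature values here are random placeholders, and the sign convention of the shape index follows one common reading of Koenderink & van Doorn.

```python
# Minimal sketch: shape index and curvedness from principal curvatures k1 >= k2.
import numpy as np

k1 = np.random.randn(500)                        # placeholder principal curvatures
k2 = np.random.randn(500)
k1, k2 = np.maximum(k1, k2), np.minimum(k1, k2)  # enforce k1 >= k2

eps = 1e-9
# Shape index in [-1, 1]: distinguishes cups, saddles, ridges and caps.
shape_index = (2.0 / np.pi) * np.arctan((k1 + k2) / (k2 - k1 - eps))
# Curvedness: overall magnitude of curvature, independent of the shape type.
curvedness = np.sqrt((k1 ** 2 + k2 ** 2) / 2.0)

geometric_features = np.stack([shape_index, curvedness], axis=1)   # (500, 2)
```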
Figure 3 shows the main deep graph representation learning DGRL pipeline: In this block, all the pre-processing steps are combined in one final model with various convolutional layers. These layers mainly consist of GraphConv (Morris, et al., 2019) and ClusterGCNConv (Chiang, et al., 2019) and were constructed
with a manual hyperparameter search to minimize the loss and achieve the best ROC-AUC score possible for the classification.
Given that the protein of interest will have an interaction located on dedicated regions of its surface, it is necessary to predict the areas where this interaction occurs. At this stage, as outlined above, the necessary pre-processing of the 3D structures as well as the necessary chemical and geometrical representations of the protein surfaces has already been accomplished. Thus, the pipeline proceeds to learn, with the help of geometric deep learning, which surface regions are the interaction sites. More concretely, the process of achieving the interaction site classification can be divided into two parts. The first one, as noted above, is performed with suitable chemo-geometric features, for which the best low-dimensional representation has been learned. The subsequent step is applying the main deep graph representation learning DGRL pipeline on these features so that the classification can eventually be performed. The main idea behind this pipeline is weight sharing, i.e., convolutions. Most properties of an atom or surface point depend on its immediate neighborhood. Since it is a reasonable assumption that points in a neighborhood interact similarly across different neighborhoods, the choice of convolutional layers is a natural one to calculate the low dimensional representations of these properties. The choice of convolutional layers (GraphConv and Cluster-GCN) is motivated by their previous performance on molecular data as well as their theoretical advantages, such as their ability to perform self-supervised learning and to exploit atomic level granularity, as well as their superior speed. The latter is important for the runtime systems. This classification will be of importance for further tasks down the pipeline.
Figure 4 shows a Deep Interaction Prediction module: Taking inputs in the form of atomic 3D coordinates and atom types, this information is used to estimate the protein surface. For the calculation of protein surfaces, standard algorithms for converting point cloud representations into meshes, e.g., Points2Surf (Erler, et al., 2020) and Delaunay triangulation, are used. After calculating the protein surface and selecting patches, the patches are forwarded together with the atomic coordinates and the atom types into a pipeline to generate geometric, chemical, and local coordinate features. After the generated chemical and geometric features have been concatenated, and the local coordinate features have been created, this information is forwarded in the form of graph representations into a deep learning pipeline with multiple convolutional layers that ought to learn deep relationships and rotational invariance of the protein surfaces in question.
To summarize: The main components are GraphConv (Morris, et al., 2019) and ClusterGCNConv (Chiang, et al., 2019) layers, which are combined to perform the binary classification indicating whether the surface region in question is a potential interaction site.
Figure 5 shows a fragment-protein interaction module: Estimation and pre-processing for the protein component in this architecture is the same as for the interaction site prediction presented in Figure 4. The other constituent of the input pair, i.e., the fragment, needs a different representation. The start is similar for the fragment, where atom coordinates and atom types are taken. The 3D structure of the fragment is then mapped to a graph representation, which is capable of modelling interatomic relationships. This is achieved by using a combination of DimeNet [(Klicpera, et al., 2020a) and (Klicpera, et al., 2020b)] and explicit features that model interatomic relationships. Then, the chemical and geometric features are embedded using GraphConv (Morris, et al., 2019) and the outputs are passed to a final deep pipeline, which is composed of multiple GraphConv and ClusterGCNConv (Chiang, et al., 2019) layers. The last step exploits the outputs of the deep graph representation learning DGRL for the fragment and the deep graph representation learning DGRL for the protein to learn whether the fragment and the protein interact. This prediction relies again on binary classification.
Once the possible surface regions where the binding will occur are identified, the aim is to predict whether a protein and a fragment will interact. Again, the procedure is similar to the previous section: the necessary pre-processing has been performed for the protein and the fragment in the ternary complex, i.e., they are represented as the respective graphs, and the geometric and chemical features are embedded in them. The resulting graph embedding is processed by the main deep graph representation learning DGRL pipeline, which predicts a binary label indicating whether the protein interacts with the ligand or not.
To train this pipeline, a dataset of interacting proteins and ligands is used. For each protein-ligand pair, the ground truth of whether the pair does in fact interact is used in order to train the deep graph representation learning DGRL pipeline to recognize what constitutes an interaction and what does not. The elaborated procedure may be considered as "fuzzy" docking, where no Root-Mean-Square-Deviation (RMSD) values are predicted as part of the inference, but rather a simple binary classification indicating whether the two molecules would interact or not.
Figure 6 shows a protein-protein interaction module. This interaction is modelled similarly to the cases of interaction site identification and fragment-protein interaction. To be precise, the pipeline that was used to determine the interaction site on a single protein, as shown in Figure 4, is applied to both proteins in parallel. To achieve the desired effect, the loss function is adjusted to make sure that the pipeline learns to model the interactions between proteins.
As previously, the learned interaction is not quantified in terms of a continuous value like the RMSD, but rather by a binary classification indicating the interaction between the respective pair of proteins.
When designing degrader molecules as well as computing ternary complexes to be formed, determining the structure of the protein-protein PP complex that is formed in the presence of degrader molecules is a major step.
For the determination of protein-protein PP complexes, due to the complexity of protein molecules, many possible interactions and potential conformations between them must be considered. This might lead to a computationally intractable number of potential protein-protein PP complexes. The presence of degrader molecules within a complex significantly alters these interactions and conformations.
For instance, assume that, for each protein, the corresponding interaction sites where a degrader fragment binds are known. An energetically favorable (stable) PP complex might feature a relative orientation for which the fragment interaction sites in each protein are too distant to be connected via linked fragments. Such a complex is then infeasible due to spatial constraints despite being energetically favorable.
State-of-the-art protocols for the calculation of ternary complex structures handle this problem by calculating the possible protein-protein PP complexes via blind PP docking and afterwards filtering the obtained complexes based on whether the degrader molecule can be placed within the complex in an energetically favorable manner.
However, the constraint which the degrader molecule imposes can be used to selectively sample the protein-protein PP complexes prior to computing their interaction. This approach thus enables the use of
more complex virtual screening alternatives via DGRL. The following Bayesian Optimization (Active Learning) loop as shown in Figure 7 can be used to sample protein-protein PP complexes in the presence of a degrader molecule in an efficient and automated form:
1. Random sampling of potential protein-protein complexes. Each complex is characterized by the relative rotation and translation (RRT) between the two constituent proteins, as well as the conformations of these proteins.
2. For each:
a. Calculate a fitness measure (score) that correlates to the strength of the protein-protein interaction given their current RRT and conformations. This is called the protein-protein interaction fitness (PPI-fitness). This fitness is calculated from the output of the Deep Interaction Prediction module in the present workflow.
b. Given the RRT of the proteins, the orientations and positions of the interacting degrader fragments are calculated as bound to their respective interaction sites. Given this placement of fragments and thereby their relative orientation (RRT), an additional constraint-fitness is calculated, which reflects the feasibility of linking the fragments with a geometrically and pharmacologically viable linker.
c. Then a combined-fitness is calculated that considers both the above scores (PPI-fitness and constraint-fitness) and gives an acceptable measure of how likely it is that the protein-protein complex will lead to a valid ternary complex structure.
3. Subsequently, a surrogate model (a Gaussian Process; see the explanation below) is fitted to predict the combined-fitness, using the scores obtained in the loop from step 2. An important property is that the surrogate function can report the uncertainty of its prediction.
4. A new set of conformations/orientations is selected for which the surrogate model lacks knowledge, i.e., expresses high uncertainty, or predicts a high score. This tradeoff between exploitation and exploration is managed by an acquisition strategy as shown below.
5. Loop from 2.
Some of the concepts in the loop above are clarified below.
The surrogate model is a model that takes as input the representation (i.e., RRT + NMA coordinates) of a particular protein-protein complex candidate and predicts the associated combined-fitness. It is trained using the actual combined-fitness values as data points. A Gaussian Process model is used, which can predict not only an estimate of the combined-fitness but also a reliable measure of the uncertainty in its estimate. The kernel function used for the Gaussian Process is the well-known Matern kernel, modified to handle the relative translations and rotations. This specific kernel function is not essential to the advantage proposed by this patent and can be substituted with any valid alternative in the representation space.
The acquisition strategy is a key aspect of a Bayesian Optimization BO loop and determines in what manner and to what extent exploration and exploitation are traded off. The fact that the surrogate model reports the uncertainty of its estimate is crucial here and allows principled decisions to be made regarding this tradeoff. Several standard acquisition strategies may be used, for instance, noisy Expected Improvement, Upper Confidence Bound, and Knowledge Gradient. These strategies are implemented by use of the openly available BoTorch framework (Balandat, et al., 2020).
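A minimal BoTorch sketch of the loop is shown below, assuming a recent BoTorch version, a 17-dimensional candidate encoding (translation, quaternion, normal-mode amplitudes), an Upper Confidence Bound acquisition, and a dummy `combined_fitness` that stands in for the scores supplied by the modules described above; the dimensions, iteration counts and bounds are illustrative only. SingleTaskGP uses a Matern kernel by default.

```python
# Minimal sketch of the Bayesian Optimization loop over PP-complex candidates.
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import UpperConfidenceBound
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

d = 17                                     # 3 translation + 4 quaternion + 10 NMA
bounds = torch.stack([torch.zeros(d, dtype=torch.double),
                      torch.ones(d, dtype=torch.double)])

def combined_fitness(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for PPI-fitness x constraint-fitness from the DGRL modules.
    return -((x - 0.5) ** 2).sum(dim=-1, keepdim=True)

train_x = torch.rand(8, d, dtype=torch.double)     # 1. random initial candidates
train_y = combined_fitness(train_x)

for _ in range(20):
    gp = SingleTaskGP(train_x, train_y)            # 3. fit the GP surrogate
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

    acq = UpperConfidenceBound(gp, beta=2.0)       # 4. exploration/exploitation tradeoff
    new_x, _ = optimize_acqf(acq, bounds=bounds, q=1, num_restarts=5, raw_samples=64)

    train_x = torch.cat([train_x, new_x])          # 2./5. evaluate and loop
    train_y = torch.cat([train_y, combined_fitness(new_x)])
```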
In order to sample potential protein-protein PP complexes, each complex candidate is represented by the RRT between the two constituent proteins, as well as by a vector representation of the conformations of these proteins. The relative translation is represented by a 3D vector between the centers of mass of the two proteins. The relative rotation is represented by a 4D normalized quaternion.
Each candidate complex is sampled by picking a random RRT and conformation using a uniform distribution over the above representation space. A uniformly random direction is picked for the translation, with the distance exponentially distributed. The rotations are selected uniformly at random.
To represent the conformations of the proteins, the technique of Normal Mode Analysis (NMA) is employed. The top 10 normal modes of vibration of each protein are calculated using the potential field of an elastic network model (e.g., an Anisotropic Network Model). Furthermore, the potential field is adjusted to account for the rigidity of the residues binding to the degrader fragments. In this manner, each conformation of a protein is specified by a 10-D vector which describes the extents by which the protein is distorted along the respective normal modes.
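A minimal sketch of drawing one random candidate in this representation is given below; the scale parameters of the exponential distance and of the normal-mode amplitudes are illustrative assumptions.

```python
# Minimal sketch: one random PP-complex candidate = translation + quaternion + NMA.
import numpy as np

rng = np.random.default_rng(0)

# Relative translation: uniform direction, exponentially distributed distance.
direction = rng.normal(size=3)
direction /= np.linalg.norm(direction)
translation = direction * rng.exponential(scale=30.0)     # illustrative scale (Angstrom)

# Relative rotation: normalising a 4-D Gaussian sample yields a uniform unit quaternion.
quat = rng.normal(size=4)
quat /= np.linalg.norm(quat)

# Conformation: amplitudes along the top 10 normal modes.
nma_amplitudes = rng.normal(scale=1.0, size=10)

candidate = np.concatenate([translation, quat, nma_amplitudes])   # 17-D representation
```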
Convolutional neural networks are applied, which operate on graph representations of the protein molecules to predict the score. These representations account for the geometric and chemical properties in order to predict features that are subsequently processed to eventually yield a measure of the interaction strength.
Two ways of calculating a score that is informative regarding the feasibility of a degrader molecule, given the RRT and conformations of the two proteins, are applied.
In the first method, a Deep Linker Generation model is used, which takes as input the coordinates of the fragments as bound to the respective proteins in their respective positions and orientations, and thereby the fragments' relative orientation (RRT). The model then generates a linker that joins the two fragments. This linker is then scored on the basis of any number of pharmacological constraints such as toxicity and drug-likeness. Additionally, through the use of the Deep Molecular Conformation Generation module, the geometric viability of the linker is determined. Together, this provides the constraint-fitness.
When estimating the ternary complex for a pre-designed degrader, a deep learning-based approach (Deep Molecular Conformation Generation) is used to generate a large dataset (> 100000 datapoints) of energetically stable (low energy) degrader conformations, including the two fragments and the linker.
Each generated degrader conformation is characterized by the relative rotation and translation (RRT) between its two fragments, and the distribution of valid conformations over the RRT space is learned. For instance, one may fit a mixture of Gaussians using expectation maximization. Hence, given the RRT of the two proteins, and since the binding pocket for each of the degrader fragments is known, the RRT between the degrader fragments can be computed. The learned distribution function can then be used to compute the constraint score.
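A minimal sketch of this idea with a Gaussian mixture model is given below. The 7-D RRT feature layout, the number of mixture components and the random data are illustrative assumptions; the resulting log-likelihood can be normalized and combined with the PPI-fitness as described next.

```python
# Minimal sketch: learn the distribution of fragment-fragment RRTs from generated
# degrader conformations and use its likelihood as a constraint score.
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder: RRT features (3 translation + 4 quaternion) from generated conformers.
rrt_dataset = np.random.randn(100_000, 7)

gmm = GaussianMixture(n_components=10, covariance_type="full", random_state=0)
gmm.fit(rrt_dataset)

def constraint_fitness(rrt_between_fragments: np.ndarray) -> float:
    """Log-likelihood of a fragment RRT under the learned conformer distribution."""
    return float(gmm.score_samples(rrt_between_fragments.reshape(1, -1))[0])

score = constraint_fitness(np.zeros(7))   # RRT implied by a candidate PP complex
```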
The combined-fitness can be any function of the PPI-fitness and the constraint-fitness that mimics a logical AND operation. This means that if either of the fitness scores indicates a particularly unfit protein-protein complex candidate, the combined-fitness must be low. For instance, if the PPI-fitness and the constraint-fitness are normalized to lie between 0 and 1, the product of these fitness scores would be a valid combined-fitness.
One of the key considerations in ternary complex determination is the stability and validity of the degrader molecule itself. In the Bayesian Optimization BO protocol, this is specified through the constraint-fitness. As previously described, one of the methods to achieve this is to analyze the dataset of stable (low energy) conformations of the degrader molecule. A method is therefore needed that can generate a large number (> 100 000) of conformations of a large molecule such as a degrader, which can have more than 60 atoms.
The problem of molecular conformation generation, i.e., predicting an ensemble of low energy 3D conformations based on a molecular graph, is traditionally treated with either stochastic or systematic approaches. The former are based on molecular dynamics (MD) simulations or Markov Chain Monte Carlo (MCMC) techniques. Stochastic algorithms can be accurate but are difficult to scale to larger molecules (e.g., proteins) as the runtime becomes prohibitive. Systematic (rule-based) methods, on the other hand, are based on the careful selection and design of torsion templates (torsion rules) and knowledge of rigid 3D fragments. These methods can be fast and generate conformations in the order of seconds. However, their predictions might become inaccurate for larger molecules, or for molecules that are not covered by any of these torsion rules. Therefore, systematic methods are fast, but they may not generalize.
The question that then arises is the following: How can the best of these two worlds be combined, i.e., how can ensembles of conformations be generated for larger molecules in an accurate and fast manner? According to the invention, an end-to-end trainable machine learning model that can handle and generate conformations is preferred. In addition, it models conformations in an SE(3) invariant manner, which means that the likelihood of a particular conformation is unaffected by rigid translation and rotation operations. This is a desirable inductive bias for molecular generation tasks, as molecules do not change if the entire molecule is translated or rotated. This model is based on a recently proposed machine learning technique, i.e., score-based generative models. The score is the gradient of the log density of the data distribution with respect to the data. Instead of learning a model that has minimum distance to the data distribution, a model that has minimum distance to the score (gradient) of the data distribution is learned. The score of the data distribution can be considered as a vector (gradient field) that guides the molecule towards stable (low energy) conformations, as shown in Figure 8.
It starts with a random initialization of the positions of the atoms of the molecule in 3D space, and the score (gradient) guides the molecule toward a suitable state. After an accurate score based on the training data has been learned, annealed Langevin dynamics can be leveraged to create an ensemble of stable conformations within a short amount of time. It is also possible to fix some parts of the molecule (the two fragments) and apply the gradient (score) only to other parts of the molecule (e.g., the linker) to generate constrained conformations. Using the ensembles of generated conformations, a function can be learned that predicts the likelihood of an energetically stable linker for a particular relative position and orientation of the fragments.
Figure 8 shows the Deep Molecular Conformation Generation from the 2D graph: The input is the graph, and the goal is to generate an ensemble of stable (low energy) 3D conformations. The process is initiated with random 3D coordinates for the molecule in 3D space, and in each iteration these coordinates change slightly towards a more stable conformation. The coordinate change is guided by a pseudo-force, which comes from the estimation of the score. The score is the gradient of the data distribution, and it is learned from the training data. Afterwards, this score is used to guide the atoms towards specific conformations through stochastic Langevin dynamics.
A machine learning model has been leveraged for generating conformations from input molecular graphs, so data has been used to train the model. The training data are the GEOM-QM9 and GEOM-DRUGS datasets (Axelrod & Gomez-Bombarelli, 2020), which consist of molecular graphs and corresponding ground truth conformations. QM9 contains smaller molecules (up to 9 heavy atoms), whereas DRUGS contains larger and drug-like molecules. Further information about the training dataset is given in Table 1. Tab. 1 shows the statistics of the GEOM data, which contains the QM9 and DRUGS datasets.
The problem is thus how this data can be leveraged to learn the conditional distribution of conformations given a graph, p(r | G; θ).
One traditional method for conformation generation is molecular dynamics (MD) simulation. It begins with a random initialization of the molecule in 3D space; then, based on interatomic potentials and forces, the atoms' positions change until some stable (low energy) conformations are reached. Different methods have been proposed for calculating and determining the interatomic potential. The most accurate method is based on Density Functional Theory (DFT), but the computation is very intensive and becomes prohibitively expensive for large molecules (with a high number of rotatable bonds). To alleviate this issue, empirical potentials (force fields) have been proposed. These empirical force fields are fast, but they are not very accurate. Machine learning methods have the potential to be fast and at the same time accurate.
The method that is used in the present example is based on score matching generative models that have recently been used in the machine vision domain for generating realistic images (Song & Ermon, 2019). The goal of a score-based generative model is to estimate the score (the gradient of the data distribution with respect to the data) by minimizing the score matching loss L(θ) = ½ E_{p_data(r)} [ ‖ s(r; θ) − ∇_r log p_data(r) ‖² ] (in practice, a denoising variant of this objective is minimized).
In the present case (a generative model for conformation generation), the gradient is learned with respect to the positions of the atoms in the molecule: s(r; θ) = ∇_r log p(r | G).
This gradient can be considered as a pseudo-force that guides the evolution of molecules towards stable (low energy) conformations. In most cases, because the support of the data distribution is sparse, a noise conditional score-based generative model is used (Song & Ermon, 2021). In this case, the goal is to estimate the score of the noise-perturbed data distribution q_σ(r̃) = ∫ p_data(r) N(r̃; r, σ²I) dr, i.e., s(r̃, σ; θ) ≈ ∇_r̃ log q_σ(r̃).
This alleviates some of the issues that can appear when generating and creating new samples.
The only missing ingredient is how the score network s(r; θ) is defined. The score network s(r; θ) can be anything that maps the input molecule to the gradient with respect to the input coordinates (the output is 3N-dimensional, where N is the number of atoms in the molecule).
According to the invention, the use of a message passing neural network (MPNN) (Battaglia, et al., 2018) as the score network is proposed, and the goal is to learn the parameters of this score network. The input to the MPNN is a molecule (graph) with nodes (atoms) and edges (bonds). In an MPNN, features are assigned to each node (e.g., atom type, atom position, aromaticity, charge) and edge (e.g., bond type).
Figure 9 shows message passing neural networks. Here φ^e, φ^v and φ^u are update functions for the edge (E), node (V) and global (u) features, respectively. ρ^(e→v) (reducing edges to nodes) and ρ^(v→u) (reducing vertices to global features) are aggregation functions that take a set of inputs and reduce it to a single element in a permutation invariant manner. These updates are described by the equations below:
e'_ij = φ^e(e_ij, v_i, v_j, u)
v'_i = φ^v(ρ^(e→v)({e'_ij}), v_i, u)
u' = φ^u(ρ^(v→u)({v'_i}), u)
The score network, as shown in Figure 10, is an MPNN that updates the edge and node features at each step. The output is three coordinates for each node, which represent the pseudo-force (gradient) that changes the position of each node.
An MPNN layer updates the edge features e_ij and the node features v_i and computes a global feature u at each step. A series of MPNN layers can be used to update the edge and node features hierarchically. At each layer, the edge features can be updated using a learned function of the current edge feature as well as the node features of the connected nodes. Then, for each node, the edge features of the connected edges can be aggregated and the node features updated using a learned function of this aggregation. At the end, the global features (which belong to the whole graph, in this case the molecule) are updated.
Here, ρ^(e→v) denotes a differentiable, permutation invariant aggregation function, e.g., sum, mean or max, and φ^e, φ^v, φ^u denote differentiable functions whose parameters can be learned, such as MLPs (Multi-Layer Perceptrons). In the present method, element-wise summation is used as the aggregation function and MLPs are used as the differentiable functions. At the end, the features of each node are processed via a final readout MLP (weights shared across nodes) to produce a three-valued output, which has the form of a gradient with respect to the Cartesian coordinates of that particular node. The network is trained to reproduce the score function in these outputs. The network is trained with the GEOM-DRUGS data to learn a valid score function for larger molecules.
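A minimal sketch of one such message-passing layer is given below. It uses MLP update functions, summation aggregation and a per-node 3-D readout, but omits the global-feature update for brevity; the feature sizes and depth are illustrative assumptions and not the trained architecture.

```python
# Minimal sketch: a single message-passing layer acting as a score network,
# mapping node/edge features to a 3-D pseudo-force per node.
import torch
import torch.nn as nn

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.SiLU(), nn.Linear(hidden, out))

class ScoreMPNN(nn.Module):
    def __init__(self, node_dim=16, edge_dim=8):
        super().__init__()
        self.phi_e = mlp(edge_dim + 2 * node_dim, edge_dim)   # edge update
        self.phi_v = mlp(node_dim + edge_dim, node_dim)       # node update
        self.readout = mlp(node_dim, 3)                       # per-node 3-D score

    def forward(self, v, e, edge_index):
        src, dst = edge_index                                  # (2, E)
        # Edge update from the current edge feature and both endpoint features.
        e = self.phi_e(torch.cat([e, v[src], v[dst]], dim=-1))
        # Permutation-invariant sum of incoming edge messages per node.
        agg = torch.zeros(v.size(0), e.size(-1)).index_add_(0, dst, e)
        v = self.phi_v(torch.cat([v, agg], dim=-1))
        return self.readout(v)                                 # (N, 3) pseudo-forces

# Toy usage: 20 atoms, 50 directed edges.
v = torch.randn(20, 16); e = torch.randn(50, 8)
edge_index = torch.randint(0, 20, (2, 50))
score = ScoreMPNN()(v, e, edge_index)
```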
For the generation phase, the process starts from random 3D coordinates, which are updated sequentially based on the learned score to arrive at an ensemble of low energy conformations.
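The sketch below shows the standard annealed Langevin dynamics update used in this kind of generation phase. The `score_fn` is a placeholder for the trained noise-conditional score network, and the noise schedule, step size and iteration counts are illustrative assumptions.

```python
# Minimal sketch: annealed Langevin dynamics over a decreasing noise schedule.
import torch

def score_fn(r: torch.Tensor, sigma: float) -> torch.Tensor:
    # Placeholder for the trained noise-conditional score network s(r, sigma).
    return -r / (sigma ** 2)

def annealed_langevin(n_atoms, sigmas=(10.0, 5.0, 2.0, 1.0, 0.5), eps=1e-4, steps=50):
    r = torch.randn(n_atoms, 3) * sigmas[0]          # random 3-D initialisation
    for sigma in sigmas:                             # anneal from coarse to fine noise
        step = eps * (sigma / sigmas[-1]) ** 2       # per-level step size
        for _ in range(steps):
            noise = torch.randn_like(r)
            r = r + 0.5 * step * score_fn(r, sigma) + (step ** 0.5) * noise
    return r                                         # one sampled conformation

conformer = annealed_langevin(n_atoms=60)
```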
In the following the generation of novel linkers for given fragments in order to form complete degraders is described. The input is given in the form of graph representations of the fragments to be linked, e.g., as SMILES strings, augmented with structural input in the form of the relative spatial orientation (RRT) of the fragments. (Figure 11)
Given the above information of the fragments as input, the essence of the procedure discussed in this subsection is to connect them by generating novel linker graphs. Each generated linker graph corresponds
to a complete degrader graph when considered with the two fragments. By assessing the obtained degrader graphs with graph/pharmaceutical metrics such as uniqueness, chemical validity, quantitative estimate of drug-likeness, synthetic accessibility, toxicity, solubility, ring aromaticity, and/or pan-assay interference compounds, a fitness score can be reported to the surrogate model.
Now referring exclusively to the fitness scoring procedure, additional quality metrics can be reported to the surrogate model using the structural information provided as input. To do so, a 3D structure needs to be established from the generated degrader graphs. Traditional methods, such as those mentioned above, could be employed here. Alternatively, and more advantageously, the in-house Deep Molecular Conformation Generation method from the corresponding section is applied. The quality metrics reported to the surrogate model are then based on comparisons (e.g., RMSD) between the structural input, which serves as a target to reach (i.e., a label in machine learning jargon), and the 3D coordinates established from the generated degrader graphs.
Additionally, the energy, as determined either by classical methods such as force fields or by dedicated machine learning algorithms, normalized per degree of freedom of the molecule, presents itself as a viable measure of the validity of the degrader, since it reports on the molecule's strain. Finally, it has to be noted that the model, by removing the relative orientations from its architecture, can generate linker graphs without any structural information as input. In that case, however, the quality of the generated linkers is expected to be lower.
The model is inspired by DeLinker (Imrie, et al., 2020), with most fundamental differences listed at the bottom of this section.
The model is a Variational Autoencoder (VAE), whereby both the encoder as well as the decoder are implemented via standard Gated Graph Neural Networks (GGNN). The decoder takes as input a set of latent variables and generates a linker to connect the input fragments. The encoder, on the other hand, imposes a distribution over the latent variables that is conditioned on the graph and structure of the unlinked input fragments.
For training the model, the fragment graph X is processed using the encoder GGNN, yielding the set of latent variables zv, one for each node (atom) in the graph. Additionally, to allow more control over the generative process, the decoder is fed a low-dimensional latent vector z derived via a learned mapping from the node embeddings of the label (ground truth) degrader (i.e., the target degrader that is supposed to be generated). Loosely speaking, this allows the decoder to learn to generate different "types" of linkers conditioned on z (i.e., via a conditioned multi-modal distribution). By using zv and z, the model can be augmented to learn a prediction of constraints such as toxicity and the like. Then, during runtime, by optimizing over z and zv, the decoder can improve the quality of the generated linkers with respect to these constraints. During training, both z and zv are regularized to approximate the standard normal distribution.
To generate the linker given z and zv, a set of candidate atoms is added to the graph and initialized with random node features. Using these features, the atom types are initialized. At each step of the generation, the features zv, z, the atom types lv, and the features and types of the candidate atoms are used to generate one bond (of any type) connecting an unconnected candidate node to an already connected candidate node in the graph. The valency of the already connected node also affects the choice of the bond. Bonds continue to be chosen for this node until a bond to a special "STOP" atom is picked, at which point the next connected atom in the queue is chosen. This queue is created and traversed in a breadth-first manner. Note that every bond that is selected changes the graph V. This means that the features zv, lv are recomputed in each iteration.
During generation, z can be drawn from a standard normal distribution, and noise is added to the encoding of X to calculate zv. Note that, if the properties mentioned below are learned during training as a function of z and zv, then during generation one can optimize over z and zv to condition the model to generate degraders of better quality. Properties such as a quantitative estimate of drug-likeness, synthetic accessibility, toxicity, solubility, ring aromaticity, and/or pan-assay interference compounds are considered in this context.
Finally, the key points differentiating the model from other generative approaches are listed in the following: Firstly, as discussed above, 3D coordinates are generated from the obtained degrader graphs using the Deep Molecular Conformation Generation module. Secondly, the graph generation process is fed with structural information of higher quality.
Figure 12 illustrates the structural information provided, i.e., the fragments' relative orientation. This allows the model to interface directly with the RRT coordinates used in the Bayesian Optimization pipeline. The relative orientation coordinates fed to the Deep Linker Generation model are illustrated as follows: the two rings represent the fragments of a degrader; the distance from atom L1 to atom L2, the angle α1 between the vectors E1-L1 and L1-L2, the angle α2 between the vectors L1-L2 and L2-E2, and the dihedral angle φ (spanned by all three mentioned vectors) are processed by the model as structural information.
Acknowledging that E1-L1 and E2-L2 constitute rotatable bonds by design of the graph generation model, the following bond-angle-torsion coordinates completely specify the relative orientation of the fragments: the lengths E1-L1, L1-L2 and L2-E2, the bond angles α1 and α2, and the dihedral angle φ. In comparison to the pseudo-bond length L1-L2, the physical bond lengths hardly vary. Furthermore, the atom types of L1 and L2 are not available prior to the graph generation process but are modeled as placeholder atoms. Thus, the model is not fed with the bond lengths L1-E1 and L2-E2. Also, dihedral angles are circular, rendering it difficult for the model to learn from this feature due to the circular discontinuity. Therefore, instead of feeding the model φ directly, the cosine and sine of φ are provided, both of which are continuous. Note that the mapping of φ to the pair of its angular functions sine and cosine is bijective.
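A minimal sketch of computing these relative-orientation features from the four anchor/exit atom positions is given below. The coordinates are placeholders, and the angle and dihedral definitions follow the standard bond-angle-torsion conventions assumed here.

```python
# Minimal sketch: distance L1-L2, angles alpha1/alpha2, and dihedral phi
# encoded as (cos phi, sin phi) from four atom positions.
import numpy as np

def angle(a, b):
    cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def dihedral(b1, b2, b3):
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))

E1, L1 = np.array([0.0, 0.0, 0.0]), np.array([1.5, 0.0, 0.0])   # placeholder coordinates
L2, E2 = np.array([5.0, 1.0, 0.5]), np.array([6.2, 1.8, 0.3])

d_L1L2 = np.linalg.norm(L2 - L1)             # pseudo-bond length L1-L2
alpha1 = angle(E1 - L1, L2 - L1)             # angle between E1-L1 and L1-L2
alpha2 = angle(L1 - L2, E2 - L2)             # angle between L1-L2 and L2-E2
phi = dihedral(L1 - E1, L2 - L1, E2 - L2)    # dihedral over the three vectors
features = [d_L1L2, alpha1, alpha2, np.cos(phi), np.sin(phi)]
```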
Claims
Patent Claims
1. A computer implemented, machine learning based method for determining ternary complexes in targeted protein degradation, by representing biomolecules as graphs and then feeding these graphs as inputs into a machine learning system, comprising steps of
- determination of the 3D structure of relevant proteins (1)
- determination of the interactions between each fragment of the degrader and the corresponding proteins as well as identification of the corresponding interaction (2)
- Protein-Protein complex prediction (3)
- Refinement of the ternary complex, with the designed linker (4).
2. Computer implemented method, according to claim 1, wherein Bayesian Optimization is used to sample protein-protein complexes for ternary complex determination in targeted protein degradation.
3. Computer implemented method, according to claim 2, wherein relative rotations and translations are used to represent the space of protein-protein complexes to optimize over.
4. Computer implemented method, according to claims 2 or 3, wherein docking is used as the oracle for the determination of the protein-protein interaction.
5. Computer implemented method, according to one of claims 2 to 4, wherein deep graph representation learning is used as the oracle for the determination of the protein-protein interaction.
6. Computer implemented method, according to one of claims 2 to 5, wherein a learned distribution over a dataset of generated degrader conformers is used as a fitness function for a candidate protein-protein complex in the Bayesian Optimization loop.
7. Computer implemented method, according to one of claims 2 to 6, wherein the fitness of a generated linker is used as a fitness function for a candidate protein-protein complex in the Bayesian Optimization loop.
8. Computer implemented method, according to one of claims 1 to 7, wherein a molecular graph representation of a biomolecule is fed into a deep interaction prediction model for ternary complex determination in targeted protein degradation.
9. Computer implemented method, according to one of claims 1 to 8, wherein deep graph representation learning is used to predict interactions for ternary complexes and hence docking scores which represent the fitness of each ternary complex in targeted protein degradation.
10. Computer implemented method, according to claim 9, wherein protein-protein interactions are determined via machine learning.
11. Computer implemented method, according to claim 9 or 10, wherein fragment-protein interactions are determined via machine learning.
12. Computer implemented method, according to claim 6, wherein deep molecular conformation generation is used to generate the dataset of molecular conformers.
13. Computer implemented method, according to claim 14, wherein deep molecular conformation generation is used to validate the linkers.
14. Computer implemented method, according to claim 2, of generating valid linkers via deep learning given a particular relative position and orientation of the degrader fragments.
15. Computer implemented method, according to claim 1, wherein biomolecules are generated initiating the process of targeted protein degradation.
16. Computer implemented system, prepared for the execution of the methods according to claims 1 to 15, characterized by the use of functional modules as "Deep Interaction Prediction" (DIP), "Deep Linker Generation" (DLG), "Deep Molecular Conformation Generation" (DMCG).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
ATA138/2021 | 2021-08-12 | ||
AT1382021 | 2021-08-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023016621A1 true WO2023016621A1 (en) | 2023-02-16 |
Family
ID=78078171
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2021/025372 WO2023016621A1 (en) | 2021-08-12 | 2021-09-29 | Ternary complex determination for plausible targeted protein degradation using deep learning and design of degrader molecules using deep learning |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023016621A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109785902A (en) | 2019-02-20 | 2019-05-21 | 成都分迪科技有限公司 | A kind of prediction technique of ubiquitination degradation target protein |
US20200190136A1 (en) * | 2017-06-09 | 2020-06-18 | Dana-Farber Cancer Institute, Inc. | Methods for generating small molecule degraders and dimerizers |
2021
- 2021-09-29 WO PCT/EP2021/025372 patent/WO2023016621A1/en unknown
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200190136A1 (en) * | 2017-06-09 | 2020-06-18 | Dana-Farber Cancer Institute, Inc. | Methods for generating small molecule degraders and dimerizers |
CN109785902A (en) | 2019-02-20 | 2019-05-21 | 成都分迪科技有限公司 | A kind of prediction technique of ubiquitination degradation target protein |
Non-Patent Citations (6)
Title |
---|
DANIEL ZAIDMAN, JAIME PRILUSKY, NIR LONDON, J. CHEM. INF. MODEL., vol. 60, no. 10, 2020, pages 4894 - 4903
FANG YANG ET AL: "Graph-based prediction of Protein-protein interactions with attributed signed graph embedding", BMC BIOINFORMATICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 21, no. 1, 21 July 2020 (2020-07-21), pages 1 - 16, XP021279593, DOI: 10.1186/S12859-020-03646-8 * |
IMRIE FERGUS ET AL: "Deep Generative Models for 3D Linker Design", JOURNAL OF CHEMICAL INFORMATION AND MODELING, vol. 60, no. 4, 20 March 2020 (2020-03-20), US, pages 1983 - 1995, XP055916311, ISSN: 1549-9596, Retrieved from the Internet <URL:http://pubs.acs.org/doi/pdf/10.1021/acs.jcim.9b01120> DOI: 10.1021/acs.jcim.9b01120 * |
LIM SANGSOO ET AL: "A review on compound-protein interaction prediction methods: Data, format, representation and model", COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, vol. 19, 1 January 2021 (2021-01-01), Sweden, pages 1541 - 1556, XP055916146, ISSN: 2001-0370, DOI: 10.1016/j.csbj.2021.03.004 * |
MICHAEL L. DRUMMOND, CHRISTOPHER I. WILLIAMS, J. CHEM. INF. MODEL., vol. 59, no. 4, 2019, pages 1634 - 1644
MUNETOMO MASAHARU ET AL: "An automated ligand evolution system using Bayesian optimization algorithm", WSEAS TRANSACTIONS ON INFORMATION SCIENCE AND APPLICATIONS, 1 May 2009 (2009-05-01), pages 788 - 797, XP055916195, Retrieved from the Internet <URL:https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.501.5347&rep=rep1&type=pdf> [retrieved on 20220428] * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kim et al. | Computational and artificial intelligence-based methods for antibody development | |
Dhakal et al. | Artificial intelligence in the prediction of protein–ligand interactions: recent advances and future directions | |
Zhang et al. | EDock: blind protein–ligand docking by replica-exchange monte carlo simulation | |
Aggarwal et al. | DeepPocket: ligand binding site detection and segmentation using 3D convolutional neural networks | |
Soleymani et al. | Protein–protein interaction prediction with deep learning: A comprehensive review | |
Kozlovskii et al. | Spatiotemporal identification of druggable binding sites using deep learning | |
Ragoza et al. | Protein–ligand scoring with convolutional neural networks | |
Schneidman-Duhovny et al. | Predicting molecular interactions in silico: II. Protein-protein and protein-drug docking | |
WO2017196963A1 (en) | Computational method for classifying and predicting protein side chain conformations | |
Dalkas et al. | SEPIa, a knowledge-driven algorithm for predicting conformational B-cell epitopes from the amino acid sequence | |
Sunny et al. | Protein–protein docking: Past, present, and future | |
Pan et al. | Introduction to protein structure prediction: methods and algorithms | |
US20220406403A1 (en) | System and method for generating a novel molecular structure using a protein structure | |
Pencheva et al. | AMMOS: automated molecular mechanics optimization tool for in silico screening | |
Pérez de Alba Ortíz et al. | The adaptive path collective variable: a versatile biasing approach to compute the average transition path and free energy of molecular transitions | |
Kotelnikov et al. | Sampling and refinement protocols for template-based macrocycle docking: 2018 D3R Grand Challenge 4 | |
Ozdemir et al. | Developments in integrative modeling with dynamical interfaces | |
Song et al. | Protein–ligand docking using differential evolution with an adaptive mechanism | |
Tao et al. | Docking cyclic peptides formed by a disulfide bond through a hierarchical strategy | |
Ugurlu et al. | Cobdock: an accurate and practical machine learning-based consensus blind docking method | |
Maheshwari et al. | Across-proteome modeling of dimer structures for the bottom-up assembly of protein-protein interaction networks | |
WO2023016621A1 (en) | Ternary complex determination for plausible targeted protein degradation using deep learning and design of degrader molecules using deep learning | |
CN116758978A (en) | Controllable attribute totally new active small molecule design method based on protein structure | |
Metcalf et al. | Directional ΔG Neural Network (DrΔG-Net): A Modular Neural Network Approach to Binding Free Energy Prediction | |
Chen et al. | Is fragment-based graph a better graph-based molecular representation for drug design? A comparison study of graph-based models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21786332; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/06/2024) |