WO2023027279A1

WO2023027279A1 - Method for predicting whether or not atom inside chemical structure binds to kinase

Info

Publication number: WO2023027279A1
Application number: PCT/KR2022/003743
Authority: WO
Inventors: 정종영
Original assignee: 디어젠 주식회사
Priority date: 2021-08-27
Filing date: 2022-03-17
Publication date: 2023-03-02

Abstract

The present disclosure relates to a method for predicting whether or not a compound binds to a hinge of the active site of a kinase, the method comprising the steps of: generating a feature vector representing information about the surrounding environment of each of the atoms of the compound on the basis of the chemical structure of the compound; and classifying, on the basis of the feature vector, whether or not each atom of the compound binds to a hinge region of the kinase.

Description

Method for predicting kinase binding of atoms in chemical structure

The present disclosure relates to a method for predicting whether a compound binds, and more specifically, to a method for predicting whether each atom of a compound binds to a kinase.

A kinase is a protein (enzyme) that mediates cell signal transduction related to cell growth, cell death, inflammation, and metabolism by transferring the γ-phosphate of ATP to other proteins, and kinases are major target proteins to be considered in the development of anticancer drugs. In order to develop a kinase-targeted anticancer drug, it is necessary to selectively inhibit the target kinase and at the same time design the compound to have a structure different from known kinase-binding compounds. Here, the differentiation of a novel kinase-binding compound is determined by the scaffold structure that binds to the hinge region of the kinase active site. For example, hydrogen bonds with the backbones of the first (gk+1) and third (gk+3) amino acid residues from the gatekeeper (gk) amino acid residue are the main interactions in kinase binding.

In addition, the parent structure of the kinase binding compound should be designed to reflect the steric complementarity inside the kinase active site while forming the main hydrogen bonds mentioned above.

Therefore, a kinase-binding compound with a novel structure should be designed in consideration of steric hindrance with the kinase by the chemical structure around the hydrogen bonding site, including whether or not local hydrogen bonds are formed.

The present disclosure aims to determine the possibility that a compound binds to the hinge of the kinase active site.

The purpose of the present disclosure is not limited to the above-mentioned purpose, and other objects and advantages of the present disclosure not mentioned above can be understood by the following description and will be more clearly understood by the embodiments of the present disclosure. Further, it will be readily apparent that the objects and advantages of the present disclosure may be realized by means of the instrumentalities and combinations indicated in the claims.

According to an embodiment of the present disclosure for realizing the above object, a binding prediction method performed by a computing device including at least one processor is disclosed. The method may include generating a feature vector representing surrounding-environment information of each atom of the compound based on the chemical structure of the compound; and classifying whether each atom of the compound binds to a hinge region of the kinase based on the feature vector.

In an alternative embodiment, the generating of the feature vector may include generating a feature vector expressing a spatial distribution of each atom of the compound and its surrounding atoms using an atom-centered symmetry function.
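As an illustration of describing each atom by the spatial distribution of its neighbours, the following minimal sketch computes a radial symmetry value per atom. The functional form (a Gaussian multiplied by a cosine cutoff), the parameters `eta`, `r_s`, and `r_c`, and all function names are assumptions made for illustration; the disclosure does not specify them in this passage.

```python
import math

def cosine_cutoff(r, r_c):
    """Decay smoothly from 1 at r = 0 to 0 at r = r_c; zero beyond r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def radial_symmetry(coords, i, eta, r_s, r_c):
    """Sum over neighbours j of exp(-eta * (r_ij - r_s)^2) * f_c(r_ij)."""
    total = 0.0
    for j, position in enumerate(coords):
        if j == i:
            continue
        r_ij = math.dist(coords[i], position)
        total += math.exp(-eta * (r_ij - r_s) ** 2) * cosine_cutoff(r_ij, r_c)
    return total

def atom_feature_vectors(coords, etas=(0.5, 1.0, 2.0), r_s=0.0, r_c=6.0):
    """Return one feature vector per atom, one entry per (assumed) eta value."""
    return [
        [radial_symmetry(coords, i, eta, r_s, r_c) for eta in etas]
        for i in range(len(coords))
    ]

# Hypothetical coordinates in angstroms: two close atoms and one distant atom.
example_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
example_features = atom_feature_vectors(example_coords)
```

An atom with no neighbours inside the cutoff radius receives an all-zero vector, so the vector depends only on the local environment of the atom.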

In an alternative embodiment, the method may further include classifying whether steric hindrance occurs between each atom of the compound and the hinge region of the kinase using a machine learning classification model based on the feature vector.

In an alternative embodiment, the machine learning classification model may be trained based on the Protein Data Bank (PDB), with its hyperparameters optimized using Bayes' rule.

In an alternative embodiment, the generating of the feature vector may include generating a feature vector in a latent space using a graph neural network (GNN) based on the compound, and the classifying of whether binding occurs may include classifying whether the compound binds to the kinase using the GNN based on the feature vector.

In an alternative embodiment, the GNN may be trained based on the Protein Data Bank (PDB), and the PDB data may include at least one of: a two-dimensional graph including connection information between atoms; or a three-dimensional graph including spatial information between atoms.
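To make the idea of a graph-based feature vector concrete, the following sketch performs one round of neighbourhood aggregation on a two-dimensional molecular graph (atoms as nodes, bonds as edges). It is a simplified stand-in without learned weights; the function name and the averaging rule are assumptions, not the GNN architecture of the disclosure.

```python
def message_passing_step(node_features, edges):
    """One aggregation round on an undirected graph.

    Each node's new vector is the average of its own vector and the mean of
    its neighbours' vectors. A trained GNN would apply learned weights here.
    """
    n = len(node_features)
    neighbours = {i: [] for i in range(n)}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    updated = []
    for i in range(n):
        own = node_features[i]
        if not neighbours[i]:
            # Isolated node: nothing to aggregate, keep its own vector.
            updated.append(list(own))
            continue
        count = len(neighbours[i])
        mean = [
            sum(node_features[j][k] for j in neighbours[i]) / count
            for k in range(len(own))
        ]
        updated.append([(own[k] + mean[k]) / 2 for k in range(len(own))])
    return updated

# Hypothetical three-atom chain 0-1-2 with two-dimensional initial features.
chain_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
chain_edges = [(0, 1), (1, 2)]
chain_updated = message_passing_step(chain_features, chain_edges)
```

Repeating the step lets information travel further along the bonds, which is how a node vector comes to reflect a wider chemical environment.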

In an alternative embodiment, the method may further include generating a library by filtering the compounds whose binding properties are classified according to a predetermined criterion.
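The filtering step can be pictured as follows. This is a minimal sketch in which the "predetermined criterion" is assumed to be a score threshold; the record fields `compound_id`, `binds`, and `score`, and the identifiers themselves, are hypothetical.

```python
def build_library(classified, min_score=0.5):
    """Keep compounds classified as binding whose score meets the threshold."""
    library = []
    for record in classified:
        if record["binds"] and record["score"] >= min_score:
            library.append(record["compound_id"])
    return library

# Made-up classification results, for illustration only.
example_records = [
    {"compound_id": "cmpd-1", "binds": True, "score": 0.9},
    {"compound_id": "cmpd-2", "binds": True, "score": 0.3},
    {"compound_id": "cmpd-3", "binds": False, "score": 0.8},
]
example_library = build_library(example_records)
```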

In an alternative embodiment, the method may further include outputting a loss value using a molecular generation deep learning model based on the compounds classified as to whether or not they bind.

In an alternative embodiment, the classifying of whether each atom of the compound binds to the hinge region of the kinase based on the feature vector may include: generating a predicted value for each class indicating a type of binding; and determining the type of binding between each atom and the hinge region of the kinase based on the class having the highest predicted value.

In an alternative embodiment, the classes may include, for each atom in the compound, at least one of: a class indicating whether a hydrogen bond is formed with the first amino acid residue (gk+1) from the amino acid residue corresponding to the gatekeeper; or a class indicating whether a hydrogen bond is formed with the third amino acid residue (gk+3) from the gatekeeper.
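The per-class prediction and the selection of the highest-valued class can be sketched as below. The class names (including a separate `no_bond` class), the use of a softmax, and the raw scores are assumptions for illustration; only the "pick the class with the highest predicted value" rule comes from the text.

```python
import math

# Assumed class set: no hydrogen bond, bond with gk+1, bond with gk+3.
CLASSES = ("no_bond", "hbond_gk+1", "hbond_gk+3")

def softmax(scores):
    """Convert raw scores to values that sum to 1 (numerically stable)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_atoms(atom_scores):
    """Return, for each atom, the class with the highest predicted value."""
    labels = []
    for scores in atom_scores:
        probabilities = softmax(scores)
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        labels.append(CLASSES[best])
    return labels

# Made-up raw scores for three atoms.
example_labels = classify_atoms([[2.0, 0.1, 0.1], [0.0, 3.0, 1.0], [0.0, 0.5, 2.5]])
```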

According to an embodiment of the present disclosure for realizing the above object, an apparatus is disclosed. The apparatus may include a processor including one or more cores; and a memory. The processor may generate a feature vector including surrounding-environment information of each atom of the compound based on the chemical structure of the compound, and may classify whether each atom of the compound binds to the hinge region of the kinase based on the feature vector.

According to an embodiment of the present disclosure for realizing the above object, a computer program stored in a computer-readable storage medium is disclosed. The computer program performs operations for predicting whether binding occurs, and the operations include: generating a feature vector including information about the surrounding environment of each atom of the compound based on the chemical structure of the compound; and classifying whether each atom of the compound binds to a hinge region of the kinase based on the feature vector.

The present disclosure enables the design of a kinase-binding compound by determining the possibility that a compound binds to the hinge of the kinase active site. In designing a kinase-binding compound, binding can be predicted by classifying whether each atom of the compound binds to the hinge region of the kinase, so that the chemical structure around the hydrogen bonding site does not cause steric hindrance with the kinase.

FIG. 1 is a block diagram of a computing device for predicting whether binding occurs according to an embodiment of the present disclosure.

FIG. 2 is a schematic diagram for explaining the kinase hinge region and steric hindrance of a compound prior to describing an embodiment of the present disclosure.

FIG. 3 is a schematic diagram illustrating a network function according to an embodiment of the present disclosure.

FIG. 4 is a flowchart illustrating a method of predicting whether each atom of a compound binds to a hinge region of a kinase according to an embodiment of the present disclosure.

FIG. 5 is a schematic diagram illustrating a method in which a processor predicts whether binding occurs using a machine learning classification model according to an embodiment of the present disclosure.

FIG. 6 is a schematic diagram illustrating a method in which a processor predicts whether binding occurs using a GNN according to an embodiment of the present disclosure.

FIG. 7 is a flowchart illustrating a method for generating a library based on compounds whose binding properties are classified according to an embodiment of the present disclosure.

FIG. 8 is a simplified and general schematic diagram of an exemplary computing environment in which embodiments of the present disclosure may be implemented.

Various embodiments are now described with reference to the drawings. In this specification, various descriptions are presented to provide an understanding of the present disclosure. However, it is apparent that these embodiments may be practiced without these specific details.

The terms “component,” “module,” “system,” and the like, as used herein, refer to a computer-related entity: hardware, firmware, software, a combination of software and hardware, or software in execution. For example, a component may be, but is not limited to, a procedure running on a processor, a processor, an object, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device may be components. One or more components may reside within a processor and/or thread of execution. A component may be localized within a single computer or distributed between two or more computers. Also, these components can execute from various computer-readable media having various data structures stored thereon. Components may communicate via local and/or remote processes, for example according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or distributed system, and/or data transmitted to other systems over a network such as the Internet by way of the signal).

In addition, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless otherwise specified or clear from the context, “X uses A or B” is intended to mean any of the natural inclusive permutations. That is, “X uses A or B” applies in any of the following cases: X uses A; X uses B; or X uses both A and B. Also, the term "and/or" as used herein should be understood to refer to and include all possible combinations of one or more of the listed related items.

Also, the terms "comprises" and/or "comprising" should be understood to mean that the corresponding features and/or components are present. However, it should be understood that the terms "comprises" and/or "comprising" do not exclude the presence or addition of one or more other features, elements, and/or groups thereof. Also, unless otherwise specified or unless the context clearly indicates a singular form, the singular in this specification and claims should generally be construed to mean "one or more".

In addition, the term “at least one of A or B” should be interpreted as meaning “when only A is included”, “when only B is included” and “when A and B are combined”.

Skilled artisans will further understand that the various illustrative logical blocks, components, modules, circuits, means, logics, and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, configurations, means, logics, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented in hardware or as software depends on the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. However, such implementation decisions should not be interpreted as causing a departure from the scope of this disclosure.

The description of the presented embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art of this disclosure. The general principles defined herein may be applied to other embodiments without departing from the scope of this disclosure. Thus, the present disclosure is not limited to the embodiments presented herein. This disclosure is to be interpreted in the widest light consistent with the principles and novel features presented herein.

In the present disclosure, network functions, artificial neural networks, and neural networks may be used interchangeably.

The configuration of the computing device 100 shown in FIG. 1 is only a simplified example. In one embodiment of the present disclosure, the computing device 100 may include other components for performing a computing environment of the computing device 100, and only some of the disclosed components may constitute the computing device 100.

The computing device 100 may include a processor 110, a memory 130, and a network unit 150.

The processor 110 may include one or more cores, and may include a processor for data analysis and deep learning, such as a central processing unit (CPU), a general purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computing device. The processor 110 may read a computer program stored in the memory 130 and process data for machine learning according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, the processor 110 may perform operations for training a neural network. The processor 110 may perform calculations for neural network training, such as processing input data for training in deep learning (DL), extracting features from input data, calculating errors, and updating neural network weights using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor 110 may process the training of a network function. For example, the CPU and GPGPU together can process training of a network function and data classification using the network function. In addition, in an embodiment of the present disclosure, the training of a network function and data classification using the network function may be processed by using processors of a plurality of computing devices together. In addition, a computer program executed in a computing device according to an embodiment of the present disclosure may be a CPU, GPGPU, or TPU executable program.

According to an embodiment of the present disclosure, the memory 130 may store any type of information generated or determined by the processor 110 and any type of information received by the network unit 150.

According to an embodiment of the present disclosure, the memory 130 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk. The computing device 100 may also operate in connection with web storage that performs the storage function of the memory 130 over the Internet. The above description of the memory is only an example, and the present disclosure is not limited thereto.

The network unit 150 according to an embodiment of the present disclosure may use various wired communication systems such as Public Switched Telephone Network (PSTN), x Digital Subscriber Line (xDSL), Rate Adaptive DSL (RADSL), Multi Rate DSL (MDSL), Very High Speed DSL (VDSL), Universal Asymmetric DSL (UADSL), High Bit Rate DSL (HDSL), and Local Area Network (LAN).

In addition, the network unit 150 presented in this specification may use various wireless communication systems such as Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Orthogonal Frequency Division Multiple Access (OFDMA), Single Carrier-FDMA (SC-FDMA), and other systems.

In the present disclosure, the network unit 150 may be configured regardless of its communication mode, whether wired or wireless, and may be configured with various communication networks such as a personal area network (PAN) and a wide area network (WAN). In addition, the network may be the known World Wide Web (WWW), or may use a wireless transmission technology used for short-range communication, such as Infrared Data Association (IrDA) or Bluetooth. The techniques described herein may be used in the networks mentioned above as well as in other networks.

Concepts of terms for describing embodiments of the present disclosure will be described with reference to FIG. 2.

A kinase is a protein (enzyme) that mediates cell signal transduction related to cell growth, apoptosis, inflammation, and metabolism by transferring the gamma phosphate of ATP to other proteins. It is a major target protein to be considered in the development of anticancer drugs. In order to develop a kinase-targeted anticancer drug, it is necessary to selectively inhibit the target kinase and at the same time design the compound to have a structure different from known kinase-binding compounds. Here, the differentiation of a novel kinase-binding compound is determined by the scaffold structure that binds to the hinge region of the kinase active site.

The chemical structure shown in FIG. 2 is that of bosutinib, a drug used to treat chronic myelogenous leukemia, and is composed of a hinge binding site 200 and a scaffold 201. Here, the scaffold 201 is hydrogen bonded, and the hinge binding site 200 may be capable of hydrogen bonding with the scaffold. (Steric hindrance, as the term is used in the present disclosure, means that the three-dimensional structure of a compound bound to a kinase collides with the three-dimensional structure of the kinase, making it difficult for the two to bind to each other.)

Because a hydrogen bond arises from electrostatic attraction between hydrogen and oxygen or between hydrogen and nitrogen, steric hindrance may occur in the binding of the compound and the kinase, in the form of an effect on the attraction of surrounding ions and a change in reactivity due to the three-dimensional structure of the molecule.

In addition, in the hinge of the kinase active site, hydrogen bonds with the backbone of the first amino acid residue (gk+1) and the backbone of the third amino acid residue (gk+3) from the amino acid residue corresponding to the gatekeeper (gk) are the main interactions in kinase binding. Accordingly, a chemical structure including a scaffold capable of forming hydrogen bonds with the oxygen or nitrogen contained in the peptide bond of the first backbone and the peptide bond of the third backbone is a basic requirement for a kinase-binding inhibitor.

Throughout this specification, computational model, neural network, and network function may be used interchangeably. A neural network may consist of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network includes one or more nodes. Nodes (or neurons) constituting neural networks may be interconnected by one or more links.

In a neural network, one or more nodes connected through a link may form a relative relationship of an input node and an output node. The concept of an input node and an output node is relative, and any node in an output node relationship with one node may have an input node relationship with another node, and vice versa. As described above, an input node to output node relationship may be created around a link. More than one output node can be connected to one input node through a link, and vice versa.

In a relationship between an input node and an output node connected through one link, the value of data of the output node may be determined based on data input to the input node. Here, a link interconnecting an input node and an output node may have a weight. The weight may be variable, and may be changed by a user or an algorithm in order for the neural network to perform a desired function. For example, when one or more input nodes are interconnected by respective links to one output node, the value of the output node may be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to the respective input nodes.
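The weighted relationship described above can be written out as a short calculation. The sketch below assumes a weighted sum followed by a ReLU activation; the bias term and the choice of activation are illustrative assumptions, since the text only states that the output depends on the input values and the link weights.

```python
def output_node_value(inputs, weights, bias=0.0):
    """Combine input-node values with their link weights.

    The weighted sum is passed through a ReLU, an assumed activation.
    """
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, weighted_sum)

# Two input nodes feeding one output node through weighted links.
example_output = output_node_value([1.0, 2.0], [0.5, 0.25])
```

Changing a single link weight changes the output value, which is why two networks with identical topology but different weights behave as different networks.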

As described above, in the neural network, one or more nodes are interconnected through one or more links to form an input node and output node relationship in the neural network. Characteristics of the neural network may be determined according to the number of nodes and links in the neural network, an association between the nodes and links, and a weight value assigned to each link. For example, when there are two neural networks having the same number of nodes and links and different weight values of the links, the two neural networks may be recognized as different from each other.

A neural network may be composed of a set of one or more nodes. A subset of the nodes constituting a neural network may constitute a layer. Some of the nodes constituting the neural network may form one layer based on their distance from the initial input node. For example, the set of nodes at a distance of n from the initial input node may constitute the n-th layer. The distance from the initial input node may be defined as the minimum number of links that must be traversed to reach the corresponding node from the initial input node. However, this definition of a layer is chosen for the purpose of explanation, and the order of a layer in a neural network may be defined in a different way. For example, a layer of nodes may be defined by the distance from a final output node.
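The distance-based definition of a layer corresponds to a breadth-first search from the initial input nodes. The following sketch assigns each node a layer index equal to the minimum number of links needed to reach it; the node names and the dictionary-based link representation are hypothetical.

```python
from collections import deque

def assign_layers(links, input_nodes):
    """Map each reachable node to its minimum link distance from the inputs.

    `links` maps a node to the list of nodes it feeds into.
    """
    distance = {node: 0 for node in input_nodes}
    queue = deque(input_nodes)
    while queue:
        node = queue.popleft()
        for successor in links.get(node, []):
            if successor not in distance:
                # First visit in breadth-first order gives the minimum distance.
                distance[successor] = distance[node] + 1
                queue.append(successor)
    return distance

# Hypothetical network: two inputs, two hidden nodes, one output.
example_links = {"x1": ["h1", "h2"], "x2": ["h1"], "h1": ["y"], "h2": ["y"]}
example_layers = assign_layers(example_links, ["x1", "x2"])
```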

An initial input node may refer to one or more nodes, among the nodes in the neural network, to which data is directly input without passing through a link from another node. Alternatively, in the link-based relationships between nodes in a neural network, it may mean a node that has no other input node connected to it by a link. Similarly, a final output node may refer to one or more nodes, among the nodes in the neural network, that have no output node in relation to other nodes. A hidden node may refer to a node constituting the neural network other than the initial input nodes and the final output nodes.

A deep neural network (DNN) may refer to a neural network including a plurality of hidden layers in addition to an input layer and an output layer. Deep neural networks can reveal latent structures in data. In other words, they can identify the latent structure of a photo, text, video, sound, or music (e.g., what objects are in the photo, what the content and emotion of the text are, what the content and emotion of the audio are, etc.). Deep neural networks may include convolutional neural networks (CNNs), recurrent neural networks (RNNs), autoencoders, generative adversarial networks (GANs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), Q networks, U networks, Siamese networks, and graph neural networks (GNNs). The description of the deep neural network described above is only an example, and the present disclosure is not limited thereto.

The neural network may be trained using at least one of supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Learning of the neural network may be a process of applying knowledge for the neural network to perform a specific operation to the neural network.

A neural network can be trained in a way that minimizes output error. Training a neural network is a process of repeatedly inputting training data into the neural network, calculating the error between the output of the neural network for the training data and the target, and updating the weight of each node of the neural network by backpropagating the error from the output layer toward the input layer in the direction that reduces the error. In supervised learning, training data labeled with the correct answer (i.e., labeled training data) is used, whereas in unsupervised learning the correct answer may not be labeled in each training datum. For example, training data for supervised learning of data classification may be data in which each training datum is labeled with a category. The labeled training data is input to the neural network, and an error can be calculated by comparing the output (category) of the neural network with the label of the training data. As another example, in unsupervised learning for data classification, an error may be calculated by comparing the input training data with the neural network output. The calculated error is backpropagated in the reverse direction (i.e., from the output layer to the input layer) in the neural network, and the connection weight of each node of each layer may be updated according to the backpropagation. The amount of change in each updated connection weight may be determined according to a learning rate. The neural network's computation on input data and the backpropagation of errors can constitute a learning cycle (epoch). The learning rate may be applied differently according to the number of iterations of the learning cycle. For example, a high learning rate may be used in the early stage of neural network training to increase efficiency by allowing the neural network to quickly reach a certain level of performance, and a low learning rate may be used in the late stage to increase accuracy.
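The training cycle described above (compute the output, measure the error against the label, update the weights in the error-reducing direction, and lower the learning rate over epochs) can be sketched for the simplest possible case, a single linear node. The squared-error loss, the decay formula, and all names are assumptions chosen for brevity, not the training procedure of the disclosure.

```python
def train_linear_neuron(data, epochs=500, initial_lr=0.1, decay=0.01):
    """Fit output = w * x + b to labeled pairs by per-sample gradient descent."""
    w, b = 0.0, 0.0
    for epoch in range(epochs):
        # Higher learning rate in early epochs, lower in late epochs.
        lr = initial_lr / (1.0 + decay * epoch)
        for x, target in data:
            output = w * x + b
            error = output - target
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * x
            b -= lr * error
    return w, b

# Labeled training data generated from target = 2 * x + 1.
example_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
learned_w, learned_b = train_linear_neuron(example_data)
```

In a multi-layer network the same update is applied to every connection weight, with the error signal passed backward layer by layer.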

In neural network training, the training data is generally a subset of the real data (i.e., the data to be processed using the trained neural network). Therefore, there may be learning cycles in which the error on the training data decreases while the error on the real data increases. Overfitting is a phenomenon in which the error on real data increases due to such excessive learning of the training data. For example, a neural network that has learned what a cat is only from yellow cats and fails to recognize a cat of another color as a cat exhibits a type of overfitting. Overfitting can increase the error of a machine learning algorithm. Various optimization methods can be used to prevent such overfitting, such as increasing the training data, regularization, dropout (inactivating some nodes in the network during training), and using a batch normalization layer.
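Of the countermeasures listed, inactivating some nodes during training is the easiest to show in a few lines. The sketch below uses the common "inverted" variant, which rescales the surviving activations; that rescaling, the function name, and the seeded generator are assumptions for illustration.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training; pass through otherwise.

    Surviving values are divided by the keep probability so that the
    expected sum is unchanged (the assumed "inverted dropout" variant).
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

# Seeded generator so the illustration is repeatable.
example_rng = random.Random(0)
example_dropped = dropout([1.0] * 1000, 0.5, example_rng)
```

At inference time `training=False` leaves the activations untouched, so no node is inactivated when the trained network is actually used.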

According to an embodiment of the present disclosure, a computer readable medium storing a data structure is disclosed.

Data structure can refer to the organization, management, and storage of data that enables efficient access and modification of data. Data structure may refer to the organization of data to solve a specific problem (eg, data retrieval, data storage, data modification in the shortest time). A data structure may be defined as a physical or logical relationship between data elements designed to support a specific data processing function. A logical relationship between data elements may include a connection relationship between user-defined data elements. A physical relationship between data elements may include an actual relationship between data elements physically stored in a computer-readable storage medium (eg, a persistent storage device). The data structure may specifically include a set of data, a relationship between data, and a function or command applicable to the data. Through an effectively designed data structure, a computing device can perform calculations while using minimal resources of the computing device. Specifically, the computing device can increase the efficiency of operation, reading, insertion, deletion, comparison, exchange, and search through an effectively designed data structure.

The data structure can be divided into a linear data structure and a non-linear data structure according to its shape. A linear data structure may be a structure in which only one data item is connected after each data item. Linear data structures may include lists, stacks, queues, and deques. A list may refer to a series of data sets in which an order exists internally. The list may include a linked list. A linked list may be a data structure in which data items are connected in a single line, each holding a pointer. In a linked list, a pointer can contain connection information to the next or previous data item. A linked list can be a singly linked list, a doubly linked list, or a circular linked list depending on its form. A stack can be a data listing structure that allows limited access to data. A stack can be a linear data structure in which data can be processed (e.g., inserted or deleted) at only one end of the data structure. The data stored in a stack follows a last in, first out (LIFO) order, in which data stored later comes out first. A queue is also a data listing structure that allows limited access to data, but unlike a stack, it follows a first in, first out (FIFO) order, in which data stored later comes out later. A deque can be a data structure that can process data at both ends of the data structure.
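The access rules of the stack, the queue, and the double-ended queue can be demonstrated with the Python standard library. This is only an illustration of the ordering behaviour; the variable names are arbitrary.

```python
from collections import deque

# Stack: insert and remove at the same end (last in, first out).
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
last_in = stack.pop()

# Queue: insert at the back, remove from the front (first in, first out).
queue = deque(["a", "b", "c"])
first_in = queue.popleft()

# Double-ended queue: insert and remove at either end.
both_ends = deque(["b"])
both_ends.appendleft("a")
both_ends.append("c")
ends = (both_ends.popleft(), both_ends.pop())
```

Here `last_in` is the most recently stored item, `first_in` is the earliest stored item, and `ends` holds one item taken from each end.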

The non-linear data structure may be a structure in which a plurality of data items are connected after one data item. The non-linear data structure may include a graph data structure. A graph data structure can be defined by vertices and edges, where an edge is a line connecting two different vertices. A graph data structure may include a tree data structure. The tree data structure may be a data structure in which exactly one path connects any two different vertices among the plurality of vertices included in the tree. That is, it may be a graph data structure that does not form a loop.
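The statement that a tree is a graph without loops can be checked mechanically: an undirected graph on n vertices is a tree exactly when it is connected and has n - 1 edges. The sketch below applies that test; the function name and the integer vertex labels are assumptions, and the empty graph is treated as not a tree by convention.

```python
def is_tree(num_vertices, edges):
    """Return True if the undirected graph is connected and has no loop."""
    if num_vertices == 0 or len(edges) != num_vertices - 1:
        return False
    adjacency = {v: [] for v in range(num_vertices)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Depth-first traversal from vertex 0 to test connectivity.
    seen = {0}
    pending = [0]
    while pending:
        vertex = pending.pop()
        for neighbour in adjacency[vertex]:
            if neighbour not in seen:
                seen.add(neighbour)
                pending.append(neighbour)
    return len(seen) == num_vertices

star_is_tree = is_tree(4, [(0, 1), (1, 2), (1, 3)])
triangle_is_tree = is_tree(3, [(0, 1), (1, 2), (2, 0)])
```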

Throughout this specification, computational model, neural network, and network function may be used interchangeably. Hereinafter, the term neural network is used uniformly. The data structure may include a neural network, and the data structure including the neural network may be stored in a computer-readable medium. The data structure including the neural network may also include preprocessed data for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data obtained from the neural network, an activation function associated with each node or layer of the neural network, and a loss function for training the neural network. A data structure including a neural network may include any of the components described above; that is, it may be configured to include all of these components or any combination thereof. In addition to the foregoing configurations, the data structure including the neural network may include any other information that determines the characteristics of the neural network. In addition, the data structure may include all types of data used or generated in the computational process of the neural network, and is not limited to the above. A computer-readable medium may include a computer-readable recording medium and/or a computer-readable transmission medium. A neural network may consist of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network includes one or more nodes.
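One way to picture a data structure that bundles a neural network with its associated data is a simple record type that can be written to and read back from a storage medium. The class name, its fields, and the use of JSON are assumptions for illustration only; the disclosure does not prescribe a particular layout or serialization format.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class NeuralNetworkRecord:
    """Hypothetical container for the components listed in the text."""
    weights: list
    hyperparameters: dict
    activation: str
    loss: str
    input_data: list = field(default_factory=list)
    output_data: list = field(default_factory=list)

record = NeuralNetworkRecord(
    weights=[[0.5, -0.25]],
    hyperparameters={"learning_rate": 0.01},
    activation="relu",
    loss="cross_entropy",
)

# Serialize for storage, then restore the same structure.
serialized = json.dumps(asdict(record))
restored = NeuralNetworkRecord(**json.loads(serialized))
```

Any subset of the fields could be stored on its own, which matches the statement that the data structure may contain all of the components or any combination of them.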

The data structure may include data input to the neural network. A data structure including data input to the neural network may be stored in a computer readable medium. Data input to the neural network may include training data input during the neural network learning process and/or input data input to the neural network after learning has been completed. Data input to the neural network may include pre-processed data and/or data subject to pre-processing. Pre-processing may include a data processing process for inputting data to a neural network. Accordingly, the data structure may include data subject to pre-processing and data generated by pre-processing. The foregoing data structure is only an example, and the present disclosure is not limited thereto.

The data structure may include the weights of the neural network. (In this specification, weights and parameters may be used with the same meaning.) Also, a data structure including weights of a neural network may be stored in a computer readable medium. A neural network may include a plurality of weights. A weight may be variable, and may be changed by a user or an algorithm so that the neural network performs a desired function. For example, when one or more input nodes are each connected by a respective link to one output node, the output node may determine the data value it outputs based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to the respective input nodes. The foregoing data structure is only an example, and the present disclosure is not limited thereto.
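As a non-limiting illustration of how an output node value may be determined from input values and link weights, the following minimal sketch computes a weighted sum followed by an activation function. The bias term and the choice of a sigmoid activation are assumptions made for this example; the disclosure does not fix a particular activation function.

```python
import math


def node_output(inputs, weights, bias=0.0):
    """Output of one node: sigmoid(sum_i inputs[i] * weights[i] + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("each input node needs exactly one link weight")
    # Weighted sum over the links from the input nodes to the output node.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation (an assumed choice for this sketch).
    return 1.0 / (1.0 + math.exp(-z))
```

Changing a weight changes the contribution of the corresponding input node, which is how training may adjust the function performed by the network.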

As a non-limiting example, the weights may include weights that are varied during neural network training and/or weights for which neural network training has been completed. The variable weight in the neural network learning process may include a weight at the time the learning cycle starts and/or a variable weight during the learning cycle. The weights for which neural network learning has been completed may include weights for which learning cycles have been completed. Accordingly, the data structure including the weights of the neural network may include a data structure including weights that are variable during the neural network learning process and/or weights for which neural network learning is completed. Therefore, it is assumed that the above-described weights and/or combinations of weights are included in the data structure including the weights of the neural network. The foregoing data structure is only an example, and the present disclosure is not limited thereto.

The data structure including the weights of the neural network may be stored in a computer readable storage medium (e.g., a memory or a hard disk) after going through a serialization process. Serialization may be the process of converting a data structure into a form that can be stored on the same or another computing device and later reconstructed and used. The computing device may transmit and receive data through a network by serializing the data structure. The serialized data structure including the weights of the neural network may be reconstructed on the same computing device or another computing device through deserialization. The data structure including the weights of the neural network is not limited to serialization. Furthermore, the data structure including the weights of the neural network may include a data structure for increasing the efficiency of operations while minimizing the use of computing device resources (for example, a B-Tree, Trie, m-way search tree, AVL tree, or Red-Black Tree among non-linear data structures). The foregoing is only an example, and the present disclosure is not limited thereto.
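As a non-limiting illustration of serialization and deserialization, the following minimal sketch converts a weight structure to bytes and reconstructs it. JSON is used here only as an assumed, convenient format; the function names and the example weight layout are hypothetical and are not prescribed by the disclosure.

```python
import json


def serialize_weights(weights):
    """Convert a nested dict/list of weights into bytes for storage or transfer."""
    return json.dumps(weights, sort_keys=True).encode("utf-8")


def deserialize_weights(payload):
    """Reconstruct the weight structure from serialized bytes."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical weights of a small network, keyed by layer name.
example_weights = {
    "layer1": [[0.1, -0.2], [0.3, 0.4]],
    "bias1": [0.0, 0.5],
}
payload = serialize_weights(example_weights)
restored = deserialize_weights(payload)
```

The byte payload can be written to a storage medium or sent over a network, and the restored structure is equal to the original.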

The data structure may include hyperparameters of the neural network. Also, the data structure including the hyperparameters of the neural network may be stored in a computer readable medium. A hyperparameter may be a variable that can be changed by a user. Hyperparameters may include, for example, the learning rate, the cost function, the number of learning cycle iterations, weight initialization (e.g., setting the range of weight values to be targeted for weight initialization), and the number of hidden units (e.g., the number of hidden layers and the number of nodes in each hidden layer). The foregoing data structure is only an example, and the present disclosure is not limited thereto.

A graph neural network (GNN), a type of neural network model used in the present disclosure, is a neural network model constructed in relation to a graph. A graph is composed of vertices and edges connecting pairs of vertices, and is mainly used when analyzing data that represents relationships or interactions between data. In general, a graph is expressed as G = (V, E), where V is the set of vertices and E is the set of edges (for example, G = ({A, B, C}, {{A, B}, {A, C}, {B, C}})). At this time, it can be understood that the GNN uses each vertex as a node and each edge as a link in relation to the graph. By utilizing the GNN, problems of classifying individual nodes, problems of predicting link values (weights) between nodes, and problems of classifying the entire graph can be solved. At this time, in the present disclosure, each vertex may contain information about the type of atom, and each edge may contain information including positional information between atoms, whether or not they are bonded, and attractive and repulsive forces.
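As a non-limiting illustration of the graph G = (V, E) described above, the following minimal sketch stores the example graph with vertices A, B, and C as an adjacency mapping. The function name `build_graph` and the atom types assigned to the vertices are hypothetical and serve only to show how a vertex may carry atom-type information.

```python
def build_graph(vertices, edges):
    """Build an undirected adjacency mapping from a vertex collection and edge pairs."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        if a not in adjacency or b not in adjacency:
            raise KeyError(f"edge ({a}, {b}) refers to an unknown vertex")
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


# Hypothetical vertex information: each vertex stores an atom type.
atom_types = {"A": "N", "B": "C", "C": "O"}
# E = {{A, B}, {A, C}, {B, C}} as in the example in the text.
adjacency = build_graph(atom_types, [("A", "B"), ("A", "C"), ("B", "C")])
```

Edge attributes such as distance or bond order could be stored in a second mapping keyed by the vertex pair; that extension is omitted here for brevity.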

Referring to FIG. 4, an embodiment of a method for classifying whether each atom of a compound is bound to a hinge region of a kinase is disclosed.

Referring to step S400, the processor 110 of the computer device 100 according to an embodiment of the present disclosure may generate, based on the chemical structure of the compound, a feature vector representing surrounding environment information that includes interaction information between neighboring atoms that can affect each atom of the compound. In this case, the feature vector may represent the spatial distribution of each atom of the compound and its surrounding atoms using an atom-centered symmetry function. For example, the feature vector representing the surrounding environment information may include atomic number, atomic orbital hybridization, ring formation, aromaticity, number of bonded atoms, hydrogen bond class, atomic mass, Gasteiger atomic charge, and spatial distribution of atoms.
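As a non-limiting illustration of describing the spatial distribution of atoms around a central atom, the following minimal sketch evaluates a radial atom-centered symmetry function with a cosine cutoff. The specific functional form (a Behler-Parrinello style radial function), the parameter names `eta`, `r_s`, and `r_c`, and their default values are assumptions made for this example; the disclosure does not specify which symmetry function or parameters are used.

```python
import math


def cutoff(r, r_c):
    """Cosine cutoff: decays smoothly from 1 at r = 0 to 0 at r >= r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def radial_symmetry(coords, i, eta, r_s, r_c):
    """Sum over neighbors j of exp(-eta * (r_ij - r_s)^2) * cutoff(r_ij)."""
    total = 0.0
    for j, position in enumerate(coords):
        if j == i:
            continue
        r = math.dist(coords[i], position)
        total += math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)
    return total


def atom_feature_vector(coords, i, etas=(0.5, 1.0, 2.0), r_s=0.0, r_c=6.0):
    """One radial symmetry value per eta, describing the environment of atom i."""
    return [radial_symmetry(coords, i, eta, r_s, r_c) for eta in etas]
```

Because the function depends only on interatomic distances, the resulting values do not change when the whole compound is rotated or translated. In practice such values could be concatenated with the other per-atom properties listed above.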

In addition, the bond between each atom and the hinge region of the kinase in the present disclosure may be a hydrogen bond. In this case, the processor 110 may generate a feature vector representing the surrounding environment information of each atom of the compound using an atom-centered symmetry function or a GNN. In this case, the GNN can be trained based on information about the structure of the kinase and the compound, including the position of the hinge in the kinase, the type of atoms bonded to the hinge, and the like.

Specifically, the GNN may be trained based on binding information between atoms of molecules included in proteins contained in a protein data bank. In addition, when the GNN trained based on the protein data bank receives the chemical structure of a compound, it can output, based on the initialized weights, feature values for predicting whether there is hydrogen bonding with atoms in the hinge region and whether there is steric hindrance. (At this time, the feature values exist in a virtual space within the hidden layer of the GNN.) However, data for training the GNN is not limited to the protein data bank, and other databases related to the binding of proteins and compounds can also be used.

Subsequently, referring to step S401, the processor 110 may classify, based on the feature vector, whether a hydrogen bond is formed between each atom of the compound and the hinge region of the kinase. In addition, in classifying the binding, the processor 110 may classify whether steric hindrance occurs between each atom of the compound and the hinge region of the kinase using a machine learning classification model based on the feature vector. Also, in classifying the binding, the processor 110 may consider, based on feature values existing in a virtual space in the GNN, the type and distance of the atoms causing steric hindrance and whether the three-dimensional structure of the compound collides with the three-dimensional structure of the kinase, and may output whether a collision occurs as a result of the output layer of the GNN.

At this time, considering the gatekeeper amino acid residue of the molecule included in the kinase and its surrounding amino acid residues, whether or not binding occurs may include: whether there is binding to the backbone of the first closest (gk+1) amino acid residue and the backbone of the third closest (gk+3) amino acid residue from the gatekeeper; whether the compound contains a parent structure that can form hydrogen bonds with the oxygen or nitrogen contained in the peptide bonds of those backbones; and whether a steric hindrance effect occurs.

Specifically, when the processor 110 classifies, based on the feature vector, whether each atom (e.g., nitrogen, oxygen, etc.) of the compound binds to the hinge region of the kinase, a predicted value is generated for each class for each atom of the compound, and the binding between each atom and the hinge region of the kinase can be classified according to the class having the highest predicted value. At this time, the classes may include four classes. For example: ① [Class 1] whether each atom of the compound forms a hydrogen bond with the oxygen contained in the peptide bond of the backbone of the amino acid residue (gk+1) closest to the amino acid residue corresponding to the gatekeeper; ② [Class 2] whether each atom of the compound forms a hydrogen bond with the nitrogen contained in the peptide bond of the backbone of the third closest amino acid residue (gk+3) from the gatekeeper; ③ [Class 3] whether each atom of the compound forms a hydrogen bond with the hydrogen contained in the peptide bond of the backbone of the third closest amino acid residue (gk+3) from the gatekeeper; and ④ [Class 4] whether or not each atom of the compound corresponds to Classes 1 to 3 described above.
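As a non-limiting illustration of selecting the class having the highest predicted value for each atom, the following minimal sketch applies an arg-max over per-class predicted values. The function names, the atom identifiers, and the numeric scores are hypothetical; how the predicted values are produced is outside the scope of this sketch.

```python
def classify_atom(class_scores):
    """Return the class id whose predicted value is highest for one atom."""
    if not class_scores:
        raise ValueError("at least one class score is required")
    # On a tie, the class that appears first in the mapping is returned.
    return max(class_scores, key=class_scores.get)


def classify_compound(per_atom_scores):
    """Map each atom id to its highest-scoring class (1 to 4)."""
    return {atom: classify_atom(scores) for atom, scores in per_atom_scores.items()}


# Hypothetical predicted values for two atoms of a compound.
example_scores = {
    "N1": {1: 0.10, 2: 0.70, 3: 0.15, 4: 0.05},
    "O7": {1: 0.05, 2: 0.05, 3: 0.10, 4: 0.80},
}
example_classes = classify_compound(example_scores)
```

In this example, atom N1 is assigned Class 2 and atom O7 is assigned Class 4.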

In addition, the binding information calculated through the above process can be used as an evaluation index of a molecular generation deep learning model for designing a kinase inhibitor compound.

Referring to FIG. 5, the processor 110 of the computer device 100 according to an embodiment of the present disclosure may generate, based on the chemical structure of the compound 500, a feature vector 502 representing surrounding environment information of each atom of the compound 500 using an atom-centered symmetry function 501. In addition, the processor 110 may classify whether each atom of the compound 500 binds to the hinge region of the kinase (504) by using the machine learning classification model 503 based on the feature vector 502. In this case, the machine learning classification model 503 is trained based on the protein data bank, and its hyperparameters may be optimized using a Bayesian method. In addition, the machine learning classification model 503 may include models such as SVM, XGBoost, and LightGBM.

Referring to FIG. 5, an embodiment of a method for predicting binding using a plurality of machine learning classification models is disclosed. The processor 110 may generate a feature vector 502 representing information about the surrounding environment of each atom of the compound 500 using the atom-centered symmetry function 501 based on the chemical structure of the compound 500. In addition, the processor 110 may classify whether each atom of the compound 500 binds to the hinge region of the kinase (504) by using a machine learning model based on the feature vector 502. In this case, the processor 110 may generate a plurality of output data indicating whether binding occurs (504) based on a plurality of different machine learning models. In addition, the processor 110 may ensemble the plurality of output data to generate a final output indicating whether binding occurs (504). At this time, the ensemble method may use a statistical average value, a variance value, a maximum value, or a minimum value.
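As a non-limiting illustration of combining the outputs of several models, the following minimal sketch reduces per-atom scores from multiple models using an average, variance, maximum, or minimum. The function name, the use of population variance, and the example scores are assumptions made for this example.

```python
from statistics import mean, pvariance

# Reducers corresponding to the ensemble options named in the text.
REDUCERS = {"mean": mean, "variance": pvariance, "max": max, "min": min}


def ensemble_scores(model_outputs, method="mean"):
    """Combine per-atom scores from several models into one score per atom.

    model_outputs: one list of per-atom scores per model, all of equal length.
    """
    if method not in REDUCERS:
        raise ValueError(f"unknown ensemble method: {method}")
    lengths = {len(scores) for scores in model_outputs}
    if len(lengths) != 1:
        raise ValueError("every model must score the same number of atoms")
    reducer = REDUCERS[method]
    # zip(*...) regroups the scores so that each tuple belongs to one atom.
    return [reducer(per_atom) for per_atom in zip(*model_outputs)]
```

For instance, `ensemble_scores([[0.2, 0.8], [0.4, 0.6]], "max")` keeps the larger score for each atom. The variance option does not itself yield a prediction; it may serve as a measure of disagreement between models.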

Referring to FIG. 6, an embodiment of a method for the processor 110 to predict binding using a GNN is disclosed.

Referring to FIG. 6, the processor 110 of the computer device 100 according to an embodiment of the present disclosure may classify whether each atom of the compound 600 binds to the hinge region of the kinase (603) by using a graph neural network (GNN) 601 based on the chemical structure of the compound 600. At this time, the GNN 601 can generate a feature vector 602 in the latent space, and classify the binding (603) based on the feature vector 602. At this time, the GNN 601 is trained based on the protein data bank 610, and the protein data bank 610 may include at least one of a two-dimensional graph including connection information between atoms of compounds including proteins, or a three-dimensional graph including spatial information between atoms. At this time, the GNN 601 is a neural network based on a graph defined by point-to-point connections including state information; it learns the state of each node based on the connection relationships between nodes and the state information of neighboring nodes, and can classify the binding (603) based on the relationships between the nodes in their learned states. The processor 110 may include, as state information in each node of the GNN, the type of atom (e.g., carbon, hydrogen, or nitrogen), which is information related to the atoms of a compound. In addition, the processor 110 can consider the state information and neighbor relationships between atoms three-dimensionally using the GNN, and can predict whether hydrogen bonding with the hinge region of the kinase is effective by predicting the attraction caused by the electrons included in each atom.
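As a non-limiting illustration of updating the state of each node from the state information of its neighboring nodes, the following minimal sketch performs one round of neighbor averaging. This is a generic, simplified message-passing step without learnable weights or a nonlinearity; the function name, the mixing parameter `self_weight`, and the example states are assumptions and do not represent the architecture of the GNN 601.

```python
def message_passing_step(states, adjacency, self_weight=0.5):
    """One update: mix each node's state with the mean of its neighbors' states.

    states: dict mapping node id to a list of floats (equal length for all nodes)
    adjacency: dict mapping node id to an iterable of neighbor node ids
    """
    new_states = {}
    for node, state in states.items():
        neighbors = list(adjacency.get(node, ()))
        if neighbors:
            # Element-wise mean of the neighbor state vectors.
            aggregated = [
                sum(states[n][k] for n in neighbors) / len(neighbors)
                for k in range(len(state))
            ]
        else:
            # An isolated node keeps its own state.
            aggregated = list(state)
        new_states[node] = [
            self_weight * s + (1.0 - self_weight) * a
            for s, a in zip(state, aggregated)
        ]
    return new_states


# Hypothetical two-atom fragment with one-hot atom-type states [is_N, is_C].
states = {"N1": [1.0, 0.0], "C1": [0.0, 1.0]}
adjacency = {"N1": ["C1"], "C1": ["N1"]}
updated = message_passing_step(states, adjacency)
```

Repeating the step lets information from atoms several bonds away reach each node; a trained network would additionally apply learned weight matrices and activation functions at each step.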

Meanwhile, the processor 110 may perform an operation of predicting binding using both the machine learning classification model according to FIG. 5 and the GNN according to FIG. 6. Specifically, the processor 110 may ensemble the output data of the machine learning classification model according to FIG. 5 and the output data of the GNN according to FIG. 6 to generate a final output indicating whether binding occurs. In addition, the processor may ensemble a plurality of output data of the plurality of machine learning classification models according to FIG. 5 and the output data of the GNN according to FIG. 6 to generate a final output indicating whether binding occurs. In this case, an average, variance, maximum value, or minimum value may be used as the ensemble method. When binding is determined using different types of models, the accuracy of classification can be improved because feature vectors from various viewpoints can be reflected.

Referring to FIG. 7, an embodiment of a method for a processor to generate a library based on compounds whose binding has been classified is disclosed. Referring to step S700, the processor 110 of the computer device 100 according to an embodiment of the present disclosure may generate a feature vector representing information about the surrounding environment of each atom of the compound based on the chemical structure of the compound. Subsequently, referring to step S701, the processor 110 may classify whether each atom of the compound binds to the hinge region of the kinase based on the feature vector. Subsequently, referring to step S702, a library may be created by filtering the compounds whose binding has been classified according to a predetermined criterion. For example, the processor 110 may generate a data set by filtering data related to a specific cancer among the compounds whose binding has been classified, and by including annotations indicating which cancers they are associated with. That is, the processor 110 may use the machine learning and GNN-based classification models to extract a kinase-focused library from a public compound database. The kinase-focused compound library may take the form of a set of catalog files representing kinases related to each antigen with respect to data stored in a storage medium, and may contain summarized information on large amounts of data.
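As a non-limiting illustration of step S702, the following minimal sketch filters classified compounds into a library. The criterion used here (a minimum number of atoms assigned to Classes 1 to 3), the record layout, and the compound identifiers are all hypothetical; the disclosure leaves the predetermined criterion open.

```python
# Classes that denote a hydrogen bond with the hinge region in the text.
HINGE_CLASSES = {1, 2, 3}


def build_kinase_library(compounds, min_hinge_atoms=1):
    """Keep compounds with at least min_hinge_atoms atoms in Classes 1 to 3.

    compounds: list of dicts with keys "id" and "atom_classes"
               (one class id per atom, as produced by the classification step).
    """
    library = []
    for compound in compounds:
        hinge_atoms = sum(
            1 for cls in compound["atom_classes"] if cls in HINGE_CLASSES
        )
        if hinge_atoms >= min_hinge_atoms:
            library.append({"id": compound["id"], "hinge_atoms": hinge_atoms})
    return library


# Hypothetical classified compounds.
classified = [
    {"id": "cmpd-001", "atom_classes": [4, 1, 2, 4]},
    {"id": "cmpd-002", "atom_classes": [4, 4, 4]},
    {"id": "cmpd-003", "atom_classes": [3, 4]},
]
library = build_kinase_library(classified, min_hinge_atoms=1)
```

Additional annotations, such as an associated cancer type, could be carried through as extra keys in each record.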

For example, considering the gatekeeper amino acid residue of the molecule included in the kinase and its surrounding amino acid residues for each compound, the following may be included: ① whether there is binding to the backbone of the first closest (gk+1) amino acid residue and the backbone of the third closest (gk+3) amino acid residue; ② whether the compound contains a parent structure that can form hydrogen bonds with the oxygen or nitrogen contained in the peptide bonds of those backbones; and ③ whether steric hindrance can occur due to the chemical structure around the hydrogen bonding point. Specifically, it may include whether the compound has a three-dimensional structure that affects the reactivity of ions and molecules such that the chemical reaction rate is adversely affected.

Although the present disclosure has been described above as being generally implementable by a computing device, those skilled in the art will appreciate that the present disclosure may be implemented in combination with computer executable instructions and/or other program modules that may be executed on one or more computers, and/or as a combination of hardware and software.

Generally, program modules include routines, programs, components, data structures, etc. that perform particular tasks or implement particular abstract data types. In addition, those skilled in the art will appreciate that the methods of the present disclosure may be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, and mainframe computers, as well as personal computers, handheld computing devices, microprocessor-based or programmable consumer electronics, and the like (each of which may be operative in connection with one or more associated devices).

The described embodiments of the present disclosure may be practiced in a distributed computing environment where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

Computers typically include a variety of computer readable media. Computer readable media can be any medium that can be accessed by a computer, including volatile and nonvolatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer readable media may include computer readable storage media and computer readable transmission media. Computer readable storage media include volatile and nonvolatile media, transitory and non-transitory media, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer readable storage media may include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage device, magnetic cassette, magnetic tape, magnetic disk storage device or other magnetic storage device, or any other medium that can be accessed by a computer and used to store desired information.

A computer readable transmission medium typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes all information delivery media. The term modulated data signal means a signal that has one or more of its characteristics set or changed so as to encode information within the signal. By way of example, and not limitation, computer readable transmission media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also intended to be included within the scope of computer readable transmission media.

An exemplary environment 1100 implementing various aspects of the present disclosure is shown, including a computer 1102, which includes a processing unit 1104, a system memory 1106, and a system bus 1108. System bus 1108 couples system components, including but not limited to system memory 1106, to processing unit 1104. Processing unit 1104 may be any of a variety of commercially available processors. Dual processor and other multiprocessor architectures may also be used as the processing unit 1104.

System bus 1108 may be any of several types of bus structures that may additionally be interconnected to a memory bus, a peripheral bus, and a local bus using any of a variety of commercial bus architectures. System memory 1106 includes read only memory (ROM) 1110 and random access memory (RAM) 1112. A basic input/output system (BIOS) is stored in non-volatile memory 1110, such as ROM, EPROM, or EEPROM, and contains the basic routines that help transfer information between components within computer 1102, such as during startup. RAM 1112 may include high-speed RAM, such as static RAM for caching data.

Computer 1102 includes an internal hard disk drive (HDD) 1114 (e.g., EIDE, SATA), which may also be configured for external use in a suitable chassis (not shown), a magnetic floppy disk drive (FDD) 1116 (e.g., for reading from or writing to a removable diskette 1118), and an optical disk drive 1120 (e.g., for reading a CD-ROM disk 1122, or for reading from or writing to other high-capacity optical media such as a DVD). The hard disk drive 1114, magnetic disk drive 1116, and optical disk drive 1120 may be connected to the system bus 1108 by a hard disk drive interface 1124, a magnetic disk drive interface 1126, and an optical drive interface 1128, respectively. The interface 1124 for external drive implementation includes at least one or both of USB (Universal Serial Bus) and IEEE 1394 interface technologies.

These drives and their associated computer readable media provide non-volatile storage of data, data structures, computer executable instructions, and the like. In the case of computer 1102, the drives and media accommodate the storage of any data in a suitable digital format. Although the description of computer readable media above refers to HDDs, removable magnetic disks, and removable optical media such as CDs or DVDs, those skilled in the art will appreciate that other types of tangible computer readable media, such as zip drives, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the exemplary operating environment, and that any such media may include computer executable instructions for performing the methods of the present disclosure.

A number of program modules may be stored on the drive and RAM 1112, including an operating system 1130, one or more application programs 1132, other program modules 1134, and program data 1136. All or portions of the operating system, applications, modules and/or data may be cached in RAM 1112. It will be appreciated that the present disclosure may be implemented in a variety of commercially available operating systems or combinations of operating systems.

A user may enter commands and information into the computer 1102 through one or more wired/wireless input devices, such as a keyboard 1138 and a pointing device such as a mouse 1140. Other input devices (not shown) may include a microphone, IR remote control, joystick, game pad, stylus pen, touch screen, and the like. These and other input devices are often connected to the processing unit 1104 through an input device interface 1142 that is connected to the system bus 1108, but may be connected by other interfaces such as a parallel port, IEEE 1394 serial port, game port, USB port, IR interface, and the like.

A monitor 1144 or other type of display device is also connected to the system bus 1108 through an interface such as a video adapter 1146. In addition to the monitor 1144, computers typically include other peripheral output devices (not shown) such as speakers, printers, and the like.

Computer 1102 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1148, via wired and/or wireless communications. Remote computer(s) 1148 may be a workstation, computing device, router, personal computer, handheld computer, microprocessor-based entertainment device, peer device, or other common network node, and generally includes many or all of the components described for computer 1102, but for simplicity, only memory storage device 1150 is shown. The logical connections shown include wired/wireless connections to a local area network (LAN) 1152 and/or a larger network, such as a wide area network (WAN) 1154. Such LAN and WAN networking environments are common in offices and corporations and facilitate enterprise-wide computer networks, such as intranets, all of which can be connected to worldwide computer networks, such as the Internet.

When used in a LAN networking environment, computer 1102 connects to local network 1152 through a wired and/or wireless communication network interface or adapter 1156. Adapter 1156 may facilitate wired or wireless communications to LAN 1152, which may also include a wireless access point installed therein to communicate with wireless adapter 1156. When used in a WAN networking environment, computer 1102 may include a modem 1158, may be connected to a communicating computing device on WAN 1154, or may have other means of establishing communications over WAN 1154, such as over the Internet. A modem 1158, which may be internal or external and a wired or wireless device, is connected to the system bus 1108 through a serial port interface 1142. In a networked environment, program modules described for computer 1102, or portions thereof, may be stored on remote memory/storage device 1150. It will be appreciated that the network connections shown are exemplary and other means of establishing a communication link between computers may be used.

Computer 1102 is operable to communicate with any wireless device or entity that is deployed and operating in wireless communication, e.g., printers, scanners, desktop and/or portable computers, portable data assistants (PDAs), communication satellites, any equipment or place associated with a wirelessly detectable tag, and telephones. This includes at least Wi-Fi and Bluetooth wireless technologies. Thus, the communication may be a predefined structure as in conventional networks or simply an ad hoc communication between at least two devices.

Wi-Fi (Wireless Fidelity) makes it possible to connect to the Internet without wires. Wi-Fi is a wireless technology, similar to that used in a cell phone, that allows devices such as computers to transmit and receive data both indoors and outdoors, i.e., anywhere within the coverage of a base station. Wi-Fi networks use a radio technology called IEEE 802.11 (a, b, g, etc.) to provide secure, reliable, and high-speed wireless connections. Wi-Fi can be used to connect computers to each other, to the Internet, and to wired networks (using IEEE 802.3 or Ethernet). Wi-Fi networks can operate in the unlicensed 2.4 and 5 GHz radio bands, for example, at 11 Mbps (802.11b) or 54 Mbps (802.11a) data rates, or in products that include both bands (dual band).

Those skilled in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced in the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those skilled in the art will understand that the various illustrative logical blocks, modules, processors, means, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented by electronic hardware, various forms of program or design code (referred to herein, for convenience, as software), or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

Various embodiments presented herein may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term article of manufacture includes a computer program, carrier, or media accessible from any computer-readable storage device. For example, computer-readable storage media include magnetic storage devices (eg, hard disks, floppy disks, magnetic strips, etc.), optical disks (eg, CDs, DVDs, etc.), smart cards, and flash memory devices (eg, EEPROM, cards, sticks, key drives, etc.), but are not limited thereto. Additionally, various storage media presented herein include one or more devices and/or other machine-readable media for storing information.

It is to be understood that the specific order or hierarchy of steps in the processes presented is an illustration of exemplary approaches. Based upon design priorities, it is to be understood that the specific order or hierarchy of steps in the processes may be rearranged within the scope of this disclosure. The accompanying method claims present elements of the various steps in a sample order, but are not meant to be limited to the specific order or hierarchy presented.

The description of the presented embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art of this disclosure, and the general principles defined herein may be applied to other embodiments without departing from the scope of this disclosure. Thus, the present disclosure is not to be limited to the embodiments presented herein, but is to be interpreted in the widest scope consistent with the principles and novel features presented herein.

Claims

A method for predicting binding, performed by a computing device including at least one processor, the method comprising:

generating a feature vector representing surrounding environment information of each atom of a compound based on the chemical structure of the compound; and

classifying whether each atom of the compound binds to a hinge region of a kinase based on the feature vector.

The method of claim 1,

wherein generating the feature vector comprises:

generating a feature vector expressing the spatial distribution of each atom of the compound and its surrounding atoms using an atom-centered symmetry function.
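
As an illustration of the atom-centered symmetry function recited above, the following is a minimal sketch that assumes the widely used Behler-Parrinello radial form with a cosine cutoff. The claims do not specify which symmetry function is used, so the functional form, the function names (`cutoff`, `radial_symmetry_vector`), the parameters (`etas`, `r_s`, `r_c`), and the toy coordinates are all assumptions made for this example.

```python
import math


def cutoff(r, r_c):
    """Cosine cutoff: 1 at r = 0, decaying smoothly to 0 at r = r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def radial_symmetry_vector(coords, center, etas, r_s=0.0, r_c=6.0):
    """Radial symmetry-function features for one atom (assumed form).

    For each width eta, sums exp(-eta * (r - r_s)^2) * cutoff(r) over all
    neighbouring atoms, giving one feature per eta value.
    """
    xi = coords[center]
    features = []
    for eta in etas:
        total = 0.0
        for j, xj in enumerate(coords):
            if j == center:
                continue  # an atom is not its own neighbour
            r = math.dist(xi, xj)
            total += math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)
        features.append(total)
    return features


# Toy geometry in angstroms (made-up values, not taken from any structure).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.5, 0.0), (10.0, 0.0, 0.0)]
etas = (0.5, 1.0, 2.0)

# One feature vector per atom, describing its local environment.
vectors = [radial_symmetry_vector(coords, i, etas) for i in range(len(coords))]
```

Because the features depend only on interatomic distances, they are unchanged when the whole molecule is translated or rotated, which is the usual reason for describing an atom's surroundings this way. An atom with no neighbours inside the cutoff radius receives an all-zero vector.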

The method of claim 2,

wherein classifying whether binding occurs further comprises:

classifying, based on the feature vector and using a machine learning classification model, whether steric hindrance occurs between each atom of the compound and the hinge region of the kinase.

The method of claim 3,

wherein the machine learning classification model is trained based on the PDB (Protein Data Bank), and

has its hyperparameters optimized using the Bayes rule.

The method of claim 1,

wherein generating the feature vector comprises:

generating a feature vector in a latent space using a graph neural network (GNN) based on the compound, and

wherein classifying whether binding occurs comprises:

classifying, based on the feature vector and using the GNN, whether the compound binds to a kinase.

The method of claim 5,

wherein the GNN is trained based on the PDB (Protein Data Bank), and

wherein the PDB includes at least one of:

a two-dimensional graph including connection information between atoms; or

a three-dimensional graph including spatial information between atoms.

The method of claim 1, further comprising:

generating a library by filtering, according to a predetermined criterion, the compounds for which binding has been classified.
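
The library-generation step above can be sketched as a simple filter over classified compounds. The claims leave the "predetermined criterion" open, so the criterion used here (keep a compound if at least one atom is predicted to form a hinge hydrogen bond and no steric hindrance is predicted) is purely an assumption, as are the `ClassifiedCompound` record, the label strings, and the compound names.

```python
from dataclasses import dataclass


@dataclass
class ClassifiedCompound:
    """Hypothetical record holding the per-atom classification results."""
    name: str
    atom_labels: list   # one predicted label per atom
    steric_clash: bool  # True if steric hindrance with the hinge is predicted


def build_library(compounds, required_labels=frozenset({"gk+1", "gk+3"})):
    """Keep compounds that satisfy the assumed filtering criterion."""
    library = []
    for compound in compounds:
        if compound.steric_clash:
            continue  # predicted to clash with the hinge region
        if required_labels & set(compound.atom_labels):
            library.append(compound.name)  # at least one hinge-binding atom
    return library


# Made-up example inputs.
candidates = [
    ClassifiedCompound("cpd-A", ["none", "gk+1", "none"], steric_clash=False),
    ClassifiedCompound("cpd-B", ["none", "none"], steric_clash=False),
    ClassifiedCompound("cpd-C", ["gk+3", "gk+1"], steric_clash=True),
]
library = build_library(candidates)
```

In this toy run only `cpd-A` is retained: `cpd-B` has no atom predicted to bind the hinge, and `cpd-C` is rejected for predicted steric hindrance.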

The method of claim 1, further comprising:

outputting a loss value using a molecular generation deep learning model based on the compound for which binding has been classified.

The method of claim 1,

wherein classifying, based on the feature vector, whether each atom of the compound binds to the hinge region of the kinase comprises:

generating a prediction value for each class indicating a type of binding; and

determining the type of binding between each atom and the hinge region of the kinase based on the class having the highest prediction value.

The method of claim 9,

wherein the classes include, for each atom in the compound, at least one of:

a class indicating whether the atom forms a hydrogen bond with the first amino acid residue from the amino acid residue corresponding to the gatekeeper; or

a class indicating whether the atom forms a hydrogen bond with the third amino acid residue from the gatekeeper.
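
The per-class prediction and highest-value selection described above can be sketched as follows. This is an illustrative example only: the use of a softmax to turn raw scores into prediction values, the third "no hinge bond" class, the class names, and the example scores are assumptions and are not stated in the claims.

```python
import math

# Assumed class set: hydrogen bond with gk+1, hydrogen bond with gk+3, neither.
CLASSES = ("gk+1_hbond", "gk+3_hbond", "no_hinge_bond")


def softmax(scores):
    """Convert raw scores into prediction values that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def binding_type_per_atom(atom_scores):
    """For each atom, return the class with the highest prediction value."""
    result = []
    for scores in atom_scores:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((CLASSES[best], probs[best]))
    return result


# Made-up raw classifier outputs for a two-atom example.
atom_scores = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]
predictions = binding_type_per_atom(atom_scores)
labels = [label for label, _ in predictions]
```

Here the first atom is assigned to the gk+1 hydrogen-bond class and the second to the non-binding class, since those classes carry the highest prediction value for each atom.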

A device comprising:

a processor comprising one or more cores; and

a memory,

wherein the processor is configured to:

generate, based on the chemical structure of a compound, a feature vector including information about the surrounding environment of each atom of the compound; and

classify, based on the feature vector, whether each atom of the compound binds to a hinge region of a kinase.

A computer program stored on a computer-readable storage medium, the computer program causing operations for predicting binding to be performed, the operations comprising:

generating a feature vector including surrounding environment information of each atom of a compound based on the chemical structure of the compound; and

classifying, based on the feature vector, whether each atom of the compound binds to a hinge region of a kinase.