KR20220004945A

KR20220004945A - Scaffold-based molecular design with a graph generative model

Info

Publication number: KR20220004945A
Application number: KR1020210193395A
Authority: KR
Inventors: 김우연; 임재창; 황상연; 문석현; 김승수
Original assignee: 한국과학기술원
Priority date: 2020-05-22
Filing date: 2021-12-30
Publication date: 2022-01-12
Also published as: KR20210145059A

Abstract

The present invention relates to a scaffold-based molecular design method capable of simultaneously controlling multiple desired properties. When a method for generating an extended molecular structure having a predetermined scaffold structure of the present invention is used, a desired molecular structure with optimized performance can be efficiently derived by using only a small amount of labeled data.

Description

Scaffold-based molecular design with a graph generative model

The present invention relates to a scaffold-based molecular design method capable of simultaneously controlling multiple desired properties.

The goal of molecular design is to find molecules with a desired functionality. The vastness of chemical space offers virtually unlimited possibilities for discovering new molecules in a variety of applications. At the same time, that very vastness makes it difficult to find the most promising candidates within an acceptable time. In drug discovery, for example, the number of potential drug-like molecules is estimated to be about 10²³ to 10⁶⁰ [1], of which only 10⁸ have ever been synthesized [2]. Discovering new drugs by brute-force exploration of chemical space is therefore practically impossible, and the demand for more rational approaches has inevitably grown.

A common strategy in rational molecular design is to narrow the search space by starting from molecules already known to be promising. A new molecule is then designed by finding the right combination of functional groups that optimizes the required properties, while the core structure, or scaffold, of the original molecule is deliberately kept in order to preserve its underlying characteristics. Scaffold-based design has been one of the standard approaches in small-molecule drug discovery, where a scaffold is first identified on the basis of information about the target protein and a library of derivative compounds is then searched for those showing optimal potency and selectivity [3-5]. Molecular design in other fields, such as materials chemistry, adopts a similar strategy. A representative example is organic electronics, where typical electron-donor and electron-acceptor moieties are known. Properly selecting or combining scaffold moieties and then optimizing the side chains makes it possible to design new components of organic light-emitting diodes [6], field-effect transistors [7], photocatalysts [8], and solar cells [9]. However, although fixing the core structure greatly reduces the search space, in most practical applications the remaining space is still likely to be beyond the reach of human intuition.

As more thorough and intelligent exploration of chemical space is called for, recent advances in deep generative models have increasingly spurred their use in in silico molecular design [10, 11]. All related studies share the goal of reducing the effort of discovering new molecules in practice [12-31]. Yet despite the prevalence of scaffold-based design described above, it has received little attention to date. One exception is the work of Li et al. [25], in which the authors use a fingerprint representation of the scaffold contained in a molecule. The fingerprint is then used to condition the generation process so that the generated molecules tend to contain the desired scaffold. Another possible way to retain a scaffold is to learn a continuous representation of molecules [15,17,24,32]. Searching the neighborhood of a query molecule in the resulting latent space can produce molecules of similar shape. Thus, if distance in the learned space correlates closely with similarity of the core structure, derivatives of a scaffold can be generated.

The schemes for scaffold-based molecule generation described above have the following drawbacks. First, representing the scaffold of a molecule categorically requires a predefined vocabulary of scaffolds, which restricts the ability to query arbitrary scaffolds when generating molecules. Second, because of their stochastic nature, conditional generation and latent-space search allow molecules that do not contain the desired scaffold. Finally, to the best of the inventors' knowledge, no study has simultaneously controlled the scaffold and the properties of the generated molecules, which is essential for achieving the goal of scaffold-based molecular design.

Related work is summarized as follows.

Defining a generative model for molecules requires choosing a molecular representation suitable for feature learning and molecule generation [37]. One of the most widely used specifications is SMILES, which uses a string of characters to represent the 2D structure and stereoisomerism of a molecule. SMILES-based generative models have accordingly been developed under various frameworks, including language models [12-14], VAEs [15-17, 32], adversarial autoencoders [18], and generative adversarial networks [19]. Some models have used reinforcement learning strategies to strengthen goal-directed generation [19-23].

Despite the well-demonstrated success of SMILES-based models, SMILES has a fundamental limitation in conveying molecular similarity consistently [24]. In contrast, a molecular graph represents the structural characteristics of a molecule more naturally. Because learning a distribution over graphs raises more demanding problems, such as poor scalability and non-unique node ordering [25,38], reports on generative models of graphs, and of molecular graphs in particular, have begun to appear only recently. Most studies adopt a sequential approach to graph generation. For example, the models of Li et al. [25], You et al. [26], and Li et al. [27] generate a graph by predicting a sequence of graph-building actions consisting of node additions and edge additions. Assouel et al. [28] and Liu et al. [29] likewise adopted a sequential scheme, but their graph generation proceeds through two (completely [28] or partially [29]) separate stages: node initialization and connection. The standard node-by-node generation can also be coarse-grained into a structure-by-structure scheme. The junction tree VAE of Jin et al. generates a tree of molecular fragments and then recovers the full molecular graph through a fine-grained assembly of the fragments [24]. In marked contrast to sequential generation schemes is single-step generation of the whole graph. GraphVAE by Simonovsky and Komodakis [30] and MolGAN by De Cao and Kipf [31] generate a graph in one shot by predicting both the adjacency matrix and the graph attributes.

The present inventors made intensive research efforts to develop a molecule generation method that can control two or more desired properties simultaneously and can efficiently generate novel molecules even when trained on limited data. As a result, they found that molecules exhibiting the desired performance can be generated by taking a scaffold as input and sequentially adding atoms and bonds to it, and thereby completed the present invention.

Accordingly, an object of the present invention is to provide a method for generating an extended molecular structure having a predetermined scaffold structure.

According to one aspect of the present invention, there is provided a method for generating an extended molecular structure having a predetermined scaffold structure, the method comprising the following steps:

(a) a step of training a scaffold-based molecular generation model, comprising:

(a-1) training an encoder, which receives a whole-molecule graph G as input, to encode G into a latent vector z; and

(a-2) training a decoder, which additionally receives a scaffold graph S as input, to recover G from z by sequentially adding nodes and edges to S; and

(b) a target molecule generation step of inputting a scaffold graph into the trained scaffold-based molecular generation model to obtain an extended molecular structure.

The present inventors made intensive research efforts to develop a molecule generation method that can control two or more desired properties simultaneously and can efficiently generate novel molecules even when trained on limited data. As a result, they found that molecules exhibiting the desired performance can be generated by taking a scaffold as input and sequentially adding atoms and bonds to it.

In one embodiment of the present invention, the encoder of the present invention includes an algorithm consisting of:

a propagation phase, H'_V(G) = propagate(H_V(G), H_E(G)); and

a readout phase, h_G = readout(H'_V(G)).

In one embodiment of the present invention, the propagation phase of the present invention consists of the following steps:

(a) a first propagation step of computing, for each node, an aggregated message from the node and its neighboring nodes according to Equation 1 below; and

[Equation 1]

[Equation 2]

In one embodiment of the present invention, the propagation phase computes the aggregated message with an additional vector c included, in order to condition the graph propagation.
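The propagation and readout phases can be illustrated with a minimal Python sketch. Because Equations 1 and 2 are not reproduced in the text, the message function, the update rule, and the sum readout below are placeholder assumptions rather than the functions of the present invention; the names `message`, `propagate`, and `readout` and the toy features are likewise illustrative. The optional argument `cond` stands for the additional vector c.

```python
import math

def message(h_src, h_dst, h_edge, cond):
    # Placeholder for a learned message network (assumption): mixes sender,
    # receiver, edge, and condition information into a node-sized vector.
    bias = sum(h_edge) + sum(cond)
    return [math.tanh(a + b + bias) for a, b in zip(h_src, h_dst)]

def propagate(node_feats, edge_feats, cond=()):
    """H'_V(G) = propagate(H_V(G), H_E(G)), optionally conditioned on c."""
    agg = {v: [0.0] * len(h) for v, h in node_feats.items()}
    for (u, v), h_uv in edge_feats.items():
        # Each undirected edge carries a message in both directions.
        for src, dst in ((u, v), (v, u)):
            m = message(node_feats[src], node_feats[dst], h_uv, cond)
            agg[dst] = [x + y for x, y in zip(agg[dst], m)]
    # Placeholder update (assumption): average old feature and message sum.
    return {v: [0.5 * (h + m) for h, m in zip(node_feats[v], agg[v])]
            for v in node_feats}

def readout(node_feats):
    """h_G = readout(H'_V(G)); here a plain sum over all nodes."""
    dim = len(next(iter(node_feats.values())))
    return [sum(h[i] for h in node_feats.values()) for i in range(dim)]

# Toy graph: a three-atom chain 0-1-2 with 2-dimensional node features.
H_V = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
H_E = {(0, 1): [1.0], (1, 2): [1.0]}
H_V_new = propagate(H_V, H_E)
h_G = readout(H_V_new)
h_G_cond = readout(propagate(H_V, H_E, cond=[0.5]))
```

In this toy graph the two end atoms are equivalent, so they receive identical updated features, and supplying a condition vector changes the resulting graph vector.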

In one embodiment of the present invention, the decoder of the present invention recovers G from z by a decoding process comprising the following steps:

(a) a node addition step of (i) selecting the atom type to be added or (ii) terminating the building process;

(b) an edge addition step of (i) selecting a bond type for the node newly added in step (a) or (ii) returning to step (a); and

(c) a node selection step of selecting a node v from the existing nodes other than the newly added node w, adding a new edge (v, w) with the bond type selected in step (b), and returning to step (b).
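The control flow of steps (a) to (c) can be sketched in Python as follows. This is only an illustration of the loop structure: the uniform random choices stand in for the probability vectors of Equations 3 to 5, which are not reproduced in the text, and the vocabularies, the `max_atoms` cap, and the rule that skips nodes already bonded to w are assumptions made for the sketch. Nothing here enforces valence or connectivity.

```python
import random

ATOM_TYPES = ["C", "N", "O"]        # hypothetical atom vocabulary
BOND_TYPES = ["single", "double"]   # hypothetical bond vocabulary
STOP = None                         # sentinel: terminate / return to step (a)

def extend_scaffold(nodes, edges, rng, max_atoms=8):
    """Extend a scaffold graph following steps (a)-(c).

    nodes: list of atom symbols; edges: dict {(v, w): bond type} with v < w.
    Returns new (nodes, edges); the scaffold part is never modified.
    """
    nodes, edges = list(nodes), dict(edges)
    while len(nodes) < max_atoms:
        # (a) node addition: choose an atom type or terminate.
        atom = rng.choice(ATOM_TYPES + [STOP])
        if atom is STOP:
            break
        nodes.append(atom)
        w = len(nodes) - 1
        while True:
            # (b) edge addition: choose a bond type or return to (a).
            bond = rng.choice(BOND_TYPES + [STOP])
            candidates = [v for v in range(w) if (v, w) not in edges]
            if bond is STOP or not candidates:
                break
            # (c) node selection: choose an existing node v != w, add (v, w).
            v = rng.choice(candidates)
            edges[(v, w)] = bond
    return nodes, edges

# Six-membered carbon ring as a toy scaffold (bond orders simplified).
scaffold_atoms = ["C"] * 6
scaffold_bonds = {(i, i + 1): "single" for i in range(5)}
scaffold_bonds[(0, 5)] = "single"
atoms, bonds = extend_scaffold(scaffold_atoms, scaffold_bonds,
                               random.Random(0), max_atoms=10)
```

Because the loop only appends nodes and edges, the input scaffold is always a subgraph of the output, which is the property the sequential scheme is meant to guarantee.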

In one embodiment of the present invention, the probability vector that determines the next action in each of steps (a) to (c) is calculated by one of Equations 3 to 5 below:

[Equation 3]

[Equation 4]

[Equation 5]

The present invention presents a generative model intended for use in scaffold-based molecular design. The model of the present invention is a variational autoencoder (VAE) that takes a molecular scaffold as input and extends it to generate derivative molecules [33]. The scaffold is extended by sequentially adding new atoms and bonds to parts of it. Instead of a widely used text format such as the simplified molecular-input line-entry system (SMILES), the present invention uses graphs to represent molecular structures [34]. This is because a graph expresses more naturally how a new atom or bond, that is, a node or edge, is added at each step, whereas a string representation can change substantially over the course of the sequential extension [24]. A graph neural network is used to extract the structural dependencies between nodes and edges, so that every decision to add a new element is made according to the structure of the graph being processed [25,35,36]. The properties of the generated molecules can be controlled by conditioning the generation process on the desired properties. Through training, the model of the present invention can accept an arbitrary scaffold and generate new molecules while controlling their properties and retaining the core structure.

Ideally, a generative model that needs no starting structure, that is, a de novo generative model of molecules, would be preferable. However, the ability to optimize properties in de novo generation can drop significantly when the relevant structure-property relationship is hard to learn [18]. This tendency can worsen when several properties must be controlled at once, because acquiring a sufficient amount of data for each property is often impractical. By exploiting the properties of the scaffold, the scaffold-based generative model of the present invention can optimize target properties more easily and preserve desirable ones. As an example of the latter, using a scaffold with well-established synthetic routes helps retain the synthetic accessibility of the generated molecules, which is a crucial aspect of in silico design.

The present work contributes to molecular design in two respects. The more practical one is the proposal of the model as a tool for scaffold-based molecular design. The more analytical one is its exploration of the supergraph space. To elaborate on the latter, the inventors first note that once a scaffold is retained, the set of possible generations contains only molecules whose graph is a supergraph of the scaffold graph. Fixing the scaffold thus imposes a strong constraint on the search space, so it is worth investigating how that constraint affects generation performance and property controllability. The first part of the experiments addresses this point. More specifically, the experiments are divided into three parts that examine (i) overall generation performance, (ii) scaffold dependence, and (iii) property controllability of the model. Part (i) shows that the model achieves high rates of validity, uniqueness, and novelty in molecule generation. Part (ii) shows that the model performs consistently well in extending unseen scaffolds as well as seen ones, which indicates that it can build new molecules from smaller ones by learning valid chemical rules rather than memorizing the molecule-scaffold pairings in the training data. Part (iii) shows that, despite the restricted search space, the model can control single or multiple properties of the generated molecules to within a tolerable deviation from the target values.

Finally, returning to the more practical side, the model was applied to the design of inhibitors of the human epidermal growth factor receptor (EGFR). The inventors performed semi-supervised learning by incorporating a property predictor so that the model also learns from unlabeled molecules, which helps with the frequently encountered problem of data scarcity. As a result, the model was able to generate new inhibitors with improved potency: 20% of the unique generated molecules showed a reduction of 100-fold or more in the predicted half-maximal inhibitory concentration. This shows that the model of the present invention can learn to design molecules in more realistic problems, where learning is difficult because little data is available.
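Since the figures report potency as pIC50, it may help to see how a 100-fold reduction in the half-maximal inhibitory concentration (IC50) translates to that scale. The short Python sketch below uses the usual definition pIC50 = -log10(IC50 in mol/L); the concentration values are hypothetical and chosen only for the arithmetic.

```python
import math

def pic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 expressed in mol/L."""
    return -math.log10(ic50_molar)

scaffold_ic50 = 1e-6                    # hypothetical: 1 micromolar
generated_ic50 = scaffold_ic50 / 100    # 100-fold lower IC50
delta = pic50(generated_ic50) - pic50(scaffold_ic50)
```

A 100-fold lower IC50 therefore corresponds to an increase of two pIC50 units, whatever the starting concentration.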

(a) The present invention provides a method for generating an extended molecular structure having a predetermined scaffold structure.

(b) When the method of the present invention is used, a molecular structure optimized for the desired performance can be derived efficiently using only a small amount of labeled data.

FIG. 1 shows the model architecture of the present invention in the training phase.
FIG. 2 shows the property distributions of the generated molecules. The values in the legends are the target property values of the generation tasks. The red line in each plot shows the distribution of the corresponding property over the molecules in the training dataset.
FIG. 3 shows example molecules generated from three scaffolds. The indicated values are the target conditions of generation and the property values of the scaffolds.
FIG. 4 shows the estimated joint distributions of the property values of the generated molecules. The legends give the target values used for generation. In every distribution, the innermost contour encloses 10% of the population, the outermost contour encloses 90%, and the n-th contour in between encloses n × 10%. The marginal distributions of the properties on the horizontal and vertical axes are shown at the top and right of each plot, respectively.
FIG. 5 shows a scatter plot of the property values of the generated molecules. The legend lists the eight sets of property values used for generation.
FIG. 6 shows the distributions of the predicted human-EGFR pIC50 values for the test scaffolds and the generated molecules in the semi-supervised learning experiment. The model of the present invention was extended to a semi-supervised VAE by adding a property predictor.

Hereinafter, the present invention will be described in more detail through examples. These examples are intended only to illustrate the present invention more concretely, and it will be apparent to those of ordinary skill in the art that, in accordance with the gist of the present invention, its scope is not limited by these examples.

Examples

Experimental methods

Example 1: Overall process and model structure

The object of the present invention is to generate molecules with target properties while retaining a given scaffold as a substructure. To this end, the generative model of the present invention is set up to take the graph representation S of a molecular scaffold and to generate a graph G that is a supergraph of S. The underlying distribution of G can be written as p(G;S). This notation expresses a specific relationship between G and S, namely the supergraph-subgraph relationship. It should also be emphasized that p(G;S) is a distribution over G alone; S acts as a parametric argument that explicitly restricts the domain of the distribution. Molecular properties are introduced as conditions, under which the model defines the conditional distribution p(G;S|y, y_S), where y and y_S are vectors containing the property values of the molecule and of its scaffold, respectively. In other studies on molecule generation [25,27], a substructure moiety is often imposed as a condition, so that a conditional distribution p(G|S) is defined. Unlike p(G;S), the conditional distribution p(G|S) does not explicitly restrict the space of G by S, and because of its probabilistic nature it also admits graphs G that are not supergraphs of S. By contrast, a graph generated by the model of the present invention according to p(G;S) always contains S as a substructure. As described below, the model achieves this by sequentially adding nodes and edges to S. Before proceeding, the reader is referred to Table 1 for the notation used hereinafter.

In addition, where a clear distinction is needed, a molecule is referred to as a "whole molecule" to distinguish it from its scaffold.

What the model of the present invention learns is a strategy for extending a graph into larger graphs whose distribution follows that of real molecules. This is achieved by training the model to recover the molecules of a dataset from their scaffolds. The scaffold of a molecule can be defined in a deterministic way, such as that of Bemis and Murcko [39], which is the definition used in the experiments of the present invention.
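A deterministic scaffold definition of this kind can be illustrated on a bare graph. The Python sketch below repeatedly removes terminal atoms so that only ring systems and the linkers between them remain, which is the core idea of a Bemis-Murcko framework. It is a simplified assumption-laden illustration: it ignores bond orders, hydrogens, and any special treatment of exocyclic atoms that a cheminformatics toolkit may apply, and the function name `murcko_like_scaffold` is hypothetical.

```python
def murcko_like_scaffold(atoms, bonds):
    """Prune side chains; keep ring systems and the linkers joining them.

    atoms: iterable of atom indices; bonds: iterable of (u, v) pairs.
    Returns (set of kept atoms, set of kept bonds as frozensets).
    """
    atoms = set(atoms)
    bonds = {frozenset(b) for b in bonds}
    while True:
        degree = {a: 0 for a in atoms}
        for b in bonds:
            for a in b:
                degree[a] += 1
        # Atoms with at most one neighbor cannot belong to a ring or linker.
        terminal = {a for a, d in degree.items() if d <= 1}
        if not terminal:
            break
        atoms -= terminal
        bonds = {b for b in bonds if not (b & terminal)}
    return atoms, bonds

# Toluene-like graph: six-membered ring 0..5 plus a one-atom side chain (6).
ring = [(i, (i + 1) % 6) for i in range(6)]
kept_atoms, kept_bonds = murcko_like_scaffold(range(7), ring + [(0, 6)])

# An acyclic three-atom chain has no ring, so nothing is kept.
chain_atoms, chain_bonds = murcko_like_scaffold(range(3), [(0, 1), (1, 2)])
```

Training pairs (S, G) can then be formed by taking each dataset molecule as G and its pruned graph as S.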

The target graph is constructed by making successive decisions on node and edge additions. The decision at each construction step is derived from the node features and edge features of the graph at that step. The node and edge features are updated recurrently so as to reflect the construction history of the preceding steps. The construction process is described in more detail below.

The model of the present invention was realized as a VAE with the architecture shown in FIG. 1. The architecture consists of an encoder q_φ and a decoder p_θ, parametrized by φ and θ, respectively. The encoder encodes a graph G into an encoding vector z, and the decoder decodes z to recover G. The decoder requires the scaffold graph S as an additional input, and the actual decoding is carried out by sequentially adding nodes and edges to S. The encoding vector z plays its role by continually injecting its information whenever the node and edge features of the transient graph being processed are updated. As with p(G;S), the notation p_θ(G;S|z) indicates that the candidate generations of the decoder are always supergraphs of S. As for the encoder notation q_φ(z|G;S), it is emphasized that the encoder also depends on the scaffold because q_φ and p_θ are optimized jointly.
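For orientation, a VAE [33] is conventionally trained by maximizing an evidence lower bound. The expression below writes that generic bound with an encoder q_φ(z|G;S), a decoder p_θ(G;S|z), and a prior p(z). This section does not state the training loss, so the form shown is an assumption based on the standard VAE objective rather than the formula of the present invention, and the conditioning on the property vectors y and y_S is omitted for brevity.

```latex
\mathcal{L}(\theta, \phi; G, S)
  = \mathbb{E}_{q_\phi(z \mid G;S)}\bigl[\log p_\theta(G;S \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid G;S) \,\|\, p(z)\bigr)
```

The first term rewards recovering G from z by extending S, and the second keeps the encoded distribution close to the prior; because both terms share z, φ and θ are optimized jointly.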

As a latent variable model, a VAE can have difficulty generating meaningful samples when the manifold of the learned latent distribution is insufficient to cover all informative regions [40-42]. In generative models of molecules, this difficulty manifests itself as chemically invalid generated molecules [15,43], and various SMILES-based [43-46] and graph-based [24,29,47-49] models have accordingly introduced explicit constraints or learning algorithms to improve the validity of their generations. Similar constraints or algorithms could be incorporated into the model of the present invention; even without them, however, the model shows high chemical validity in the generated graphs (see Result 2 below).

Example 2: Graph Encoding

The goal of graph encoding is to generate a latent vector z for the graph G of a whole molecule. Given the graph G = (V(G), E(G)) of any whole molecule, each node v ∈ V(G) is first associated with a node feature vector h_v, and each edge (u, v) ∈ E(G) with an edge feature vector h_uv. The atom types and bond types of the molecule are chosen as the initial node and edge features. The initial feature vectors are then embedded into new vectors of higher dimension so that they have enough capacity to express deep information within and between the nodes and edges. To fully encode the structural information of the molecule, every node embedding vector h_v should contain not only the information of its own node v but also the relationship of v to its neighborhood. This can be achieved by propagating the information of each node to the other nodes in the graph. A variety of related methods have been devised, each defining a particular realization of graph neural networks [35].

In this work, the encoder q_φ is implemented as a variant of the interaction network [36,50]. The algorithm of the network consists of a propagation phase and a readout phase, which are written as follows.

\[ H'_{V(G)} = \mathrm{propagate}(H_{V(G)}, H_{E(G)}) \qquad (1) \]

\[ h_G = \mathrm{readout}(H'_{V(G)}) \qquad (2) \]

The propagation phase itself consists of two steps. The first step computes an aggregated message between each node and its neighbors as

\[ m_v = \sum_{u:\,(u,v) \in E(G)} M(h_v, h_u, h_{uv}) \qquad (3) \]

using a message function M. The second step updates the node vectors with the aggregated messages as

\[ h'_v = U(m_v, h_v) \qquad (4) \]

using an update function U. Updating every node feature vector in H_V(G) yields the updated set H'_V(G), as written in equation (1). The propagation phase is repeated a fixed number of times whenever it is applied, with a different set of parameters at each iteration step. After propagation, the readout phase (equation (2)) computes a weighted sum of the node feature vectors to produce a single vector representation h_G that summarizes the whole graph. Finally, the latent vector z is sampled from a normal distribution whose mean and variance are inferred from h_G.
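The propagation, readout and sampling steps described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the implementation of the present invention: the toy message function M, the toy update function U, the uniform readout weights and the way the mean and log-variance are read from h_G are all hypothetical placeholders for learned networks, and the same toy parameters are reused at every iteration for brevity.

```python
import math
import random

def propagate(node_feats, edge_feats):
    """One propagation step: aggregate messages (Eq. 3), then update nodes (Eq. 4)."""
    def M(h_v, h_u, h_uv):
        # Toy message function (assumption): neighbor feature scaled by the edge feature.
        return [h_uv * x for x in h_u]

    def U(m_v, h_v):
        # Toy update function (assumption): average of the old feature and the message.
        return [0.5 * (a + b) for a, b in zip(h_v, m_v)]

    messages = {v: [0.0] * len(h) for v, h in node_feats.items()}
    for (u, v), h_uv in edge_feats.items():
        # The graph is undirected, so each edge carries a message in both directions.
        for src, dst in ((u, v), (v, u)):
            msg = M(node_feats[dst], node_feats[src], h_uv)
            messages[dst] = [a + b for a, b in zip(messages[dst], msg)]
    return {v: U(messages[v], h) for v, h in node_feats.items()}

def readout(node_feats, weights):
    """Weighted sum of node feature vectors, giving one graph vector h_G (Eq. 2)."""
    dim = len(next(iter(node_feats.values())))
    h_G = [0.0] * dim
    for v, h in node_feats.items():
        h_G = [a + weights[v] * b for a, b in zip(h_G, h)]
    return h_G

def sample_latent(h_G, rng):
    """Sample z from a normal distribution; here the first half of h_G is taken
    as the mean and the second half as the log-variance (assumption)."""
    half = len(h_G) // 2
    mean, log_var = h_G[:half], h_G[half:]
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

# Three-atom chain 0-1-2 with 4-dimensional node features and scalar edge features.
nodes = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0, 0.0], 2: [0.0, 0.0, 1.0, 0.0]}
edges = {(0, 1): 1.0, (1, 2): 2.0}
for _ in range(2):  # fixed number of propagation steps
    nodes = propagate(nodes, edges)
h_G = readout(nodes, weights={0: 1.0, 1: 1.0, 2: 1.0})
z = sample_latent(h_G, random.Random(0))
```

Passing an explicit random generator keeps the sampling reproducible, which is convenient when comparing encodings of the same graph.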

Graph propagation can be conditioned by including an additional vector c when the aggregated messages are computed. In this case, the function M, and hence propagate, accepts c as an additional argument (that is, they become M(·,·,·,c) and propagate(·,·,c)). When an input graph is encoded, c is chosen to be the concatenation of the property vectors y and y_S so as to enable property-controlled generation. During graph decoding, the concatenation of y, y_S and the latent vector z is used as the condition vector (see below).

Example 3: Graph Decoding

The goal of graph decoding is to reconstruct the graph G of a whole molecule from the latent vector z sampled in the graph encoding phase. The graph decoding process of the present invention is motivated by the sequential generation strategy of Li et al. [25]. In the present invention, the whole-molecule graph G is built by successively adding nodes and edges to the scaffold graph G_0, which is extracted from G by the Bemis-Murcko method [39] using the RDKit software [51]. Here, G_0 = S denotes the initial scaffold graph, and G_t denotes a transient (or completed) graph constructed from G_0.

Graph decoding begins by preparing and propagating the initial node features of G_0. As for G, the initial feature vectors of G_0 are prepared by embedding the atom types and bond types of the scaffold molecule. This initial embedding is performed by the same network that is used for whole molecules (embed in FIG. 1). The initial feature vectors of G_0 are then propagated a fixed number of times by another interaction network. Once the propagation is finished, the decoder extends G_0 by running a loop of node additions together with an inner loop of accompanying edge additions. The process is described in detail as follows:

· Step 1: node addition. Choose an atom type or terminate the building process, according to estimated probabilities. If an atom type is chosen, add a new node w of the chosen type to the current transient graph G_t and proceed to Step 2. Otherwise, terminate the building process and return the graph.

· Step 2: edge addition. Given the new node w, choose a bond type or return to Step 1, according to estimated probabilities. If a bond type is chosen, proceed to Step 3.

· Step 3: node selection. Select a node v from the existing nodes other than w, according to estimated probabilities. Then add a new edge (v, w) of the bond type chosen in Step 2 to G_t, and continue the edge addition from Step 2.

The flow of the whole process is shown on the right side of FIG. 1. Steps 1-3 do not include the final step of selecting a suitable isomer, which is described separately below.
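The control flow of Steps 1-3 can be sketched as two nested loops. The sketch below is an illustration only: the atom and bond vocabularies are hypothetical, the learned probability estimates are replaced by uniform random choices, and the `max_nodes` cap is an added safeguard that the description does not mention. A real decoder would also check chemistry (for example, avoiding duplicate edges), which is omitted here.

```python
import random

ATOM_TYPES = ["C", "N", "O"]        # hypothetical vocabulary, n_a = 3
BOND_TYPES = ["single", "double"]   # hypothetical vocabulary, n_b = 2

def decode(scaffold_nodes, scaffold_edges, rng, max_nodes=20):
    """Grow a scaffold graph with the three-step loop; probabilities are uniform stubs."""
    nodes = list(scaffold_nodes)
    edges = list(scaffold_edges)
    while len(nodes) < max_nodes:
        # Step 1: choose an atom type, or the extra last index to terminate.
        a = rng.randrange(len(ATOM_TYPES) + 1)
        if a == len(ATOM_TYPES):
            break
        w = len(nodes)
        nodes.append(ATOM_TYPES[a])
        while True:
            # Step 2: choose a bond type, or the extra last index to go back to Step 1.
            b = rng.randrange(len(BOND_TYPES) + 1)
            if b == len(BOND_TYPES):
                break
            # Step 3: choose an existing node v (excluding w) and add the edge (v, w).
            v = rng.randrange(w)
            edges.append((v, w, BOND_TYPES[b]))
    return nodes, edges

# A six-membered carbon ring as the scaffold S.
s_nodes = ["C"] * 6
s_edges = [(i, (i + 1) % 6, "single") for i in range(6)]
nodes, edges = decode(s_nodes, s_edges, random.Random(1))
```

Because the loop only ever appends to the scaffold's node and edge lists, the returned graph is a supergraph of S by construction.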

At every step, the model determines its next action by estimating a probability vector over the candidate actions. Depending on whether the current step is to add an atom (Step 1), to add an edge (Step 2), or to select an atom to connect (Step 3), the probability vector is computed by one of the following formulas:

\[ \hat{p}^{an} = \mathrm{addNode}(H_{V(G_t)}, H_{E(G_t)}, z) \qquad (5) \]

\[ \hat{p}^{ae} = \mathrm{addEdge}(H_{V(G_t)}, H_{E(G_t)}, z) \qquad (6) \]

\[ \hat{p}^{sn} = \mathrm{selectNode}(H_{V(G_t)}, H_{E(G_t)}, z) \qquad (7) \]

The first probability vector p̂^an is an (n_a + 1)-length vector whose elements p̂^an_1 to p̂^an_{n_a} are the probabilities of the n_a atom types and whose last element p̂^an_{n_a+1} is the termination probability. The second vector p̂^ae has size n_b + 1; its elements p̂^ae_1 to p̂^ae_{n_b} are the probabilities of the n_b bond types, and p̂^ae_{n_b+1} is the probability of stopping the edge addition. Finally, the i-th element of the third vector p̂^sn is the probability of connecting the i-th existing node to the most recently added node.
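The layout of the (n_a + 1)-length node-addition vector can be made concrete with a few lines of Python. This is a sketch under assumptions: the description only says that the probabilities are estimated, so the softmax normalisation, the logits and the three-atom vocabulary below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret_node_action(probs, atom_types):
    """Split an (n_a + 1)-length vector into per-atom-type probabilities
    and the termination probability stored in the last element."""
    if len(probs) != len(atom_types) + 1:
        raise ValueError("expected one probability per atom type plus termination")
    return dict(zip(atom_types, probs[:-1])), probs[-1]

# Hypothetical logits for C, N, O and the termination action.
p_an = softmax([2.0, 1.0, 0.5, 0.0])
per_atom, p_stop = interpret_node_action(p_an, ["C", "N", "O"])
```

The edge-addition vector has the same layout with n_b bond types followed by the probability of stopping the edge addition.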

When the model decides to add a new node, say w, a corresponding feature vector h_w must be added to H_V(G_t). To this end, the model prepares an initial feature vector h_w^0 representing the atom type of w and then combines it with the existing node features in H_V(G_t) to compute an appropriate h_w. Similarly, when a new edge, say (v, w), is added, the model computes h_vw from h_v and h_w, together with an initial vector h_vw^0 representing the bond type of (v, w), to update H_E(G_t). The corresponding modules that initialize new nodes and edges are as follows:

The graph building modules addNode, addEdge and selectNode include a preceding step of propagating node features. For example, the actual operation performed by addNode is

\[ \mathrm{addNode}(H_{V(G_t)}, H_{E(G_t)}, z) = f \circ \mathrm{concat}(\cdot, z) \circ \mathrm{readout} \circ \mathrm{propagate}^{(k)}(H_{V(G_t)}, H_{E(G_t)}, z) \qquad (10) \]

where ∘ denotes function composition. According to the right-hand side, the module updates the node feature vectors through k rounds of graph propagation, computes a readout vector, concatenates it with z, and finally outputs the node-addition probability vector through a multilayer perceptron f. Likewise, addEdge and selectNode both begin with repeated applications of propagate. In this way, the node features are updated recurrently every time the transient graph evolves, and the prediction of every building event depends on the history of the preceding events.
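The composition described above (k propagation rounds, then readout, then concatenation with z, then a multilayer perceptron f) can be written as a chain of callables. The following is a toy sketch: `toy_propagate`, `toy_readout` and `toy_f` are hypothetical stand-ins for learned networks, and for brevity the toy propagation takes only the node features, whereas the described module also receives the edge features and z.

```python
from functools import reduce

def compose(*funcs):
    """compose(f, g, h)(x) == f(g(h(x)))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

def make_add_node(propagate, readout, f, z, k):
    """Build f ∘ concat(·, z) ∘ readout ∘ propagate^(k) as a single callable."""
    def concat_z(h):
        return h + z  # list concatenation with the latent vector
    return compose(f, concat_z, readout, *([propagate] * k))

# Toy stand-ins (assumptions); in a real model each would be a learned network.
def toy_propagate(H):
    return [[x + 1.0 for x in h] for h in H]

def toy_readout(H):
    return [sum(col) for col in zip(*H)]

def toy_f(h):
    total = sum(h)
    return [x / total for x in h]  # normalise to a probability-like vector

add_node = make_add_node(toy_propagate, toy_readout, toy_f, z=[1.0], k=3)
p = add_node([[0.0, 1.0], [2.0, 3.0]])
```

Building the module from smaller callables mirrors the description that addEdge and selectNode share the same leading propagation steps and differ only in what follows.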

As shown in equation (10), graph propagation within addNode (and likewise within addEdge and selectNode) incorporates the latent vector z, which encodes the whole-molecule graph G. This makes the model refer to z when making graph building decisions and ultimately reconstruct G by decoding z. If the model is to be conditioned on whole-molecule properties y and scaffold properties y_S, equations (5)-(7) and (10) are understood to incorporate concat(z, y, y_S) in place of z.

Example 4: Molecule Generation

To generate a new molecule, a scaffold S is required as an input, and a latent vector z is sampled from the standard normal distribution. The decoder then generates a new molecular graph as a supergraph of S. If molecules with specified molecular properties are desired, the corresponding property vectors y and y_S must be supplied to condition the building process.
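Preparing the decoder input at generation time amounts to sampling z and appending the property vectors. The sketch below assumes a hypothetical latent dimension of 4 and uses illustrative values for the target property y and the scaffold property y_S; whether the property values are normalised before concatenation is not stated in the description.

```python
import random

def sample_condition_vector(latent_dim, y, y_s, rng):
    """Sample z from the standard normal distribution and return concat(z, y, y_S)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return z + list(y) + list(y_s)

# Illustrative values (assumptions): target MW of 350 and scaffold MW of 78.1.
cond = sample_condition_vector(4, y=[350.0], y_s=[78.1], rng=random.Random(0))
```

Drawing a fresh z for each call while holding y and y_S fixed gives different candidate molecules for the same scaffold and target.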

Example 5: Isomer Selection

Molecules can have stereoisomers. Stereoisomers of a molecule have the same connectivity between atoms but different three-dimensional geometries. Consequently, the complete generation of a molecule must also specify its stereoisomerism. After a molecular graph has been constructed from z, the stereochemical configurations of its atoms and bonds are determined [24]. The isomer selection module selectIsomer prepares the graphs I of all possible stereoisomers enumerated by RDKit, whose 2D structures, without stereochemical labels, are identical to that of the constructed graph. Every prepared I includes the stereochemical configurations of atoms and bonds in its node and edge features. The module then estimates the selection probabilities as

Here, the elements of the resulting vector are the estimated probabilities of selecting each I.

Example 6: Objective Function

The objective function of the present invention has the form of the log-likelihood of an ordinary VAE [33]:

\[ \log p(G;S) \ge \mathbb{E}_{z \sim q_\phi}\left[\log p_\theta(G;S \mid z)\right] - D_{KL}\left[q_\phi(z \mid G;S) \,\|\, p(z)\right] \qquad (12) \]

Here, D_KL[·∥·] is the Kullback-Leibler divergence, and p(z) is the prior distribution, which is defined to be the standard normal in the model of the present invention. Maximizing the first term on the right-hand side maximizes the probability that the decoder p_θ recovers the graph G from its latent representation z, and maximizing the second term drives the encoder q_φ to be as close as possible to the prior. As mentioned in Example 1, the additional argument S in equation (12) expresses the explicit constraint imposed by the model that G is a supergraph of S. In actual training, there is a scaffold dataset 𝒮, and for each scaffold S ∈ 𝒮 there is a corresponding whole-molecule dataset D(S). Any set of molecules yields such a scaffold set and a collection of whole-molecule sets: defining the scaffolds of all molecules in the given set produces 𝒮, and the molecules of the given set can then be grouped into the collection of sets D(S). Using such 𝒮 and D(S), the objective is to find the optimal values of the parameters φ and θ that maximize the right-hand side of equation (12) and thereby maximize E_{S∼𝒮} E_{G∼D(S)}[log p(G;S)]. Here, the double expectation explicitly shows the dependence of the whole-molecule set D(S) on the scaffold S. That is, by the definition of the present invention, the scaffold set 𝒮 is defined first, and each S ∈ 𝒮 then defines a whole-molecule set D(S). The implementation and exact operation of the modules, together with the algorithm of the whole process, are described in detail in the ESI.
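Two ingredients of this objective lend themselves to a short sketch: the Kullback-Leibler term against a standard normal prior, and the grouping of a molecule set into per-scaffold sets D(S). The code below assumes a diagonal Gaussian encoder (the description states only that the mean and variance are estimated from h_G), takes the reconstruction log-likelihood as a given number, and uses hand-written (scaffold, molecule) SMILES pairs as illustrative data.

```python
import math
from collections import defaultdict

def kl_to_standard_normal(mean, log_var):
    """Closed-form D_KL[N(mean, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var))

def elbo(log_likelihood, mean, log_var):
    """Reconstruction term minus KL term for one molecule (single-sample estimate)."""
    return log_likelihood - kl_to_standard_normal(mean, log_var)

def group_by_scaffold(pairs):
    """Build {S: D(S)} from (scaffold, molecule) pairs."""
    groups = defaultdict(list)
    for scaffold, molecule in pairs:
        groups[scaffold].append(molecule)
    return dict(groups)

# Illustrative data (assumption): benzene and cyclohexane scaffolds.
pairs = [
    ("c1ccccc1", "Cc1ccccc1"),
    ("c1ccccc1", "Oc1ccccc1"),
    ("C1CCCCC1", "CC1CCCCC1"),
]
groups = group_by_scaffold(pairs)
```

Averaging `elbo` first over the molecules in each D(S) and then over the scaffolds gives a sample estimate of the double expectation.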

Results and Discussion

1. Datasets and experiments

The dataset used in the present invention is the synthetic screening compounds (version March 2018) provided by InterBioScreen Ltd. [52]. The dataset (hereafter, the IBS dataset) initially contained SMILES strings of organic compounds composed of H, C, N, O, F, P, S, Cl and Br atoms. Strings containing disconnected ions or fragments, and strings that could not be read by RDKit, were filtered out. The preprocessing yielded 349,809 training molecules and 116,603 test molecules. The number of heavy atoms was 27 on average and at most 132, and the average molecular weight was 389 g/mol. The number of distinct scaffolds was 85,318 in the training set and 42,751 in the test set. The experiments include training and evaluating the scaffold-based graph generative model on the described dataset. For conditional molecule generation, molecular weight (MW), topological polar surface area (TPSA) and octanol-water partition coefficient (log P) were used. The model was conditioned on one, two or all three of these properties, individually or jointly. The learning rate was set to 0.0001, and every instance of the model was trained for up to 20 epochs. Other hyperparameters, such as layer dimensions, are specified in the ESI. The molecular properties were calculated with RDKit. In the following, the units of MW (g/mol) and TPSA (Å²) are omitted for brevity.
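The filtering step of the preprocessing can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, and the readability check is injected as a callable so that the example runs without RDKit. With RDKit available, one would typically test whether `Chem.MolFromSmiles` returns a molecule rather than `None`; the toy parser below only checks balanced parentheses.

```python
def filter_smiles(smiles_list, can_parse):
    """Keep SMILES that form a single connected component and that the parser accepts."""
    kept = []
    for smiles in smiles_list:
        if "." in smiles:          # '.' separates disconnected components in SMILES
            continue
        if not can_parse(smiles):  # unreadable strings are dropped
            continue
        kept.append(smiles)
    return kept

def toy_parser(smiles):
    """Stand-in readability check (assumption): parentheses must be balanced."""
    depth = 0
    for ch in smiles:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

raw = ["CCO", "[Na+].[Cl-]", "CC(C", "c1ccccc1O"]
clean = filter_smiles(raw, toy_parser)
```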

2. Analysis of validity, uniqueness and novelty

The validity, uniqueness and novelty of the generated molecules are basic evaluation metrics for molecular generative models. For the precise meaning of the three metrics, the following definitions are used [53]:

Validity = (number of valid graphs) / (number of generated graphs)

Uniqueness = (number of non-duplicate valid graphs) / (number of valid graphs)

Novelty = (number of unique graphs not in the training set) / (number of unique graphs)
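The three ratios defined above translate directly into code. The sketch below assumes that each generated graph is represented by a canonical string (for example, a canonical SMILES), so that string equality means graph identity, and that validity is decided by a caller-supplied predicate; both are assumptions, and the sample data are illustrative.

```python
def generation_metrics(generated, is_valid, training_set):
    """Return validity, uniqueness and novelty as defined in the text."""
    valid = [g for g in generated if is_valid(g)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# Illustrative data: one invalid string, one duplicate, one training-set member.
generated = ["CCO", "CCO", "CCN", "bad", "CCC"]
metrics = generation_metrics(generated, lambda s: s != "bad", training_set={"CCC"})
```

Here four of five strings are valid, three of the four valid strings are distinct, and two of the three distinct strings are absent from the training set.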

A graph is defined to be valid when it satisfies basic chemical requirements such as valency. In practice, RDKit was used to determine the validity of the generated graphs. Checking these metrics is particularly important for the model of the present invention, because generating molecules from a scaffold restricts the space of candidate products. The models conditioned individually on MW, TPSA or log P were evaluated by randomly selecting 100 scaffolds from the dataset and generating 100 molecules from each scaffold. The target values (100 values for each property) were randomly sampled from the respective property distributions of the dataset. For MW, generating a molecule whose MW is smaller than that of its scaffold is unnatural, so such cases were excluded from the evaluation.

Table 2 summarizes the validity, uniqueness and novelty of the molecules generated by the model of the present invention, together with the results of other molecular generative models for comparison.

Because the values for the other models were collected from their respective reports, the comparison is only approximate. Despite the strict constraint imposed by the scaffolds, the model of the present invention shows high validity, uniqueness and novelty compared with the results of the other models. The high uniqueness and novelty are particularly meaningful given that most scaffolds in the training set have only a few whole molecules. For example, of the 85,318 scaffolds in the training set, 79,700 have fewer than ten whole molecules. It is therefore unlikely that the model achieved such high performance simply by memorizing the training set, and it can be concluded that the model learns general chemical rules for extending arbitrary scaffolds.

3. Single-property control

As the next analysis, it was tested whether the scaffold-based graph generative model can generate molecules that have a given scaffold and desired properties at the same time.

Although several molecular generative models have been developed to control the molecular properties of generated molecules, controlling the properties under the constraint imposed by a given scaffold is expected to be harder. The target values were set to 300, 350 and 400 for MW, 80, 100 and 120 for TPSA, and 5, 6 and 7 for log P. For all nine cases, the same 100 scaffolds as used for Result 2 were used, and 100 molecules were generated for each scaffold.

FIG. 2 shows the property distributions of the generated molecules. The property distributions are well centered around the target values. This shows that, despite the narrowed search space, the model successfully generated new molecules with the desired properties. To show how the model extends given scaffolds according to the specified property values, some of the generated molecules are depicted in FIG. 3. For each of the target conditions MW = 400, TPSA = 120 and log P = 7, nine examples were randomly sampled using three different scaffolds.

The model generated new molecules with the specified properties by adding appropriate side chains: for example, it added hydrophobic groups to the scaffolds to generate high-log P molecules, whereas it added polar functional groups to generate high-TPSA molecules.

4. Scaffold dependence

The molecular design process of the present invention starts from a given scaffold, to which nodes and edges are successively added. The performance of the model may therefore be affected by the kind of scaffold. Accordingly, it was tested whether the model retains its ability to generate desirable molecules when new scaffolds are given. Specifically, a set of 100 new scaffolds not included in the training set (hereinafter “unseen” scaffolds) and a further set of 100 scaffolds from the training set (hereinafter “seen” scaffolds) were prepared. Then, 100 molecules were generated for each scaffold with randomly specified property values. This process was repeated for MW, TPSA and log P.

Table 3 summarizes the validity, uniqueness and MAD of the molecules generated from the seen and unseen scaffolds.

Here, MAD denotes the mean absolute difference between the specified property values and the property values of the generated molecules. The results show no significant difference in the three metrics between the two sets of scaffolds. This demonstrates that the model can be applied generally to arbitrary scaffolds in generating valid molecules with controlled properties.
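The MAD statistic used here is straightforward to compute. A minimal sketch follows; the function name and the sample numbers are illustrative and not taken from the reported experiments.

```python
def mean_absolute_difference(targets, values):
    """Mean of |target - value| over paired target and generated property values."""
    if len(targets) != len(values) or not targets:
        raise ValueError("targets and values must be non-empty and of equal length")
    return sum(abs(t - v) for t, v in zip(targets, values)) / len(targets)

# Illustrative MW targets and the MW values of the corresponding generated molecules.
mad = mean_absolute_difference([350.0, 400.0, 450.0], [340.0, 410.0, 455.0])
```

In this example the absolute differences are 10, 10 and 5, so the MAD is 25/3.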

5. Multi-property control

It is rare for a new molecular design to require control of only one specific molecular property. Drug design, among others, particularly requires simultaneous control of multiple molecular properties. In this light, the ability of the model to simultaneously control any two of MW, TPSA and log P was tested first. Three instances of the model were trained, jointly conditioned on MW and TPSA, MW and log P, and log P and TPSA, respectively. Then, two target values were specified for each property (350 and 450 for MW, 50 and 100 for TPSA, and 2 and 5 for log P) and combined to prepare four generation conditions for each pair. Under every generation condition, the randomly sampled 100 scaffolds used in Results 2 and 3 were used, and 100 molecules were generated from each scaffold. Generations whose target MW was smaller than the MW of the scaffold used were excluded.

FIG. 4 shows the results of the generations conditioned on MW and TPSA, MW and log P, and log P and TPSA. The joint distributions of the property values of the generated molecules are plotted, with Gaussian kernels used for the kernel density estimation. The modes of the distributions are well located near the points of the target values. As an exception, the distribution for the target (log P, TPSA) = (2, 50) shows a relatively long tail toward larger log P and TPSA values. This is because log P and TPSA are negatively correlated with each other by definition, so that requiring small values for both can be an unphysical generation task. Intrinsic correlations between molecular properties can thus cause even seemingly reasonable targets to yield dispersed property distributions. The result for another target, (log P, TPSA) = (5, 50), may be such an example, although in most cases such dispersion accounts for only a very small portion of all generations.

In addition, conditional generation incorporating all three properties was tested. The same target values of MW, TPSA and log P as above were used, giving generation results for a total of eight conditions. The remaining settings, including the scaffold set and the number of generations, were retained. The results are shown in FIG. 5, in which the MW, TPSA and log P values of the generated molecules are plotted. The plot shows that the distributions from the different target conditions are well separated from one another. As in the two-property results, all distributions are well concentrated around their target values.
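The grids of target conditions used in the two- and three-property experiments can be enumerated with `itertools`. The sketch below uses the target values stated in the text; the function names and the dictionary layout are hypothetical, and the MW rule implements the stated exclusion of cases whose target MW is smaller than the scaffold's MW.

```python
from itertools import combinations, product

TARGETS = {"MW": (350, 450), "TPSA": (50, 100), "logP": (2, 5)}

def generation_conditions(properties):
    """All combinations of the two target values of each selected property."""
    names = list(properties)
    grids = (TARGETS[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*grids)]

def keep_case(condition, scaffold_mw):
    """Drop cases whose target MW is smaller than the scaffold's own MW."""
    return "MW" not in condition or condition["MW"] >= scaffold_mw

# Four conditions for each of the three property pairs, eight for the triple.
pair_conditions = {pair: generation_conditions(pair) for pair in combinations(TARGETS, 2)}
triple_conditions = generation_conditions(("MW", "TPSA", "logP"))
```

Each pair yields 2 × 2 = 4 conditions and the triple yields 2 × 2 × 2 = 8, matching the counts given in the text.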

The generation performance of the model was also compared quantitatively between single- and multi-property control. Table 4 shows the performance statistics of single-, double- and triple-property control in terms of MAD, validity and novelty.

Using the same 100 scaffolds, 100 molecules were generated each time under randomly specified target conditions. As the number of incorporated properties increases from one to two and three, the overall magnitudes of the descriptors are well preserved. The slight increase in the MAD values is attributed to the additional confinement of the chemical space enforced by the intrinsic correlations between multiple properties. Nevertheless, the magnitude of the worsening is small compared with the mean values of the properties (389 for MW, 77 for TPSA and 3.6 for log P).

6. Design of EGFR inhibitors by semi-supervised learning

Deep learning methods often require millions of labeled data points to reach their full performance. However, many practical applications suffer from data scarcity. For example, in inhibitor design, the binding affinity data available for a target protein typically amount to only a few thousand. One possible approach for such cases is semi-supervised learning, which incorporates a large amount of unlabeled data together with a small amount of labeled data for training. Indeed, semi-supervised learning has been reported to be effective for conditional molecular design when only a small fraction of the molecules in a dataset have property values [16]. To confirm the applicability of the scaffold-based molecular graph generative model of the present invention in a similar situation, the model was equipped with a semi-supervised learning scheme and tested on how well it can design inhibitors of the human EGFR protein, for which the amount of data is limited.

A semi-supervised VAE [16,54] was adopted because it can be implemented easily by adding a label predictor to an ordinary VAE. In the semi-supervised VAE, the predictor is trained jointly with the VAE part to predict the pIC₅₀ (negative logarithm of the half maximal inhibitory concentration) value against human EGFR.

For labeled molecules, the true pIC₅₀ values were used to train the predictor and were also used in the VAE condition vector. For unlabeled molecules, only the VAE part was trained, using the pIC₅₀ values predicted by the predictor. The dataset for semi-supervised learning was prepared by adding 8,016 molecules from the ChEMBL database [55] to the IBS training molecules used above. The ChEMBL molecules were those having pIC₅₀ values and were therefore labeled, whereas the IBS molecules were unlabeled. Each molecule was paired with the scaffold extracted from it, and all scaffolds, from ChEMBL as well as from IBS molecules, were treated as unlabeled. The ChEMBL molecules were divided into 6,025 training molecules and 1,991 test molecules, resulting in a label ratio of only 1.7% in the training set. To train the predictor effectively, labeled and unlabeled inputs were sampled at a ratio of 1:5 in every batch. After 20 epochs of training, 100 scaffolds whose predicted pIC₅₀ values were between 5 and 6 were selected from the ChEMBL test set, and 100 molecules were generated from each scaffold with a target pIC₅₀ value of 8.
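
The batch composition and the choice of condition value described above can be sketched as follows. This is a minimal illustration rather than the training code of the present invention: the function names `sample_mixed_batch` and `condition_value`, the batch size of 12 and sampling without replacement are all assumptions made for the example.

```python
import random

def sample_mixed_batch(labeled, unlabeled, batch_size=12, ratio=(1, 5), rng=None):
    """Draw one batch whose labeled:unlabeled counts follow the given ratio."""
    rng = rng or random.Random()
    unit = ratio[0] + ratio[1]
    if batch_size % unit:
        raise ValueError("batch_size must be a multiple of the ratio sum")
    n_labeled = batch_size * ratio[0] // unit
    n_unlabeled = batch_size - n_labeled
    # Sampling without replacement within a batch is an assumption of this sketch.
    return rng.sample(labeled, n_labeled), rng.sample(unlabeled, n_unlabeled)

def condition_value(true_pic50, predicted_pic50):
    """Use the measured pIC50 when a label exists, otherwise the predicted one."""
    return true_pic50 if true_pic50 is not None else predicted_pic50

# Hypothetical molecule identifiers standing in for the two data sources.
labeled_ids = list(range(100))
unlabeled_ids = list(range(100, 1000))
batch_labeled, batch_unlabeled = sample_mixed_batch(
    labeled_ids, unlabeled_ids, batch_size=12, rng=random.Random(0)
)
```

Fixing the ratio per batch means the predictor sees labeled examples in every update even though they make up only a small fraction of the training set.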

The MAD of the pIC₅₀ predictions for the ChEMBL test molecules was 0.58. Over a total of 10,000 generations, the model showed a validity of 96.6%, a uniqueness of 44.9% and a novelty of 99.7%. The relatively low uniqueness is presumably because using a single fixed target value imposes a strict condition on the search space and thereby increases the redundancy of generation. FIG. 6 shows the distributions of the predicted pIC₅₀ values for the 100 scaffolds and for the generated molecules. Although the distribution of the generated molecules is centered at values lower than 8 and is relatively broad, the mean improvement in pIC₅₀ relative to the scaffolds was 1.29 (counting only unique molecules), which corresponds to an approximately 19.7-fold enhancement of inhibitory potency in terms of IC₅₀. Molecules predicted to have a pIC₅₀ greater than 8, which would be of more practical interest, account for 20% of the unique generated molecules. These results show that the model of the present invention can be applied, with minimal extension, to many practical situations in which data preparation is a bottleneck.
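
The link between a pIC₅₀ gain and a fold change in IC₅₀ follows from the definition pIC₅₀ = -log₁₀(IC₅₀). The short sketch below works through that conversion; the function name `fold_change_in_ic50` is hypothetical. With the rounded improvement of 1.29 the ratio comes out near 19.5, and a factor of 19.7 corresponds to an improvement of about 1.294, which would be consistent with the quoted factor having been computed from an unrounded mean (an assumption, not something the text states).

```python
import math

def fold_change_in_ic50(delta_pic50):
    """IC50 ratio implied by a pIC50 gain, since pIC50 = -log10(IC50)."""
    return 10.0 ** delta_pic50

# Fold change implied by the rounded mean improvement quoted in the text.
fold = fold_change_in_ic50(1.29)

# pIC50 improvement that would correspond exactly to a 19.7-fold change.
implied_delta = math.log10(19.7)
```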

The present invention provides a scaffold-based molecular graph generative model. The model generates new molecules from a desired substructure, or scaffold, by sequentially adding new atoms and bonds to the graph of the scaffold. In contrast to other related methods, for example those that condition on substructures probabilistically during generation, the strategy of the present invention naturally guarantees that each generated molecule is an extension of the scaffold.
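
The sequential extension of a scaffold graph can be sketched with a toy loop that cycles through the three decisions recited in the claims: node addition, edge addition and node selection. This is a minimal sketch under stated assumptions, not the model of the present invention: a random number generator stands in for the learned probability vectors, the atom and bond vocabularies are hypothetical, no valence checks are made, the scaffold is assumed to be non-empty, and every new atom is forced to receive at least one bond so that the result stays connected.

```python
import random

ATOM_TYPES = ["C", "N", "O"]        # hypothetical atom vocabulary
BOND_TYPES = ["single", "double"]   # hypothetical bond vocabulary
STOP = None                         # sentinel for "terminate" or "no more edges"

def extend_scaffold(nodes, edges, rng, max_new_atoms=10):
    """Grow a graph from a scaffold; scaffold nodes and edges are never removed."""
    nodes, edges = list(nodes), dict(edges)
    for _ in range(max_new_atoms):
        atom = rng.choice(ATOM_TYPES + [STOP])      # (a) node addition or stop
        if atom is STOP:
            break
        w = len(nodes)                              # index of the new node
        nodes.append(atom)
        candidates = list(range(w))                 # existing nodes, excluding w
        first = True
        while candidates:
            # The first bond is mandatory in this sketch; later ones are optional.
            options = BOND_TYPES if first else BOND_TYPES + [STOP]
            bond = rng.choice(options)              # (b) edge addition
            if bond is STOP:
                break                               # return to node addition
            v = rng.choice(candidates)              # (c) node selection
            candidates.remove(v)
            edges[(v, w)] = bond
            first = False
    return nodes, edges

# Hypothetical three-atom scaffold: a chain C-C-C.
scaffold_nodes = ["C", "C", "C"]
scaffold_edges = {(0, 1): "single", (1, 2): "single"}
new_nodes, new_edges = extend_scaffold(scaffold_nodes, scaffold_edges, random.Random(0))
```

Because the loop only ever appends nodes and edges, the scaffold is contained in every output by construction, which is the property contrasted above with probabilistic conditioning.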

Because generating molecules from a scaffold can exploit the properties that the scaffold already has, optimizing or retaining target properties can be easier than with de novo molecular generation. When the aim is to optimize pharmacological activity, generating an optimal molecular structure from scratch can be difficult owing to the complexity of molecular structures. In such cases, using a scaffold with moderate activity can make the optimization more feasible. A similar advantage holds when multiple properties are controlled simultaneously. For example, if the goal is to secure synthetic accessibility and improve the activity of the generated molecules, using a readily synthesizable scaffold lets the search focus on the latter, which increases search efficiency.

The model of the present invention was evaluated by examining the validity, uniqueness and novelty of the generated molecules. Despite the constraint on the search space imposed by the scaffold, the model showed results comparable to those of previous SMILES-based and graph-based molecular generative models. The model performed consistently well in terms of the three metrics when given new scaffolds that were not included in the training set. This means that the model achieved good generalization rather than memorizing the pairings between scaffolds and molecules in the training set. Moreover, while retaining the given scaffold, the model successfully generates new molecules having desired values of molecular properties such as molecular weight, topological polar surface area and octanol-water partition coefficient. The property-controlled generation can incorporate multiple molecular properties simultaneously. The model can also incorporate semi-supervised learning in applications such as protein inhibitor design, one of the common situations in which only a small amount of labeled data is available. From all the results presented here, it can be seen that the scaffold-based molecular graph generative model of the present invention provides a practical way to optimize the functionality of molecules with a preserved core structure.
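
The three evaluation metrics can be sketched as below. This assumes commonly used definitions (validity as the valid fraction of all generated outputs, uniqueness as the distinct fraction of valid outputs, novelty as the fraction of distinct valid outputs absent from the training set); the exact definitions used in the present invention may differ. The function name `generation_metrics` and the bracket-matching validity check are hypothetical stand-ins; in practice the strings would be canonical SMILES and validity would be checked with a cheminformatics toolkit such as RDKit [51].

```python
def generation_metrics(generated, training_set, is_valid):
    """Return validity, uniqueness and novelty under the assumed definitions."""
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

def toy_is_valid(smiles):
    """Toy validity check: only verifies that parentheses are balanced."""
    return smiles.count("(") == smiles.count(")")

# Hypothetical generated strings and training set.
generated = ["CCO", "CCO", "CCN", "C(", "CCC"]
training = {"CCC"}
metrics = generation_metrics(generated, training, toy_is_valid)
```

A low uniqueness together with a high novelty, as reported for the EGFR experiment, is possible under these definitions because novelty is computed only over the distinct molecules.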

References

1 P. G. Polishchuk, T. I. Madzhidov and A. Varnek, J. Comput.-Aided Mol. Des., 2013, 27, 675-679.

2 S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, J. Wang, B. Yu, J. Zhang and S. H. Bryant, Nucleic Acids Res., 2016, 44, D1202-D1213.

3 K. H. Bleicher, Y. Wüthrich, G. Adam, T. Hoffmann and A. J. Sleight, Bioorg. Med. Chem. Lett., 2002, 12, 3073-3076.

4 G. L. Card, L. Blasdel, B. P. England, C. Zhang, Y. Suzuki, S. Gillette, D. Fong, P. N. Ibrahim, D. R. Artis, G. Bollag, M. V. Milburn, S.-H. Kim, J. Schlessinger and K. Y. J. Zhang, Nat. Biotechnol., 2005, 23, 201-207.

5 M. E. Welsch, S. A. Snyder and B. R. Stockwell, Curr. Opin. Chem. Biol., 2010, 14, 347-361.

6 Y. Im, M. Kim, Y. J. Cho, J.-A. Seo, K. S. Yook and J. Y. Lee, Chem. Mater., 2017, 29, 1946-1963.

7 J. Dhar, U. Salzner and S. Patil, J. Mater. Chem. C, 2017, 5, 7404-7430.

8 A. Al Mousawi, F. Dumur, P. Garra, J. Toufaily, T. Hamieh, B. Graff, D. Gigmes, J. P. Fouassier and J. Lalevée, Macromolecules, 2017, 50, 2747-2758.

9 X. Sun, F. Wu, C. Zhong, L. Zhu and Z. Li, Chem. Sci., 2019, 10, 6899-6907.

10 B. Sanchez-Lengeling and A. Aspuru-Guzik, Science, 2018, 361, 360-365.

11 H. Chen, O. Engkvist, Y. Wang, M. Olivecrona and T. Blaschke, Drug Discovery Today, 2018, 23, 1241-1250.

12 M. H. S. Segler, T. Kogej, C. Tyrchan and M. P. Waller, ACS Cent. Sci., 2018, 4, 120-131.

13 A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider and G. Schneider, Mol. Inf., 2018, 37, 1700111.

14 E. J. Bjerrum and B. Sattarov, Biomolecules, 2018, 8(4), 131.

15 R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams and A. Aspuru-Guzik, ACS Cent. Sci., 2018, 4, 268-276.

16 S. Kang and K. Cho, J. Chem. Inf. Model., 2019, 59, 43-52.

17 J. Lim, S. Ryu, J. W. Kim and W. Y. Kim, J. Cheminf., 2018, 10, 31.

18 D. Polykovskiy, A. Zhebrak, D. Vetrov, Y. Ivanenkov, V. Aladinskiy, P. Mamoshina, M. Bozdaganyan, A. Aliper, A. Zhavoronkov and A. Kadurin, Mol. Pharm., 2018, 15, 4398-4405.

19 G. Lima Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. Cunha Farias and A. Aspuru-Guzik, arXiv e-prints, arXiv:1705.10843, 2017.

20 N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner and D. Eck, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1645-1654.

21 M. Olivecrona, T. Blaschke, O. Engkvist and H. Chen, J. Cheminf., 2017, 9, 48.

22 D. Neil, M. H. S. Segler, L. Guasch, M. Ahmed, D. Plumbley, M. Sellwood and N. Brown, 6th International Conference on Learning Representations, Workshop Track Proceedings, Vancouver, BC, Canada, 2018.

23 M. Popova, O. Isayev and A. Tropsha, Sci. Adv., 2018, 4, eaap7885.

24 W. Jin, R. Barzilay and T. Jaakkola, Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 2018, pp. 2323-2332.

25 Y. Li, O. Vinyals, C. Dyer, R. Pascanu and P. Battaglia, 6th International Conference on Learning Representations, Workshop Track Proceedings, Vancouver, BC, Canada, 2018.

26 J. You, B. Liu, Z. Ying, V. Pande and J. Leskovec, in Advances in Neural Information Processing Systems 31, ed. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett, Curran Associates, Inc., 2018, pp. 6410-6421.

27 Y. Li, L. Zhang and Z. Liu, J. Cheminf., 2018, 10, 33.

28 R. Assouel, M. Ahmed, M. H. Segler, A. Saffari and Y. Bengio, arXiv e-prints, arXiv:1811.09766, 2018.

29 Q. Liu, M. Allamanis, M. Brockschmidt and A. Gaunt, in Advances in Neural Information Processing Systems 31, ed. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett, Curran Associates, Inc., 2018, pp. 7795-7804.

30 M. Simonovsky and N. Komodakis, Artificial Neural Networks and Machine Learning - ICANN 2018, Cham, Switzerland, 2018, pp. 412-422.

31 N. De Cao and T. Kipf, arXiv e-prints, arXiv:1805.11973, 2018.

32 M. Skalic, J. Jiménez, D. Sabbadin and G. De Fabritiis, J. Chem. Inf. Model., 2019, 59, 1205-1214.

33 D. P. Kingma and M. Welling, 2nd International Conference on Learning Representations, Banff, AB, Canada, 2014.

34 D. Weininger, J. Chem. Inf. Model., 1988, 28, 31-36.

35 Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang and P. S. Yu, arXiv e-prints, arXiv:1901.00596, 2019.

36 J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1263-1272.

37 D. Polykovskiy, A. Zhebrak, B. Sanchez-Lengeling, S. Golovanov, O. Tatanov, S. Belyaev, R. Kurbanov, A. Artamonov, V. Aladinskiy, M. Veselov, A. Kadurin, S. Nikolenko, A. Aspuru-Guzik and A. Zhavoronkov, arXiv e-prints, arXiv:1811.12823, 2018.

38 J. You, R. Ying, X. Ren, W. Hamilton and J. Leskovec, Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 2018, pp. 5708-5717.

39 G. W. Bemis and M. A. Murcko, J. Med. Chem., 1996, 39, 2887-2893.

40 A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow and B. Frey, 4th International Conference on Learning Representations, Workshop Track Proceedings, San Juan, Puerto Rico, 2016.

41 J. He, D. Spokoyny, G. Neubig and T. Berg-Kirkpatrick, 7th International Conference on Learning Representations, Conference Track Proceedings, New Orleans, LA, USA, 2019.

42 I. Goodfellow, arXiv e-prints, arXiv:1701.00160, 2016.

43 R.-R. Griffiths and J. M. Hernández-Lobato, Chem. Sci., 2020, DOI: 10.1039/C9SC04026A.

44 M. J. Kusner, B. Paige and J. M. Hernández-Lobato, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1945-1954.

45 H. Dai, Y. Tian, B. Dai, S. Skiena and L. Song, 6th International Conference on Learning Representations, Conference Track Proceedings, Vancouver, BC, Canada, 2018.

46 O. Mahmood and J. M. Hernández-Lobato, arXiv e-prints, arXiv:1905.09885, 2019.

47 T. Ma, J. Chen and C. Xiao, in Advances in Neural Information Processing Systems 31, ed. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett, Curran Associates, Inc., 2018, pp. 7113-7124.

48 B. Samanta, A. De, G. Jana, P. K. Chattaraj, N. Ganguly and M. Gomez-Rodriguez, arXiv e-prints, arXiv:1802.05283, 2018.

49 Y. Kwon, J. Yoo, Y.-S. Choi, W.-J. Son, D. Lee and S. Kang, J. Cheminf., 2019, 11, 70.

50 P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende and K. Kavukcuoglu, in Advances in Neural Information Processing Systems 29, ed. D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett, Curran Associates, Inc., 2016, pp. 4502-4510.

51 RDKit: Open-Source Cheminformatics, http://www.rdkit.org.

52 InterBioScreen Ltd, http://www.ibscreen.com.

53 N. Brown, M. Fiscato, M. H. Segler and A. C. Vaucher, J. Chem. Inf. Model., 2019, 59, 1096-1108.

54 D. P. Kingma, S. Mohamed, D. Jimenez Rezende and M. Welling, in Advances in Neural Information Processing Systems 27, ed. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger, Curran Associates, Inc., 2014, pp. 3581-3589.

55 A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani and J. P. Overington, Nucleic Acids Res., 2011, 40, D1100-D1107.

Claims

1. A method for generating an extended molecular structure having a desired scaffold structure, the method comprising the steps of:
(a) training a scaffold-based molecular generative model, the training comprising:
(a-1) training an encoder; and
(a-2) training a decoder; and
(b) a target molecule generation step in which the extended molecular structure is obtained by inputting a scaffold graph into the trained scaffold-based molecular generative model.

2. The method of claim 1,
wherein the encoder comprises an algorithm consisting of:
a propagation step, H'_V(G) = propagate(H_V(G), H_E(G)); and
a readout step, h_G = readout(H'_V(G)).

3. The method of claim 2,
wherein the propagation step consists of the following steps:
(a) a first propagation step of calculating an aggregated message between each node and its adjacent nodes by Equation 1 below; and
[Equation 1]

(b) a second propagation step of updating a node vector using the aggregated message by Equation 2 below;
[Equation 2]

.

4. The method of claim 3,
wherein the propagation step comprises calculating an aggregated message that includes an additional vector c to adjust the graph propagation.

5. The method of claim 1,
wherein the decoder recovers G from z by a decoding process comprising the following steps:
(a) a node addition step of (i) selecting the type of atom to be added or (ii) terminating the building process;
(b) an edge addition step of (i) selecting a bond type for the node w newly added in step (a) or (ii) returning to step (a); and
(c) a node selection step of selecting a node v from the existing nodes excluding w, adding an edge (v, w) with the bond type selected in step (b), and returning to step (b).

6. The method of claim 5,
wherein the probability vector for determining the next action in steps (a) to (c) is calculated by any one of Equations 3 to 5 below:
[Equation 3]

[Equation 4]

[Equation 5]

.