KR20220022059A

KR20220022059A - Apparatus for substructure-based neural machine translation for retrosynthetic prediction and method thereof

Info

Publication number: KR20220022059A
Application number: KR1020200094585A
Authority: KR
Inventors: 볼칸 우미트; 고준수; 이주용
Original assignee: 주식회사 아론티어
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2022-02-24
Also published as: KR20220062485A

Abstract

With the rapid development of machine translation, neural machine translation has come to play an important role in retrosynthesis, the search for a rational synthesis route to a target molecule. Previous studies have shown that the sequence-to-sequence framework of neural machine translation is a promising way to solve the retrosynthetic planning problem. The present invention recasts retrosynthetic planning as a language translation problem by using a template-free sequence-to-sequence model. The model of the embodiment is trained end to end and is fully data-driven. Unlike conventional methods, which translate the simplified molecular input line entry system (SMILES) strings of reactants and products, the present invention represents chemical reactions in terms of molecular fragments. It can therefore produce better predictions than existing methods. An embodiment of the present invention also addresses problems of existing retrosynthetic methods, such as the generation of many invalid SMILES strings. Specifically, an embodiment of the present invention predicts highly similar reactant molecules with an accuracy of 57.7%, and its predictions are more robust than those of existing methods. The retrosynthetic prediction method includes a step of training a neural machine translation (NMT) model using final product-reactant pairs.

Description

Substructure-based neural machine translation apparatus for retrosynthetic prediction and translation method using the same

Retrosynthesis is the process of working out, for a synthesis that targets a given organic compound, from which starting materials and by which route a target compound with a specific chemical structure can be synthesized. To do so, precursors from which the chemical structure of the target compound could plausibly be reached are identified by working in the direction opposite to the actual reaction path, and precursors leading to those precursors are considered in turn until the starting materials are reached. This is an important step in drug design in particular: after ligand candidates capable of binding a target protein have been virtually screened, for example on the basis of binding affinity, retrosynthesis is used to find among those candidates the substances that can actually be synthesized or commercialized.

Although knowledge of organic chemistry has accumulated over many years of research, designing efficient synthetic routes for target molecules remains an important challenge in organic synthesis. The retrosynthetic approach provides a logical synthetic route that produces the target molecule from a set of available reactants and reagents. Because retrosynthetic transformations require sequential operations, the approach is naturally iterative and recursive. The transformations are applied recursively until simpler, commercially available molecules are identified.

Computational retrosynthetic analysis was first formalized algorithmically by Corey and Wipke in 1969 (Corey, E. J.; Todd Wipke, W. Computer-assisted design of complex organic syntheses. Science 1969, 166, 178-192). The algorithm considers all possible disconnections that correspond to known chemical reactions, reducing the complexity of the product, and proceeds until a chemically reasonable route is identified. These disconnections are hand-crafted minimal transformation rules, widely known as reaction templates. Encoding such transformation rules manually requires deep chemical expertise and intuition. Given the large number of transformation rules that must be hand-coded (>10000), manual curation of synthetic knowledge is a very complex task. Furthermore, dependence on reaction templates potentially limits prediction accuracy, particularly when a reaction falls outside the template domain. Later work made automatic extraction of reaction templates possible, allowing chemists to find better routes faster. However, it did not address the aforementioned limitations inherited from the earlier studies. Computer-aided synthesis planning has recently been reviewed in a number of studies.

The Reaction Predictor developed in [Kayala, M. A.; Azencott, C.-A.; Chen, J. H.; Baldi, P. Learning to Predict Chemical Reactions. J. Chem. Inf. Model. 2011, 51, 2209-2222] and [Kayala, M. A.; Baldi, P. ReactionPredictor: Prediction of Complex Chemical Reactions at the Mechanistic Level Using Machine Learning. J. Chem. Inf. Model. 2012, 52, 2526-2540] was the first template-free approach. It was a mechanistic-level strategy that applied ideas from rule-based modeling and machine learning to the framework of synthesis planning. [Jin, W.; Coley, C. W.; Barzilay, R.; Jaakkola, T. Predicting organic reaction outcomes with weisfeiler-lehman network. Adv. Neur. In. 2017, 2017-Decem, 2608-2617] proposed a new, fully data-driven template-free method based on the Weisfeiler-Lehman network. Both approaches provide an end-to-end solution for generating candidate products. The theoretical findings presented in [Cadeddu, A.; Wylie, E. K.; Jurczak, J.; Wampler-Doty, M.; Grzybowski, B. A. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew. Chem. Int. Edit. 2014, 53, 8108-8112] further stimulated the development of other template-free methods that use various neural machine translation (NMT) architectures to predict forward or reverse reactions. Building on the clear analogy between sentences in a language corpus and molecules in a chemical corpus, that is, in chemical space, Cadeddu et al. showed that the rank-frequency distribution of substructures, the building blocks of molecules, resembles that of words in a language corpus. This finding means that concepts from linguistic analysis can readily be applied to the problems of forward reaction prediction and retrosynthetic prediction. From this point of view, retrosynthetic prediction is well suited to the sequence-to-sequence framework of machine translation.

Sequence-to-sequence learning uses a recurrent neural network (RNN) layer to map a source sequence of arbitrary length to a fixed-dimensional context vector of real numbers. The context vector carries information about the syntactic and semantic structure of the source sequence. A second RNN, connected to the first, decodes the context vector into the target sequence. In this sense, the two RNN units act together as an encoder-decoder pair. [Sutskever, I.; Vinyals, O.; Le, Q. V. Sequence to sequence learning with neural networks. Adv. Neur. In. 2014, 4, 3104-3112] showed that long short-term memory (LSTM) based architectures can solve general sequence-to-sequence problems thanks to their ability to handle long-range dependencies within sequences. [Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent. Sci. 2017, 3, 1103-1113] proposed the first multilayer LSTM-based sequence-to-sequence model for retrosynthetic prediction. A gated recurrent unit (GRU) variant was proposed by Nam and Kim for forward reaction prediction ([Nam, J.; Kim, J. Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions. 2016, 1-19]).

Recently, state-of-the-art neural machine translation (NMT) models have improved their performance on long sentences by including an attention mechanism as part of the neural network architecture. Retrosynthetic prediction has also been implemented on the Transformer architecture, which is based solely on attention mechanisms. Encoder-decoder models, especially since the introduction of the attention mechanism, have adopted similar strategies for handling the essence of the translation task. The simplified molecular input line entry system (SMILES) representation of molecular structure is the typical input for sequence-to-sequence models. However, previous studies have not presented a model that focuses on translation at the substructure (fragment) level.

Embodiments of the present invention provide a template-free approach that predicts retrosynthetic reactions by learning chemical changes at the substructure level. The embodiments use molecular access system (MACCS) keys to represent a molecule as a sentence built from a set of substructures, each of which corresponds to a word. The embodiments also provide a unique tokenization scheme that properly removes the problems arising from SMILES-based tokenization. The embodiments consist of bidirectional LSTM cells, are fully data-driven with no prior reaction classification information, and are trained end to end. The embodiments discuss every aspect of the methodology in detail, including the dataset and descriptor curation steps. Evaluation results based on three datasets derived from the USPTO reaction dataset are presented.

The embodiments also provide a novel tokenization method, together with the analysis and curation of the dataset and the descriptors. The embodiments of the present invention briefly describe the model architecture and the evaluation procedure used to compute accuracy, explain the results of a set of translation experiments with emphasis on the advantages of the MACCS key-based molecular representation, and discuss the advantages and challenges of the present approach.

Various embodiments of the present invention aim to provide a new type of retrosynthesis method. Specifically, they aim to provide a template-free retrosynthesis method that applies machine translation to retrosynthesis and therefore requires no templates.

The embodiments also aim to provide data preprocessing methods that express molecules as characters for retrosynthesis by neural machine translation.

The embodiments further aim to provide a data preprocessing method for machine translation at the level of molecular substructures.

In addition, the embodiments provide a unique tokenization scheme that properly removes the problems arising from SMILES-based tokenization.

One embodiment of the present invention provides a method for predicting retrosynthesis using neural machine translation, implementable in a retrosynthesis system that includes a communication unit, a control unit, and a memory unit. The method comprises: generating initial product-reactant pairs, which includes receiving a reaction dataset of products and reactants from an external server through the communication unit, expressing the received reaction dataset as descriptors composed of characters, and storing the initial product-reactant pairs expressed as descriptors in the memory unit; filtering the initial product-reactant pairs by applying one or more filters; reducing the filtered product-reactant pairs to a one-to-one mapping; ordering the one-to-one mapped product-reactant pairs to generate final product-reactant pairs; training a neural machine translation (NMT) model using the final product-reactant pairs; evaluating the trained NMT model; and, when a new product is input to the NMT model, predicting and outputting candidate reactants corresponding to the new product.

The step of expressing the received reaction dataset as descriptors composed of characters may include: preparing molecular access system (MACCS) keys that represent molecules; assigning a predetermined character to each MACCS key to generate character descriptors for the MACCS keys; expressing the received reaction dataset in terms of the MACCS keys; and converting the reaction dataset expressed in MACCS keys into the character descriptors.
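The conversion from MACCS keys to a character descriptor can be sketched as follows. This is only an illustration under stated assumptions: the token format (`k` followed by the key index) and the function names `build_key_alphabet` and `encode_molecule` are hypothetical, and the fingerprint is assumed to arrive as the set of indices of the MACCS keys that are present in the molecule. The characters actually assigned in the embodiment are those shown in FIG. 3.

```python
from typing import Dict, Iterable, List


def build_key_alphabet(key_ids: Iterable[int]) -> Dict[int, str]:
    """Assign one token to each MACCS key index (hypothetical 'k<index>' tokens)."""
    return {key: f"k{key}" for key in sorted(set(key_ids))}


def encode_molecule(on_bits: Iterable[int], alphabet: Dict[int, str]) -> List[str]:
    """Turn the set bits of a MACCS fingerprint into a 'sentence' of tokens.

    Bits that have no assigned token (for example, curated-out keys) are skipped.
    """
    return [alphabet[bit] for bit in sorted(on_bits) if bit in alphabet]


# Usage: three keys have tokens; key 7 does not and is dropped.
alphabet = build_key_alphabet([160, 164, 165])
sentence = encode_molecule({165, 160, 7}, alphabet)
```

Sorting the bits gives every molecule a deterministic token order, which is one simple way to make the resulting sentences comparable across molecules.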

The step of preparing the MACCS keys may include generating curated MACCS keys by removing those MACCS keys whose frequency of occurrence is at or below a predetermined value.
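A minimal sketch of this frequency-based curation, assuming the frequency of each key has already been computed as a number per key index. The function name `curate_keys` and the threshold value in the usage lines are hypothetical; the text only says that keys at or below a predetermined frequency are removed.

```python
from typing import Dict, List


def curate_keys(frequency: Dict[int, float], min_frequency: float) -> List[int]:
    """Keep only the keys whose frequency is strictly above the threshold.

    Keys with frequency <= min_frequency are removed, matching
    'at or below a predetermined value'.
    """
    return sorted(key for key, freq in frequency.items() if freq > min_frequency)


# Usage with made-up frequencies: keys 1 and 2 are removed, key 3 is kept.
curated = curate_keys({1: 0.0, 2: 0.001, 3: 0.4}, min_frequency=0.001)
```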

In the retrosynthetic prediction method, the step of generating character descriptors for the MACCS keys may assign the predetermined character to each of the curated MACCS keys.

The step of filtering the initial product-reactant pairs by applying one or more filters may use one or more of the following: a filter that removes pairs having three or more reactants, a filter that removes identical product-reactant pairs, a filter that removes internal twins, and a filter that removes pairs having a sequence length of 100 or more.
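Three of the four filters can be sketched as below. Several points are assumptions made for illustration: each molecule is a space-separated token sentence, the "sequence length" of a pair is taken to be the total token count of the product plus its reactants, and the names `Pair`, `pair_length`, and `apply_filters` are hypothetical. The internal-twin filter is left out because this passage does not define internal twins.

```python
from typing import List, Tuple

# (product sentence, list of reactant sentences); tokens are space-separated.
Pair = Tuple[str, List[str]]


def pair_length(pair: Pair) -> int:
    """Total number of tokens in the product and all reactants (assumed definition)."""
    product, reactants = pair
    return len(product.split()) + sum(len(r.split()) for r in reactants)


def apply_filters(pairs: List[Pair], max_length: int = 100) -> List[Pair]:
    """Drop pairs with >= 3 reactants, exact duplicates, and pairs of length >= max_length."""
    seen = set()
    kept = []
    for product, reactants in pairs:
        if len(reactants) >= 3:
            continue  # three or more reactants
        key = (product, tuple(sorted(reactants)))
        if key in seen:
            continue  # identical product-reactant pair already kept
        if pair_length((product, reactants)) >= max_length:
            continue  # sequence too long
        seen.add(key)
        kept.append((product, reactants))
    return kept


# Usage: only the first pair survives.
raw = [
    ("a b", ["c"]),
    ("a b", ["c"]),                       # duplicate
    ("a", ["b", "c", "d"]),               # three reactants
    ("a", [" ".join(["x"] * 100)]),       # 101 tokens in total
]
filtered = apply_filters(raw)
```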

In the retrosynthetic prediction method, the step of reducing the filtered product-reactant pairs to a one-to-one mapping may identify one-to-many mappings among the filtered product-reactant pairs and, for each one-to-many mapping, select the molecule with the shortest sequence length so that the mapping is reduced to one-to-one.
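One possible reading of this reduction is sketched below. It assumes that a one-to-many mapping means a single product sentence paired with several alternative reactant entries, and that sequence length is the token count of the reactant side; both are interpretations, and the name `reduce_to_one_to_one` is hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, List[str]]


def reduce_to_one_to_one(pairs: List[Pair]) -> List[Pair]:
    """For each product, keep the reactant entry with the fewest tokens.

    Ties keep the entry seen first; products keep their first-seen order.
    """
    best: Dict[str, Tuple[int, List[str]]] = {}
    for product, reactants in pairs:
        length = sum(len(r.split()) for r in reactants)
        if product not in best or length < best[product][0]:
            best[product] = (length, reactants)
    return [(product, reactants) for product, (_, reactants) in best.items()]


# Usage: product "p" maps to two candidates; the shorter one is kept.
one_to_one = reduce_to_one_to_one([("p", ["a b c"]), ("p", ["a b"]), ("q", ["z"])])
```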

In the retrosynthetic prediction method, the step of ordering the one-to-one mapped product-reactant pairs to generate the final product-reactant pairs may, when a pair has two or more reactants, sort the reactants in descending order of length and join them with a symbol.
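The ordering step can be illustrated with a short sketch. The separator symbol `-`, the use of token count as the length, and the name `build_target_sentence` are assumptions; the text only states that the reactants are sorted by length in descending order and joined with a symbol.

```python
from typing import List


def build_target_sentence(reactants: List[str], separator: str = "-") -> str:
    """Sort reactant sentences by token count (longest first) and join them."""
    ordered = sorted(reactants, key=lambda r: len(r.split()), reverse=True)
    return f" {separator} ".join(ordered)


# Usage: the three-token reactant is placed before the two-token one.
target = build_target_sentence(["a b", "c d e"])
```

A fixed ordering gives the model a single canonical target for each reaction, so the same reactant set is never presented in two different orders.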

In the retrosynthetic prediction method, the neural machine translation model may be a sequence-to-sequence model.

The neural machine translation model may include two bidirectional long short-term memory (LSTM) networks, one of which functions as the encoder and the other as the decoder.

The encoder and the decoder may be connected through an attention mechanism.

In the retrosynthetic prediction method, each bidirectional LSTM may include a hidden layer and may further include a dropout layer after the hidden layer.

The neural machine translation model may further use gradient clipping so that the norm of the gradient does not exceed a threshold during backpropagation.
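Gradient clipping by norm can be written in a few lines. The sketch below treats the gradient as a flat list of floats and uses the Euclidean norm; the function name `clip_by_norm` and the threshold in the usage lines are hypothetical, and the patent text does not state which norm or threshold value is used.

```python
import math
from typing import List


def clip_by_norm(grads: List[float], threshold: float) -> List[float]:
    """Rescale the gradient so that its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold or norm == 0.0:
        return list(grads)  # already within the limit
    scale = threshold / norm
    return [g * scale for g in grads]


# Usage: a gradient of norm 5 is rescaled to norm 1; its direction is unchanged.
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
```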

The learning rate of the neural machine translation model may be reduced by a constant factor every three epochs.
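A step decay of this kind can be expressed as a simple function of the epoch number. The initial learning rate and the decay factor below are placeholder values chosen for illustration, and epochs are assumed to be counted from zero; only the three-epoch interval comes from the text.

```python
def learning_rate(epoch: int, initial_lr: float = 1.0,
                  decay: float = 0.5, step: int = 3) -> float:
    """Learning rate after multiplying by `decay` once per `step` completed epochs."""
    return initial_lr * decay ** (epoch // step)


# Usage: epochs 0-2 use the initial rate, epochs 3-5 half of it, and so on.
schedule = [learning_rate(e) for e in range(7)]
```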

In the retrosynthetic prediction method, the step of evaluating the trained NMT model may assess prediction accuracy by computing the similarity between the candidate reactants predicted by the NMT model and the actual reactants obtained from the final product-reactant pairs.
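This passage does not name the similarity measure. As one plausible choice, the sketch below uses the Tanimoto (Jaccard) coefficient over the sets of substructure tokens in the predicted and actual reactant sentences; the choice of metric and the name `tanimoto` are assumptions made for illustration.

```python
from typing import Iterable


def tanimoto(predicted: Iterable[str], actual: Iterable[str]) -> float:
    """Tanimoto coefficient between two token sets: |A & B| / |A | B|."""
    a, b = set(predicted), set(actual)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Usage: two of the four distinct tokens are shared.
score = tanimoto("a b c".split(), "b c d".split())
```

Under this reading, a prediction counts as exact when the score is 1.0, and lower thresholds can be used to count "highly similar" predictions.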

The step of predicting and outputting the candidate reactants may further include predicting and outputting secondary candidate reactants for synthesizing the candidate reactants.

Embodiments of the present invention are fully data-driven and trained end to end, and thereby provide a template-free retrosynthesis method.

Embodiments of the present invention also verify the similarity between machine translation of language and substructure-based translation of molecules and, on that basis, can provide a retrosynthesis method that uses machine translation.

FIG. 1 is a diagram schematically illustrating an embodiment of a retrosynthesis system.
FIG. 2 is a diagram illustrating the frequency distribution of MACCS keys according to an embodiment.
FIG. 3 is a diagram illustrating an example of the characters assigned to MACCS keys.
FIG. 4 is a diagram illustrating an example distribution of pair lengths.
FIG. 5 is a diagram illustrating a process for generating product-reactant pairs according to an embodiment.
FIG. 6 is a diagram illustrating the relationship between character keys and other keys according to an embodiment.
FIG. 7 is a diagram illustrating the training progress of a retrosynthetic prediction model according to an embodiment.
FIG. 8 shows randomly selected predictions according to an embodiment.
FIG. 9 shows selected predictions according to another embodiment.
FIG. 10 shows selected predictions according to another embodiment.
FIG. 11 shows an example of candidates for the first reactant of reaction 4 of FIG. 8.

The present invention will become clear with reference to the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to a person of ordinary skill in the art to which the invention pertains, and the present invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used herein, "comprises" or "comprising" does not exclude the presence or addition of one or more elements or steps other than those recited. Terms such as first and second may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another.

Hereinafter, various embodiments of the present invention are described by way of example with reference to the drawings.

FIG. 1 is a block diagram showing an example of the configuration of a retrosynthesis system according to an embodiment and conceptually shows the parts relevant to the embodiment. All of the components may be provided in a single device that performs the processing on its own, but the invention is not limited to this; the components may also reside in separate devices connected through a network.

The external server 20 may be connected to the retrosynthesis system 10 through a network and may provide information such as protein structure information, ligand structure information, protein-ligand complex structure information, binding affinity information for protein-ligand complex binding sites, genetic information, intermolecular interaction information, and/or protein structure similarity information. The external server 20 may also provide datasets of molecules written in SMILES notation or as MACCS keys. Alternatively, for example, the external server 20 may be a molecular database provided by the USPTO or a preprocessed dataset derived from it. The external server 20 may be, for example, a database used for the prediction processing of the retrosynthesis system 10, or a server that provides such a database.

The retrosynthesis system 10 may include a control unit 11, a communication unit 12, an input/output interface unit 13, and a memory unit 14.

The control unit 11 controls the retrosynthesis system 10 as a whole and may include, for example, a processing unit such as a CPU or a GPU. The control unit 11 may train the models described below using the information stored in the memory unit 14, and may also use a trained model to compute predictions for new inputs. Specifically, the control unit 11 may control a neural machine translation (NMT) model. To this end, the control unit 11 may include an internal memory for storing a control program such as an operating system (OS), programs that define various processing procedures, and data. The control unit 11 can then carry out the information processing needed to execute various operations by means of these programs.

The communication unit 12 may include an interface that can be connected to a communication device, such as a router, that is connected to a communication line, and may control communication between the retrosynthesis system 10 and the external server 20.

The input/output interface unit 13 may be an interface connected to an input unit 15 and a display unit 16. The user can interact with the retrosynthesis system 10 through the input/output interface unit 13. For example, the display unit 16 may be a display means that shows the screen of an application or the like (for example, a liquid crystal or organic EL display, a monitor, or a touch panel). The input unit 15 may be, for example, a key input unit, a touch panel, a control pad (for example, a touch pad or a game pad), a mouse, a keyboard, or a microphone.

The memory unit 14 may be a device that stores various databases, tables, and the like. For example, the memory unit may hold the filtered US patent reaction dataset, datasets of molecules written in SMILES notation or as MACCS keys, curation information for the descriptors that represent molecules, the various filters, protein structure information, ligand structure information, protein-ligand complex structure information, binding affinity information for protein-ligand complex binding sites, genetic information, intermolecular interaction information, and/or protein structure similarity information.

Dataset

The embodiment uses a filtered US patent reaction dataset obtained by a text-mining approach. [Schwaller, P.; Gaudin, T.; Lanyi, D.; Bekas, C.; Laino, T. "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci. 2018, 9, 6091-6098] removed duplicate reaction strings from the dataset without atom mapping. They also removed 780 reactions that could not be canonicalized with RDKit. An inherent limitation of the data is that most entries are single-product reactions. The embodiment therefore considers only single-product reactions, which account for 92% of the dataset.

SMILES line notation represents a molecular structure as a linear sequence of letters, digits, and symbols. From a linguistic point of view, SMILES can therefore be regarded as a language with a grammatical specification. In the embodiment, however, a molecule is represented as a set of fragments using MACCS keys, which consist of 166 predefined substructures. This binary, bit-based molecular descriptor converts a molecule into a 166-bit vector in which each bit is associated with a feature drawn from a predefined dictionary of SMARTS (SMILES arbitrary target specification) patterns.

Descriptor curation

In the embodiment, a molecule may be represented as a set of fragments using MACCS keys. The number of occurrences of each MACCS key in the embodiment's dataset was examined. The result was then compared with one million randomly selected drug-like small molecules, a subset of the Generated Data Base-13 (GDB-13), which consists of 975 million molecules. FIG. 2 shows the normalized frequency distribution of the MACCS keys in the two databases. The direct pairwise comparison in FIG. 2 allows the number of MACCS keys to be reduced. In the embodiment, 5 keys that never occurred and 9 keys that occurred only rarely in the USPTO database were omitted. Based on the comparative analysis, a further 26 keys that were absent or nearly absent from the GDB-13 database were also excluded.

Because the embodiment aims to develop a model for analyzing drug-like molecules, not all MACCS keys are needed. Removing unnecessary keys on the basis of the occurrence analysis has clear advantages: it shortens the source and target sentences and improves the rank distribution of the keys used in the translation task. In the embodiment, every molecule is represented with 126 MACCS keys (the 166 original keys minus the 5, 9, and 26 keys removed above), which adequately represent 98% of the randomly sampled GDB-13 subset. In the machine translation tasks that chemists deal with, the source and target molecules are placeholders that correspond interchangeably to reactants and products; the choice depends on the intended analysis. For the retrosynthetic prediction task, the source and target sentences denote the product and the reactants, respectively.
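The occurrence analysis described in this section can be sketched as follows. This is a minimal illustration, not the embodiment's code: it assumes each molecule is already available as a set of nonzero MACCS key indices, and the cutoff `min_fraction` is a hypothetical parameter, because the text does not state how rare a key must be to count as infrequent.

```python
from collections import Counter

def curate_keys(molecules, all_keys, min_fraction=0.0):
    """Return the sorted keys kept after the occurrence analysis.

    molecules    : iterable of sets of nonzero key indices (one set per molecule)
    all_keys     : iterable of candidate key indices, e.g. range(1, 167)
    min_fraction : hypothetical cutoff; a key is kept only if it occurs in
                   more than this fraction of the molecules
    """
    molecules = list(molecules)
    counts = Counter(key for mol in molecules for key in set(mol))
    total = len(molecules)
    kept = [k for k in all_keys if total and counts[k] / total > min_fraction]
    return sorted(kept)

# Toy data: key 3 never occurs, key 4 occurs in only one of four molecules.
toy = [{1, 2}, {1, 2, 4}, {1}, {2}]
print(curate_keys(toy, range(1, 5)))                    # drops only the unused key 3
print(curate_keys(toy, range(1, 5), min_fraction=0.3))  # also drops the rare key 4

# Bookkeeping from the text: 166 keys minus the 5, 9 and 26 removed keys.
assert 166 - 5 - 9 - 26 == 126
```

In the embodiment the same kind of count is taken on two databases (the USPTO reactions and the GDB-13 sample), and a key is dropped if it is absent or rare in either one.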

Reaction preprocessing

The embodiment considers only the nonzero indices of the curated MACCS keys. To form unique artificial "words", English letters were assigned to the nonzero MACCS keys according to their rank in the frequency distribution of the keys. In other words, the product and reactant sentences of the embodiment are built from a vocabulary of only 126 words, and each sentence is an ordered list of independent fragments. Single-letter words were generated from the upper and lower case forms of the 21 most frequent letters of English. Double-letter words were formed by appending "x" or "z" to each of the 42 single letters, which gives coverage of all 126 MACCS keys. The lettered fragment vocabulary of the embodiment thus has a fixed size of 126. FIG. 3 is an exemplary diagram showing the letters assigned to the MACCS keys. For example, "mz" is assigned to MACCS key 58 and "G" is assigned to MACCS key 110. The assignment table of FIG. 3 is an example given to explain one embodiment, and the present invention is not limited to it; modifications, additions, deletions, and the like that a person skilled in the art could readily derive are also within the technical idea of the present invention. Moreover, although the embodiment represents molecules by assigning letters on the basis of MACCS keys, this is only one embodiment, and molecules may also be represented by assigning letters to other representations.
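The vocabulary construction described above (42 single letters plus 84 two-letter words) can be sketched as follows. The letter list is an assumption: the text only says "the 21 most frequent letters of English", so a commonly quoted frequency ordering is used here. Which word goes with which MACCS key is defined by FIG. 3 and is not reproduced.

```python
# Assumed set of the 21 most frequent English letters; the source does not
# list them. Note that "x" and "z" are not among them.
TOP21_LETTERS = "etaoinshrdlcumwfgypbv"

def build_fragment_vocabulary(letters=TOP21_LETTERS):
    """Return the 126 artificial words used to spell MACCS keys."""
    # 21 lower-case + 21 upper-case single-letter words.
    singles = list(letters.lower()) + list(letters.upper())
    # Each single letter followed by "x" or "z" gives 2 * 42 = 84 more words.
    doubles = [word + suffix for suffix in ("x", "z") for word in singles]
    return singles + doubles

vocab = build_fragment_vocabulary()
print(len(vocab), len(set(vocab)))
```

Because "x" and "z" never appear as single-letter words under this assumption, every two-letter word is distinct from every single-letter word, so the 126 entries are unique.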

Table 1 below illustrates the process of generating a product-reactant pair. The same procedure was applied to every reaction in the dataset. Table 1 shows the data preparation steps used to obtain product and reactant sentences for the retrosynthetic prediction task.

| Step | Example (entry # 20032, Lowe's USPTO data set) |
| --- | --- |
| Reaction | (scheme not shown) |
| Reaction SMILES (standardized), written as reactants > reagents · solvents > products | C=CC.CC(=O)O>CC(=O)OC(C)=O.O=O>CC(=O)OCC(C)OC(C)=O |
| Removal of reagents and solvents | [['C=CC','CC(=O)O'],'CC(=O)OCC(C)OC(C)=O'] |
| MACCS domain (nonzero keys) | [[34, 99, 160],[123, 139, 154, 157, 159, 160, 164]], [72, 108, 109, 115, 116, 123, 126, 132, 136, 140, 141, 146, 149, 152, 153, 154, 155, 157, 159, 160, 164] |
| Character assignment (frequency basis) | [['Uz','dz','s'],['Y','Ex','c','r','h','s','t']], ['iz','Fx','fx','ez','Gx','Y','E','hx','ux','Hx','Ix','O','H','v','u','c','w','r','h','s','t'] |
| Product | iz Fx fx ez Gx Y E hx ux Hx Ix O H v u c w r h s t |
| Reactant(s) | Y Ex c r h s t - Uz dz s |

Reaction dataset curation

The product-reactant pair dataset undergoes further curation before it is processed by the translation model of the embodiment. After all molecules are represented with the 126 curated MACCS keys, several filters are applied to remove identical product-reactant pairs and inner twins. An inner twin is a data entry whose product and reactant sentences are identical. Inner twins arise when the chemical change lies beyond the sensitivity of the MACCS key-based representation of the embodiment, so that the product and the reactants map to the same keys. Because the embodiment maps molecules to MACCS keys in order to work in the space of substructures, some information is lost. The preprocessing procedure of the embodiment produced 5,748 inner twins, which were removed from the dataset. In addition, reactions with three or more reactants were excluded. To avoid overly long fragment sequences, the maximum pair length was set to 100, as shown in FIG. 4.

The product-reactant pairs are then fed to an injective (one-to-one) map generator to guarantee a one-to-one correspondence between product and reactant sentences. If a reactant sentence consists of two reactants, they are sorted in descending order of sequence length and separated by the "-" symbol. The curated dataset contains a total of 352,546 product-reactant pairs and is further divided into two disjoint subsets according to the number of reactant molecules in each pair: a single-reactant dataset and a double-reactant dataset. Organizing the data in this way is important for evaluating model performance on each case independently. FIG. 5 describes the dataset curation steps together with the dataset sizes.
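The ordering and separator rule for reactant sentences can be illustrated with the entry from Table 1. This is a sketch under two assumptions: "sequence length" is taken to mean the number of words, and ties keep their input order. The key-to-word dictionary contains only the assignments visible in Table 1; the full table is given in FIG. 3.

```python
# Key-to-word assignments copied from the Table 1 example only.
KEY_TO_WORD = {34: "Uz", 99: "dz", 123: "Y", 139: "Ex", 154: "c",
               157: "r", 159: "h", 160: "s", 164: "t"}

def encode_molecule(nonzero_keys, key_to_word=KEY_TO_WORD):
    """Map the nonzero MACCS keys of one molecule to its list of words."""
    return [key_to_word[k] for k in nonzero_keys]

def build_reactant_sentence(reactants, separator="-"):
    """Join encoded reactants, longest first, with the separator symbol."""
    ordered = sorted(reactants, key=len, reverse=True)  # stable for ties
    return f" {separator} ".join(" ".join(words) for words in ordered)

propene = encode_molecule([34, 99, 160])
acetic_acid = encode_molecule([123, 139, 154, 157, 159, 160, 164])
print(build_reactant_sentence([propene, acetic_acid]))
# Y Ex c r h s t - Uz dz s
```

The printed sentence matches the "Reactant(s)" row of Table 1, where the seven-word reactant precedes the three-word one.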

FIG. 5 schematically illustrates the process of deriving the final product-reactant pairs as described above. It shows the dataset curation process used to obtain the product-reactant pairs for training and testing.

For example, in the reaction preprocessing step, the 1,002,970 molecules received from the USPTO dataset (N = 1,002,970) are expressed with the 126 MACCS keys, which yields the initial product-reactant pairs. In the reaction dataset curation step, the pairs pass through several filters. The embodiment applied, for example, four filters: reactions with three or more reactants were removed (N = 922,823), duplicate pairs were removed (N = 786,219), inner twins were removed (N = 780,471), and pairs with a length of 100 or more were removed (N = 429,889). Filters other than those disclosed may also be applied. The filtering of this embodiment is only an example of possible filtering and is not limiting; additional filters may be applied, or only some of the disclosed filters may be applied.
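The four filters can be sketched as a single pass over the pairs. This is an illustrative sketch, not the embodiment's implementation. It assumes that each pair is a tuple of a product sentence and a reactant sentence, that reactants are separated by " - ", and that "pair length" means the total word count of both sentences without separators. Whether a pair of exactly 100 words is kept is not clear from the text; the sketch drops it, following the "100 or more" wording.

```python
def curate_pairs(pairs, max_reactants=2, max_pair_length=100):
    """Apply the four curation filters to (product, reactants) sentence pairs."""
    seen = set()
    kept = []
    for product, reactants in pairs:
        # Filter 1: drop reactions with three or more reactants.
        if len(reactants.split(" - ")) > max_reactants:
            continue
        # Filter 2: drop duplicate product-reactant pairs.
        if (product, reactants) in seen:
            continue
        seen.add((product, reactants))
        # Filter 3: drop inner twins (identical product and reactant sentences).
        if product == reactants:
            continue
        # Filter 4: drop overly long pairs (assumed: total word count).
        n_words = len(product.split()) + sum(1 for w in reactants.split() if w != "-")
        if n_words >= max_pair_length:
            continue
        kept.append((product, reactants))
    return kept

toy_pairs = [
    ("a b", "a - b - c"),                          # three reactants
    ("a b c", "a b"),                              # kept
    ("a b c", "a b"),                              # duplicate
    ("a b", "a b"),                                # inner twin
    (" ".join(["a"] * 60), " ".join(["b"] * 60)),  # 120 words, too long
    ("Y E", "Y - E"),                              # kept
]
print(curate_pairs(toy_pairs))
```

The counts reported in the text are consistent with the order of these filters: 786,219 - 780,471 = 5,748, which is the number of inner twins given in the curation description.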

The filtered product-reactant pairs are fed to the one-to-one map generator (N = 352,546) and sorted to produce the final product-reactant pairs. These pairs are then input to the model for training and testing.

Model architecture

The sequence-to-sequence neural network of the embodiment comprises two bidirectional LSTMs, one for the encoder and one for the decoder. A unidirectional LSTM was also used in order to quantify the improvement in model performance obtained with the bidirectional LSTM. The embodiment connects the encoder and the decoder through Luong's global attention mechanism so that non-local relationships among all elements of the source sequence can be captured [Luong, M. T.; Pham, H.; Manning, C. D. Effective approaches to attention-based neural machine translation. Conf. Proc. - EMNLP 2015 Conf. Empir. Methods Nat. Lang. Process. 2015, 1412-1421]. The attention mechanism allows the neural network to focus on different parts of the source sentence and to take nonlinear relationships between words into account during training. The global attention mechanism applied in the embodiment is essentially similar to the original attention mechanism proposed for machine translation [Bahdanau, D.; Cho, K. H.; Bengio, Y. Neural machine translation by jointly learning to align and translate. 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc. 2015, 1-15]. The global approach attends to all words of the source sentence in order to compute a global context vector for each target word at each time step of the decoder. The global context vector therefore represents a weighted sum over all source hidden states. This context information can improve prediction accuracy.
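The "weighted sum over all source hidden states" can be made concrete with a small, dependency-free sketch of one attention step. The dot-product score used here is one of the score functions described by Luong et al.; the text does not say which score function the embodiment uses, so that choice is an assumption, as are the function and argument names.

```python
import math

def global_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder time step.

    decoder_state  : list of floats, the current target hidden state
    encoder_states : list of lists of floats, all source hidden states
    """
    # Alignment scores: dot product of the target state with each source state
    # (assumed score function).
    scores = [sum(a * b for a, b in zip(decoder_state, h_s)) for h_s in encoder_states]
    # Softmax over source positions, shifted by the maximum for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Global context vector: weighted sum of all source hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h_s[i] for w, h_s in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

weights, context = global_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(weights, context)
```

Every source position receives a nonzero weight, which is what distinguishes the global variant from local attention over a window of source positions.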

Training details

The curated datasets of the embodiment are randomly split into training and test sets in a 9:1 ratio. The validation set is randomly sampled from the training set (10%). Word embeddings are used to represent the lettered fragments in the vocabulary. When the embedding layer is created, a learnable tensor of fixed-length, 126-dimensional dense vectors is randomly initialized. The embedding class method then retrieves the embedding of each word through a lookup on this tensor. The embodiment trained all parameters of the encoder-decoder model with the stochastic gradient descent algorithm, using the cross-entropy function as the loss function.
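The 9:1 split with a validation sample drawn from the training portion can be sketched as below. The function name, the seed, and the decision to leave the sampled validation pairs inside the training list are assumptions; the text does not say whether validation pairs were held out of training.

```python
import random

def split_dataset(pairs, test_fraction=0.1, val_fraction=0.1, seed=0):
    """Shuffle and split pairs into (train, validation, test) lists."""
    pairs = list(pairs)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    test, train = pairs[:n_test], pairs[n_test:]
    # Validation pairs are sampled from the training portion (10%).
    validation = rng.sample(train, int(len(train) * val_fraction))
    return train, validation, test

# Integers stand in for the 352,546 curated product-reactant pairs.
train, validation, test = split_dataset(range(352_546))
print(len(train), len(test))
```

With 352,546 pairs this yields 317,292 training pairs and 35,254 test pairs, which agrees with the totals reported in Table 4.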

For each dataset, the embodiment ran a series of tests within the hyperparameter space described in Table 2 in order to obtain the best performance. Table 2 lists the hyperparameter space and the hyperparameters of the best model.

| Parameter | Possible values | Best model parameters |
| --- | --- | --- |
| RNN cell type | LSTM or Bi-LSTM | Bi-LSTM (encoder and decoder) |
| Number of layers | 2, 4, or 6 | 2 |
| Number of units | 500, 1000, 2000 | 2000 |
| Learning rate | 0.1 - 8 | 4 |
| Decay factor | 0.50 - 0.90 | 0.85 |
| Dropout | 0.1 - 0.5 | 0.1 |
| Attention type | Luong's global attention mechanism | Luong's global attention mechanism |

Based on preliminary experiments, the embodiment builds the encoder and the decoder from two bidirectional LSTM layers each, every layer having 2000 hidden units. To reduce susceptibility to overfitting, a dropout layer with a dropout rate of 0.1 follows the hidden layers. To avoid the potential exploding gradient problem, the embodiment uses gradient clipping so that the norm of the gradient does not exceed a threshold (0.25) during backpropagation. The initial learning rate was set to 4.0 and was decayed by a factor of 0.85 every three epochs.
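The learning-rate schedule and the gradient clipping rule can be written out in a few lines. This is a sketch in plain Python rather than the embodiment's PyTorch code. It assumes 0-based epochs, a step decay applied once per three-epoch block, and clipping by the L2 norm of a flattened gradient vector.

```python
import math

def learning_rate(epoch, initial=4.0, decay=0.85, step=3):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return initial * decay ** (epoch // step)

def clip_by_norm(gradient, max_norm=0.25):
    """Rescale a flat gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

for epoch in (0, 3, 6, 29):
    print(epoch, round(learning_rate(epoch), 4))
print(clip_by_norm([3.0, 4.0]))  # norm 5.0 is rescaled to norm 0.25
```

In PyTorch, comparable behaviour is available through `torch.optim.lr_scheduler.StepLR` and `torch.nn.utils.clip_grad_norm_`; whether the embodiment used these particular utilities is not stated.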

With these hyperparameter settings and a batch size of 64 on a single NVIDIA RTX 2080Ti GPU card, the average training speed was about 3,300 words per second. Larger batch sizes were not tested because of memory limitations, and the same applies to the hidden layer size. The embodiment trained the model for at least 30 epochs; each epoch took about 2 hours on the curated dataset of about 320,000 sentence pairs.

The embodiment was implemented with PyTorch version 1.3.0 and Python version 3.6.8. The open-source RDKit module, version 2020.03.1, was used to obtain the MACCS keys and the similarity maps.

Evaluation procedure

To evaluate the performance of the retrosynthesis model of the embodiment, the Tanimoto coefficient, known as one of the best metrics for computing structural similarity, was chosen as the similarity metric [Bajusz, D.; Racz, A.; Heberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminformatics 2015, 7, 1-13]. The pairwise similarity between the predicted sequence and the ground truth was computed for all test molecules. The Tanimoto coefficient (T_c) between two chemical structures takes a value between 0 and 1. It is 0 when the molecules share no common fragment and 1 when they are identical. These are the two extremes of the Tanimoto similarity metric; there is no single criterion that separates similar from dissimilar molecules. The embodiment defined three thresholds (0.50, 0.70, and 0.85) to assess the quality of the translation experiments. At the end of each epoch, the similarity between the predicted sequence and the ground truth is computed with the Tanimoto similarity measure (Equation 1) for every pair in the validation set.

[Equation 1]
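The body of Equation 1 is not reproduced in this text. For orientation, the standard definition of the Tanimoto coefficient for two fragment sets is T_c = c / (a + b - c), where a and b are the numbers of fragments in each molecule and c is the number of fragments they share. The sketch below implements that standard form on lists of fragment words. It is assumed, not confirmed, that Equation 1 has this form, and the value returned for two empty inputs is a convention chosen here.

```python
def tanimoto(words_a, words_b):
    """Tanimoto similarity between two collections of fragment words."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    common = len(a & b)
    return common / (len(a) + len(b) - common)

# The two reactant sentences of Table 1 share only the word "s".
propene = ["Uz", "dz", "s"]
acetic_acid = ["Y", "Ex", "c", "r", "h", "s", "t"]
print(tanimoto(propene, acetic_acid))  # 1 shared word out of 3 + 7 - 1 = 9
print(tanimoto(propene, propene))
```

Because the sentences are lists of distinct key words, comparing them as sets is equivalent to comparing the underlying MACCS bit vectors.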

Since all reactions are included in the combined dataset, the embodiment produces predictions with either one or two reactants. There are therefore several ways to compare a predicted sequence with the ground truth. The potential pairs for evaluation, depending on the number of reactants, are listed in Table 3. The Tanimoto similarity was computed for every possible pair of predicted sequence and ground truth. The pair(s) with the highest similarity were then selected, on the assumption that the more similar two structures are, the more likely they are to match. Table 3 lists the possible pairs between predicted sequences and ground truth; the similarity of each pair was computed with Equation 1.

| Ground truth | Prediction | List of possible pairs |
| --- | --- | --- |
| P -> R_A + R_B | P -> P_A + P_B | [(R_A P_A - R_B P_B), (R_A P_B - R_B P_A)] |
| P -> R_A + R_B | P -> P_C | [R_A P_C, R_B P_C] |
| P -> R_C | P -> P_A + P_B | [R_C P_A, R_C P_B] |
| P -> R_C | P -> P_C | [R_C P_C] |
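The pair selection of Table 3 can be sketched as follows. The text says only that the pair(s) with the highest similarity are selected; two details are therefore assumptions of this sketch: in the two-versus-two case the pairing with the larger summed similarity is chosen, and in every other case the single best pair is returned. The similarity function is the standard Tanimoto form on word sets.

```python
def tanimoto(words_a, words_b):
    """Standard Tanimoto similarity on sets of fragment words."""
    a, b = set(words_a), set(words_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 1.0

def best_match(truth, prediction):
    """Match ground-truth reactants with predicted reactants.

    truth, prediction : lists of one or two reactants, each a list of words.
    Returns a list of (truth_index, prediction_index, similarity) tuples.
    """
    sim = [[tanimoto(t, p) for p in prediction] for t in truth]
    if len(truth) == 2 and len(prediction) == 2:
        straight = [(0, 0), (1, 1)]  # R_A with P_A, R_B with P_B
        crossed = [(0, 1), (1, 0)]   # R_A with P_B, R_B with P_A
        # Assumed rule: keep the pairing with the larger summed similarity.
        chosen = max((straight, crossed),
                     key=lambda pairing: sum(sim[i][j] for i, j in pairing))
    else:
        candidates = [(i, j) for i in range(len(truth))
                      for j in range(len(prediction))]
        chosen = [max(candidates, key=lambda ij: sim[ij[0]][ij[1]])]
    return [(i, j, sim[i][j]) for i, j in chosen]

truth = [["Y", "Ex", "c"], ["Uz", "dz", "s"]]
predicted = [["Uz", "dz", "s"], ["Y", "Ex", "c"]]
print(best_match(truth, predicted))  # the crossed pairing wins
```

Sorting reactants by length during preprocessing makes the straight pairing the common case, but checking both pairings keeps the evaluation independent of the order in which the model emits the reactants.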

Results and discussion

Prediction accuracy

The performance of the embodiment was evaluated on three datasets: single reactant, double reactant, and the combined test set. The evaluation results for the test sets are given in Table 4. The prediction quality for each test dataset is expressed in terms of pairwise Tanimoto similarity. The embodiment introduced three criteria to assess the success rate of the translation model: 1) the number of exact matches (T_c = 1.0), 2) the number of bioactively similar matches (0.85 < T_c < 1.00), and 3) the overall success rate, expressed as the mean Tanimoto similarity between the predicted and ground-truth sequences (series of fragments) over all test molecules. Table 4 shows the success rates for the single-reactant, double-reactant, and combined test sets based on the Tanimoto similarity metric.

| Dataset | Single | Double | Combined |
| --- | --- | --- | --- |
| Size: training pairs | 88,151 | 229,141 | 317,292 |
| Size: test pairs | 9,794 | 25,460 | 35,254 |
| Size: test molecules | 9,794 | 50,911 | 55,958 |
| Size: average pair length | 74 | 73 | 74 |
| Success rate, Bi-LSTM: T_c = 1.0 | 29.0% | 27.9% | 25.3% |
| Success rate, Bi-LSTM: 0.85 < T_c < 1.00 (a) | 28.7% | 10.5% | 12.9% |
| Success rate, Bi-LSTM: average similarity (b) | 0.84 | 0.66 | 0.68 |
| Success rate, LSTM: T_c = 1.0 | 22.9% | 21.6% | 19.4% |
| Success rate, LSTM: 0.85 < T_c < 1.00 (a) | 29.7% | 10.2% | 12.5% |
| Success rate, LSTM: average similarity (b) | 0.82 | 0.62 | 0.64 |

(a) Bioactively similar molecules. (b) Average similarity.

For single-reactant reactions, the bidirectional LSTM model of the embodiment achieved an accuracy of 57.7% when the first two criteria are combined: exact predictions and bioactively similar matches account for 29.0% and 28.7%, respectively. The mean T_c between the predicted and ground-truth sequences is 0.84. These results show that the model of the embodiment predicts single-reactant reactions with high accuracy.
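The three success criteria can be computed from a list of per-molecule Tanimoto values as sketched below. The function and key names are illustrative. The strict inequalities follow the interval 0.85 < T_c < 1.00 given for bioactively similar matches, so a value of exactly 0.85 is not counted.

```python
def summarize(similarities, threshold=0.85):
    """Return the three success criteria for a list of T_c values."""
    n = len(similarities)
    exact = sum(1 for t in similarities if t == 1.0)
    similar = sum(1 for t in similarities if threshold < t < 1.0)
    return {
        "exact": exact / n,                 # criterion 1: T_c = 1.0
        "bioactively_similar": similar / n, # criterion 2: 0.85 < T_c < 1.00
        "mean_tc": sum(similarities) / n,   # criterion 3: mean similarity
    }

report = summarize([1.0, 0.9, 0.85, 0.4])
print(report)
print(report["exact"] + report["bioactively_similar"])  # combined accuracy
```

The combined accuracy is the sum of the first two criteria, which for the single-reactant test set is 29.0% + 28.7% = 57.7%.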

For double-reactant reactions, the success rate of exact predictions (27.9%) is almost the same as for single-reactant reactions. However, the success rate of highly similar predictions dropped from 28.7% to 10.5%. For the combined set, 25.3% of the predictions were exact and 12.9% were highly similar. Likewise, the mean T_c fell from 0.84 to 0.66 for the double-reactant set, and the combined dataset gave 0.68.

One reason for the lower accuracy on the double-reactant and combined sets appears to be that the "-" symbol was not predicted properly. Another appears to be the frequent occurrence in the dataset of small molecules that are represented by only a few MACCS keys. Indeed, 477 molecules, appearing in 61,822 different reactions, are represented by fewer than 7 MACCS keys. More specifically, 3,944 reactions contain a reactant represented by a single one of the seven MACCS keys shown in FIG. 6, yet only 29 unique structures correspond to these keys. Because such small, simple structures are so common in the dataset, incorrectly predicted fragments have a large effect on the success rate (in the 1-bit case, a wrong prediction gives a similarity of zero).

The results of the embodiment also show that the bidirectional LSTM-based model outperforms the unidirectional LSTM-based model: with the unidirectional model, the success rate of exact matches is consistently about 6 percentage points lower across all datasets. This can be attributed to the fact that the MACCS key-based molecular representation of the embodiment does not depend on the order of the keys. In other words, most of the information about the molecules and chemical reactions is embedded in the co-occurrence of keys.

Because no reaction class information was provided to the model of the embodiment in advance, its prediction accuracy was compared with other retrosynthetic prediction methods that do not use reaction class labels. Several recent reports summarize the prediction accuracy of various models. According to the results compiled in [Lin, K.; Xu, Y.; Pei, J.; Lai, L. Automatic retrosynthetic route planning using template free models. Chem. Sci. 2020, 11, 3355-3364], top-1 accuracy ranges from 28.3% (the LSTM model on the USPTO 50K dataset of [Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent. Sci. 2017, 3, 1103-1113]) to 54.1% (the Transformer model on the USPTO MIT dataset of [Lin, K.; Xu, Y.; Pei, J.; Lai, L. Automatic retrosynthetic route planning using template free models. Chem. Sci. 2020, 11, 3355-3364]). As an alternative approach, the similarity-based model of [Coley, C. W.; Rogers, L.; Green, W. H.; Jensen, K. F. Computer-Assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci. 2017, 3, 1237-1245] achieved a top-1 accuracy of 37.3% on the USPTO 50K dataset.

To examine how the model of the embodiment learns the grammar of chemical reactions, FIG. 7 shows how the prediction accuracy at each threshold evolves over the training epochs for the single-reactant validation set. Specifically, FIG. 7 shows that the network successfully learned the reaction rules by capturing molecular transformations at the substructure level. The number of exact predictions (T_c = 1.0) rose sharply during the first 10 epochs, and after 20 epochs the value had almost tripled. The likelihood of predicting each fragment correctly increases during training, which is a clear indicator of successful learning. The improvement in exact predictions appears to come from a corresponding decrease in each class of non-exact matches, except for the very poor predictions (T_c < 0.50). The quality of the poor predictions (5% of the validation set) did not improve, apparently because of insufficient information, complexity, and noise in the data. Similar results were obtained for all the other datasets.

The embodiment assumes that a candidate reactant with T_c > 0.85 is sufficiently similar to its true counterpart. To verify this assumption, the quality of the candidate reactants was assessed by comparing them with the true reactants. The embodiment examined whether the following factors were correct: functional group interconversion or bond disconnection, reactive functional groups, core structure, and substituents. The accuracy of peripheral substituents was considered less important for matching the functionality of the reactants, particularly when they are simple alkyl groups.

FIG. 8 shows randomly selected predictions that illustrate the possible prediction cases. Similarity maps are provided to visualize the similarity between the candidates and the true reactants. Specifically, FIG. 8 shows that non-exact candidates differ in their degree of similarity. The similarity scores and similarity maps in FIG. 8 were computed with Morgan fingerprints and the Tanimoto metric. The colors indicate atom-level contributions to the overall similarity: green indicates an increase in the similarity score, red a decrease, and no color no effect.

반응 1은 8개의 탄소로 구성된 주사슬과 정확한 부위에 α, β-불포화 알데히드기가 정확하게 유도된 반응물을 도출하였다. 알데히드보다는 에스테르가 예상되었지만, 알데히드 환원은 또한 동일한 표적 알코올을 제공할 수 있다. 이는 실시예의 예측이 작용기 상호변환을 정확하게 확인했음을 나타낸다. 한편, 하나의 올레핀이 누락되고 4개 중 2개의 메틸기의 부위와 수가 잘못 해석되었다. Reaction 1 derived a reaction product in which a main chain composed of 8 carbons and α, β-unsaturated aldehyde groups were precisely induced at the correct site. Although an ester rather than an aldehyde was expected, aldehyde reduction can also provide the same target alcohol. This indicates that the predictions of the Examples correctly confirmed the functional group interconversion. On the other hand, one olefin was missing and the site and number of methyl groups of two out of four were misinterpreted.

In reaction 2, the core heterocyclic rings (pyridine and thiazole) and their linkage were generated correctly, except for the position of the ester group. In the actual reactant, a methyl ester group is attached to C6 of the pyridine, whereas in the candidate an ethyl ester group is attached to C4 of the thiazole ring. If the position of the ethyl ester group were correct, a single reduction step would suffice to obtain the alcohol group.

In reaction 3, the pyrazole core structure and its methyl ester group were predicted correctly. However, the chloride, which is one of the reactive functional groups, and a substituent on the pyrazole ring are missing, and the structure of the thiol was misinterpreted.

The prediction for reaction 4 shows that the embodiment correctly predicted the core structure, the bond disconnection, and the reactive functional groups. However, the number and sites of the halides were wrong.

In reaction 5, one reactant was predicted correctly, while the other was partially mispredicted. In the mispredicted candidate, the (2-naphthyl)vinyl group was predicted as a (phenyl)methyl group, but the reactive functional group, acylhydrazine, was predicted correctly.

The result for reaction 6 shows an exact match for N-hydroxyphthalimide as the precursor of the O-hydroxylamine. However, the structure of the alkyl halide lacks a phenylene group, so the estimation of the core structure largely failed for this reaction. The reactive functional groups and the bond disconnection, on the other hand, were proposed correctly.

A quantitative summary of the above evaluation is given in Table 5. Three criteria were weighted equally: functional group interconversion (FGI) or bond dissociation, core structure, and reactive functional groups. Together with the similarity score, they are used to form a chemically reasonable score. For the core structure, the longest carbon chain and/or ring, which may contain heteroatoms such as O, N, and S, is considered. Because functional group interconversion or bond dissociation is the most important part of retrosynthetic analysis, the exact position of the reaction site is scored strictly with true/false values corresponding to 1 and 0, respectively. The functional groups involved receive equal weights. The embodiment scored each candidate reactant individually and averaged the results to obtain the final score for each criterion.

In Table 5, the "FGI or bond dissociation" and "Reactive functional groups" columns give the accuracy in a true (1)/false (0) manner. The "Core structure" column gives the average accuracy of the core structures of the candidate molecules, taking into account the accuracy of the core structure itself as well as the type and position of side substituents. The source of error is given in parentheses. C1 denotes candidate 1 and C2 denotes candidate 2. For example, "C2=0.33, 2/3 fragments" means that the accuracy of candidate reactant 2 is 0.33 because two of its three fragments were mispredicted. The average of the three criteria is shown in a separate column. The T_c column gives the average T_c value of the candidate reactants.

Reaction No. | FGI or bond dissociation | Core structure | Reactive functional groups | Average | T_c
1 | 1.00 | 0.33 (2/3 fragments) | 1.00 | 0.78 | 0.64
2 | 1.00 | 0.67 (fragment site 1/3) | 1.00 | 0.89 | 0.85
3 | 1.00 | 0.69 (C1=0.88, side substructure of a fragment; C2=0.5, 1/2 fragments) | 0.50 (1 for thiol, 0 for chloride) | 0.73 | 0.57
4 | 1.00 | 0.96 (C1=0.92, site of a side substructure; C2=1.0, Cl excluded) | 1.00 | 0.99 | 0.86
5 | 1.00 | 0.83 (C1=0.67, 1/3 fragments; C2 correct) | 1.00 | 0.94 | 0.84
6 | 1.00 | 0.67 (C1 correct; C2=0.33, 2/3 fragments) | 1.00 | 0.89 | 0.73
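The arithmetic behind a row of Table 5 can be reproduced with a short sketch. It assumes, as the description states, that per-candidate and per-group scores are averaged within each criterion and that the three criteria are then averaged with equal weight; the function name `reaction_score` is hypothetical.

```python
from statistics import mean


def reaction_score(fgi, core_per_candidate, reactive_per_group):
    """Return (core, reactive, average) for one reaction, rounded to two decimals.

    fgi                -- 1.0 or 0.0 for FGI or bond dissociation
    core_per_candidate -- core-structure accuracy of each candidate reactant
    reactive_per_group -- 1/0 score of each reactive functional group
    """
    core = mean(core_per_candidate)
    reactive = mean(reactive_per_group)
    average = mean([fgi, core, reactive])  # three criteria, equal weights
    return round(core, 2), round(reactive, 2), round(average, 2)


# Reaction 3 of Table 5: C1=0.88, C2=0.5; thiol scored 1, chloride scored 0.
row3 = reaction_score(1.0, [0.88, 0.5], [1, 0])
# Reaction 4 of Table 5: C1=0.92, C2=1.0; reactive functional groups correct.
row4 = reaction_score(1.0, [0.92, 1.0], [1])
```

For these two rows the sketch gives the tabulated core-structure, reactive-group, and average values.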

It is noteworthy that the embodiment correctly predicted the functional group interconversion or bond dissociation for all six reactions. The reactive functional groups were also reproduced correctly, except in reaction 3. The prediction errors that affect the score are mainly associated with the core structure. The embodiment applied this structure-based scoring strategy to a more specific set of 10 randomly selected reactions whose candidate reactants lie, on average, in the bioactively similar region (T_c = 0.87). FIGS. 9 and 10 and Table 6 show this in detail. The results clearly show that, for physiologically and chemically similar reactant candidates, the embodiment predicts both the reactive functional groups and the functional group interconversion or bond dissociation very accurately.

Chemical inspection of the reactions shows that the average similarity score is closely related to the score generated from the structural data. The scoring approach of the embodiment gives a clear indication that the quality of the candidate reactants and of the similarity scores is consistent with what is obtained by manual inspection. The similarity measure may score lower than the structure-based score because it also accounts for side chains and geometric elements. Although the similarity score is somewhat difficult to interpret for an objective evaluation, it can be used to assess the prediction quality of retrosynthesis. A high similarity score indicates that the desired molecule is more synthetically accessible under the rules of organic chemistry.

In Table 6, the "FGI or bond dissociation" and "Reactive functional groups" columns give the accuracy in a true (1)/false (0) manner. The "Core structure" column gives the average accuracy of the core structures of the candidate molecules, taking into account the accuracy of the core structure itself as well as the type and position of side substituents. The source of error is given in parentheses. C1 denotes candidate 1 and C2 denotes candidate 2. For example, "C2=0.33, 2/3 fragments" means that the accuracy of candidate reactant 2 is 0.33 because two of its three fragments were mispredicted. The average of the three criteria is shown in a separate column. The T_c column gives the average T_c value of the candidate reactants.

Reaction No. | FGI or bond dissociation | Core structure | Reactive functional groups | Average | T_c
1 | 1.00 | 0.98 (C1=1.00; C2=0.95, alkyl #C 6/5) | 1.00 | 0.99 | 0.87
2 | 1.00 | 0.83 (C1=1.00; C2=0.67, 1/3 fragments) | 1.00 | 0.94 | 0.81
3 | 1.00 | 1.00 | 1.00 | 1.00 | 0.84
4 | 1.00 | 0.75 (C1=1.00; C2=0.50, 1/2 fragments) | 1.00 | 0.92 | 0.87
5 | 1.00 | 0.79 (C1=0.75, fragment site 1/2; C2=0.83, fragment site 1/3) | 1.00 | 0.93 | 0.86
6 | 1.00 | 0.88 (C1=0.75, fragment site 1/2; C2=1.00) | 1.00 | 0.96 | 0.94
7 | 1.00 | 0.96 (C1=0.97, ring #C 5/6; C2=0.94, alkyl #C 5/4) | 1.00 | 0.99 | 0.91
8 | 1.00 | 1.00 | 1.00 | 1.00 | 0.87
9 | 1.00 | 1.00 | 1.00 | 1.00 | 0.83
10 | 1.00 | 0.97 (C1=1.00; C2=0.94, site of a side substructure) | 1.00 | 0.99 | 0.85
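As a quick consistency check on Table 6, the sketch below averages the "Average" and T_c columns over the ten reactions. The values are copied from the table; the variable names are illustrative only.

```python
from statistics import mean

# (average of the three criteria, T_c) for reactions 1 to 10, copied from Table 6.
table6 = [
    (0.99, 0.87), (0.94, 0.81), (1.00, 0.84), (0.92, 0.87), (0.93, 0.86),
    (0.96, 0.94), (0.99, 0.91), (1.00, 0.87), (1.00, 0.83), (0.99, 0.85),
]

mean_structural = mean(score for score, _ in table6)  # structure-based score
mean_tc = mean(tc for _, tc in table6)                # similarity score
```

The mean T_c comes out at about 0.865, in line with the T_c = 0.87 quoted for this set, and it is lower than the mean structure-based score of about 0.97, as the description notes.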

Advantages of the embodiment

The main advantage of the word-based MACCS key method of the embodiment over character-based SMILES methods is that the network only needs to learn relatively simple grammatical rules (the ascending order and co-occurrence of keys) to produce meaningful results. In a SMILES-based method, the network must understand not only the complex grammar of SMILES but also the canonical representation in order to predict synthetically correct sequences. As summarized in the literature [Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent. Sci. 2017, 3, 1103-1113], the difficulty of learning the syntactic structure of SMILES notation can lead to problematic outputs such as invalid SMILES. In general, existing character-based models tend to generate candidates that are textually invalid, textually valid but chemically unreasonable, or textually and chemically valid but infeasible. The embodiment resolves these problems by projecting the SMILES representation of the molecular structure into the substructure domain.
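The simple grammar mentioned above (keys listed in ascending order) can be illustrated with a small sketch. The word format `k<index>` and both function names are assumptions made for illustration; the embodiment's actual word or character assignment may differ.

```python
def keys_to_sentence(bit_vector):
    """Write the non-zero keys of a bit vector as words in ascending key order."""
    return " ".join(f"k{i}" for i, bit in enumerate(bit_vector) if bit)


def is_well_formed(sentence):
    """Check the ascending-order rule: key indices strictly increase."""
    indices = [int(word[1:]) for word in sentence.split()]
    return all(a < b for a, b in zip(indices, indices[1:]))


# Hypothetical 10-bit vector with keys 2, 5, and 7 switched on.
bits = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
sentence = keys_to_sentence(bits)
```

A predicted sequence that violates the ordering rule can be rejected with a check like `is_well_formed`, which is far simpler than validating SMILES syntax.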

In general, the probability of making a correct retrosynthetic prediction is relatively low. In fact, the accuracy of retrosynthesis planning is roughly half the accuracy achieved in forward reaction prediction. This is particularly so when several possible synthetic routes are assumed to be available for the forward reaction. It is worth noting that the content of the dataset used for the reverse mapping may also account for the behavior of the network. Given the level of abstraction used to describe molecules in the dataset of the embodiment, mapping a reactant from the reactant domain to the product domain and then back does not necessarily return the original reactant. The existence of one-to-many mappings from the product domain to the reactant domain can cause confusion during the learning process. Based on this observation, a simple idea was adopted to ensure a stronger pairwise functional relationship between the domains: the embodiment identified all one-to-many mappings and reduced each of them to a one-to-one mapping by selecting the molecule with the shortest sequence length (presumed to be the molecule with the lowest structural complexity).
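The reduction from one-to-many to one-to-one mappings can be sketched as below. The sketch assumes that each side of a pair is a space-separated sequence of key words and that "sequence length" means the number of words; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def reduce_to_one_to_one(pairs):
    """Keep, for each product, the reactant sequence with the fewest words.

    pairs -- iterable of (product_sequence, reactant_sequence) strings
    """
    grouped = defaultdict(list)
    for product, reactants in pairs:
        grouped[product].append(reactants)
    # min() returns the first shortest entry, so ties keep dataset order.
    return {p: min(rs, key=lambda r: len(r.split())) for p, rs in grouped.items()}


pairs = [
    ("k1 k4", "k1 k2 k3"),  # product "k1 k4" appears twice (one-to-many)
    ("k1 k4", "k1 k2"),
    ("k5", "k4"),
]
mapping = reduce_to_one_to_one(pairs)
```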

In particular, the model of the embodiment provides consistent predictions. In independent runs on the same input molecule, the model consistently produced the same output. This robustness may be due to the low complexity and good interpretability of the molecular descriptor. Retrosynthesis models generally report overall model performance with top-N accuracy scores. However, as discussed in the recent literature [Schwaller, P.; Petraglia, R.; Zullo, V.; Nair, V. H.; Haeuselmann, R. A.; Pisoni, R.; Bekas, C.; Iuliano, A.; Laino, T. Predicting retrosynthetic pathways using transformer based models and a hyper-graph exploration strategy. Chem. Sci. 2020, 11, 3316-3325], the top-N accuracy score may not be a suitable metric for evaluating retrosynthesis models, because for each suggestion the model tends to produce the answer expected from the dataset rather than a chemically more meaningful prediction. Although MACCS keys have been criticized for poor performance in similarity benchmarks, the advantage of this descriptor is the one-to-one correspondence between bits and substructures, in contrast to fingerprints obtained by an exhaustive generation algorithm followed by a hashing procedure. The MACCS keys were therefore a natural choice for a proof-of-concept test of the translation methodology.

A drawback of using a substructure-based representation is that a predicted sequence is difficult to convert back into a molecule. The SMILES representation of a molecular structure is not unique, but it is reversible. The MACCS key representation, in contrast, is unique but irreversible. For this reason, the embodiment must perform a lookup in a database of precomputed MACCS keys to retrieve the molecular structure. The embodiment constructed a lookup table from the USPTO reaction dataset and matched it against the MACCS keys predicted for the reactant molecules. Candidate reactants may also be retrieved from databases of commercially available chemicals or from reaction databases by similarity search. Some online chemical databases such as PubChem (see [Nakata, M.; Shimazaki, T. PubChemQC Project: A Large-Scale First-Principles Electronic Structure Database for Data-Driven Chemistry. Journal of Chemical Information and Modeling 2017, 57, 1300-1308]) also allow MACCS key based queries and may be used for the same purpose.
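The lookup step can be pictured with the following sketch. It assumes the key sets have been computed beforehand by some cheminformatics toolkit; the SMILES strings and key indices below are toy values, not real MACCS assignments, and the function names are hypothetical.

```python
from collections import defaultdict


def build_lookup(molecules):
    """Map each set of on-keys to every molecule that has exactly that set.

    molecules -- iterable of (smiles, iterable_of_on_key_indices)
    """
    table = defaultdict(list)
    for smiles, keys in molecules:
        table[frozenset(keys)].append(smiles)
    return table


def retrieve(table, predicted_keys):
    """Return all molecules whose key set equals the predicted one."""
    return table.get(frozenset(predicted_keys), [])


# Toy entries: two molecules share the same key set, one does not.
table = build_lookup([("CCO", {1, 2}), ("CCN", {1, 3}), ("OCC", {2, 1})])
hits = retrieve(table, [2, 1])
```

Because the representation is irreversible, a single predicted key set may return several molecules, which then need to be ranked by a finer-grained similarity measure.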

FIG. 11 shows reaction 4 of FIG. 8 in more detail. The similarity scores were computed using circular (Morgan) fingerprints of radius 2 and the Tanimoto metric. Unique fragments are indicated by SMARTS patterns. FIG. 11 also shows five candidates retrieved from the USPTO reaction dataset for the first reactant of reaction 4 of FIG. 8. All five reactants are associated with other reactions in the database, and the MACCS key representations of the retrieved molecules are identical. This means that more than one match corresponding to a predicted sequence can sometimes be found. Such closely related analogs can be ranked by computing Tanimoto coefficients with path-based or circular fingerprints, because these fingerprints differ across the same set. For this purpose, the embodiment used circular fingerprints of radius 2 as bit vectors and selected the molecule with the highest similarity value among the candidates as the final result.
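The ranking of candidates that share one MACCS key set can be sketched as follows. The sketch assumes that circular fingerprints are already available as sets of on-bit indices and that similarity is measured against the actual reactant, as in the evaluation; the candidate names, bit sets, and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_best(candidates, reference_bits):
    """Return (name, score) of the candidate most similar to the reference.

    candidates -- dict mapping a candidate name to its set of on-bit indices
    """
    scored = {name: tanimoto(bits, reference_bits) for name, bits in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


reference = {1, 2, 3, 4}
candidates = {"cand_a": {1, 2, 3, 4, 5}, "cand_b": {1, 2}, "cand_c": {1, 2, 3, 4}}
best_name, best_score = pick_best(candidates, reference)
```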

Conclusion

The embodiment provides a sequence-to-sequence NMT model that automatically extracts the reaction rules of chemical reactions by learning relationships at the substructure level. By constructing an abstract language with a small, fixed-size vocabulary from the non-zero elements of the MACCS keys, three conceptual problems are addressed together: (i) erratic predictions: SMILES-based representations easily expose model outputs to errors; (ii) synthetic accessibility: predicted molecules may not be synthetically accessible; and (iii) the top-N accuracy metric: the suggestions of a model can change from one run of the model to another. Comparison and quality inspection show that the embodiment successfully provides candidate reactants in the range 0.85 < T_c ≤ 1.00 and achieves a high overall level of accuracy, particularly for functional group interconversion or bond dissociation and for reactive functional groups. The approach according to the embodiment has high potential for application in various fields of organic chemistry.

The embodiments of the present invention described above may be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

In addition, the embodiments of the present invention described above may be a set of program instructions executable by various computer components, or the user application that executes them. Specifically, an embodiment may be the program itself, which can be downloaded via a server or a storage medium and installed on a client computer.

Although the present invention has been described above with reference to specific matters such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can devise various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above; the claims set out below, as well as everything modified equally or equivalently to those claims, fall within the scope of the spirit of the present invention.

10: retrosynthesis system 11: control unit
12: communication unit 13: input/output interface unit
14: memory unit 15: input unit
16: display unit 20: external server

Claims

1. A retrosynthesis prediction method using neural machine translation, implementable in a retrosynthesis system including a communication unit, a control unit, and a memory unit, the method comprising:
generating an initial product-reactant pair by receiving a reaction dataset of products and reactants from an external server through the communication unit, expressing the received reaction dataset as a descriptor composed of characters, and storing the initial product-reactant pair expressed by the descriptor in the memory unit;
filtering the initial product-reactant pair by applying one or more filters;
applying one-to-one mapping to the filtered product-reactant pair;
generating a final product-reactant pair by aligning the product-reactant pair to which the one-to-one mapping has been applied;
training a neural machine translation (NMT) model using the final product-reactant pair;
evaluating the trained NMT model; and
when a new product is input to the NMT model, predicting and calculating a candidate reactant corresponding to the new product.

2. The method of claim 1,
wherein expressing the received reaction dataset as a descriptor composed of characters comprises:
preparing MACCS (Molecular ACCess System) keys representing a molecule;
generating a character descriptor for the MACCS keys by assigning a predetermined character to each of the MACCS keys;
expressing the received reaction dataset with the MACCS keys; and
converting the reaction dataset expressed with the MACCS keys into the character descriptor.

3. The method of claim 2,
wherein preparing the MACCS keys comprises generating curated MACCS keys by removing, from among the MACCS keys, those whose frequency of occurrence is less than or equal to a predetermined value.

4. The method of claim 3,
wherein, in generating the character descriptor for the MACCS keys, the predetermined character is assigned to each of the curated MACCS keys.
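The key curation and character assignment recited in claims 3 and 4 might look like the sketch below. The threshold, the starting code point for the characters, and the function names are all assumptions for illustration; the claims only require a predetermined frequency threshold and a predetermined character per key.

```python
from collections import Counter


def curate_keys(fingerprints, min_count):
    """Drop keys that occur in min_count molecules or fewer; return the rest sorted."""
    counts = Counter(key for fp in fingerprints for key in set(fp))
    return sorted(key for key, n in counts.items() if n > min_count)


def assign_characters(curated_keys, first_codepoint=0x100):
    """Give each curated key its own character (the code point range is arbitrary)."""
    return {key: chr(first_codepoint + i) for i, key in enumerate(curated_keys)}


# Toy data: key 1 occurs once, key 2 twice, key 3 three times.
fingerprints = [{1, 2, 3}, {2, 3}, {3}]
curated = curate_keys(fingerprints, min_count=1)
charmap = assign_characters(curated)
```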

5. The method of claim 1,
wherein filtering the initial product-reactant pair by applying one or more filters
uses at least one of a filter that removes pairs having three or more reactants, a filter that removes identical product-reactant pairs, a filter that removes internal twins, and a filter that removes pairs with a sequence length of 100 or more.
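A possible reading of the four filters is sketched below. Two interpretations are assumptions and are marked as such in the comments: an "identical product-reactant pair" is taken to mean a pair whose reactant equals its product, and "internal twins" is taken to mean duplicate entries in the dataset. The function name and data layout are hypothetical.

```python
def apply_filters(pairs, max_len=100):
    """Keep the (product, reactants) pairs that pass all four filters.

    product   -- tuple of key words
    reactants -- tuple of reactants, each a tuple of key words
    """
    seen = set()
    kept = []
    for product, reactants in pairs:
        if len(reactants) >= 3:            # three or more reactants
            continue
        if product in reactants:           # assumed meaning of "identical pair"
            continue
        if (product, reactants) in seen:   # assumed meaning of "internal twins"
            continue
        lengths = [len(product)] + [len(r) for r in reactants]
        if max(lengths) >= max_len:        # sequence length of 100 or more
            continue
        seen.add((product, reactants))
        kept.append((product, reactants))
    return kept


p, r1, r2, r3 = ("k1", "k2"), ("k1",), ("k3",), ("k4",)
data = [(p, (r1,)), (p, (r1,)), (p, (r1, r2, r3)), (p, (p,)),
        (("k1",) * 100, (r1,)), (p, (r1, r2))]
filtered = apply_filters(data)
```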

6. The method of claim 1,
wherein applying one-to-one mapping to the filtered product-reactant pair comprises
identifying one-to-many mappings in the filtered product-reactant pairs and reducing each one-to-many mapping to a one-to-one mapping by selecting the molecule having the shortest sequence length.

7. The method of claim 1,
wherein generating the final product-reactant pair by aligning the product-reactant pair to which the one-to-one mapping has been applied comprises,
when the product-reactant pair has two or more reactants, arranging the reactants in descending order of length and connecting them with a symbol placed between the reactants.
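The alignment step of claim 7 can be sketched in a few lines. The separator symbol "." and the use of word count as the length are assumptions; the claim only states that reactants are ordered by descending length and joined by a symbol.

```python
def join_reactants(reactants, separator="."):
    """Order reactant sequences from longest to shortest and join them.

    reactants -- list of space-separated key-word sequences
    separator -- symbol placed between reactants (hypothetical choice)
    """
    # sorted() is stable, so reactants of equal length keep their input order.
    ordered = sorted(reactants, key=lambda r: len(r.split()), reverse=True)
    return f" {separator} ".join(ordered)


target = join_reactants(["k1 k2", "k1 k2 k3"])
```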

8. The method of claim 1,
wherein the neural machine translation model is a sequence-to-sequence model.

9. The method of claim 8,
wherein the neural machine translation model includes two bidirectional long short-term memories (LSTMs),
one of the two bidirectional LSTMs functioning as an encoder and the other functioning as a decoder.

10. The method of claim 9,
wherein the encoder and the decoder are connected through an attention mechanism.

11. The method of claim 9,
wherein each of the bidirectional LSTMs includes a hidden layer,
and the neural machine translation model further comprises a dropout layer after the hidden layer.

12. The method of claim 10 or 11,
wherein the neural machine translation model further applies gradient clipping so that the norm of the gradient does not exceed a threshold value during backpropagation.
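Gradient clipping by norm, as recited in claim 12, can be illustrated with a dependency-free sketch. It treats the gradient as a flat list of numbers and uses the Euclidean norm; both choices, the function name, and the example threshold are assumptions, since the claim does not fix them.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale a gradient so that its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)      # already within the threshold, left unchanged
    scale = max_norm / norm    # shrink every component by the same factor
    return [g * scale for g in grad]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is scaled down to 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is kept as is
```

Deep-learning frameworks provide equivalent utilities that operate on all model parameters at once; the sketch only shows the underlying arithmetic.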

13. The method of claim 12,
wherein the learning rate of the neural machine translation model is reduced by a constant factor every three epochs.
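The step-wise learning rate schedule of claim 13 can be written as a one-line function. The initial learning rate and the decay factor below are placeholder values chosen for illustration; only the three-epoch interval comes from the claim.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.5, every=3):
    """Learning rate for a zero-based epoch, multiplied by `decay` every `every` epochs."""
    return initial_lr * decay ** (epoch // every)


# Epochs 0-2 use the initial rate, epochs 3-5 one decay step, and so on.
schedule = [learning_rate(e, initial_lr=1.0) for e in range(7)]
```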

14. The method of claim 1,
wherein evaluating the trained NMT model comprises
evaluating prediction accuracy by calculating the similarity between a candidate reactant predicted by the NMT model and the actual reactant obtained from the final product-reactant pair.

15. The method of claim 1,
wherein predicting and calculating the candidate reactant
further comprises predicting and calculating a secondary candidate reactant for synthesizing the candidate reactant.