KR20210026543A

KR20210026543A - A system of predicting biological activity for compound with target protein using geometry images and artificial neural network

Info

Publication number: KR20210026543A
Application number: KR1020190107483A
Authority: KR
Inventors: 조경민; 이상윤
Original assignee: 주식회사 에일론
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2021-03-10

Abstract

The present invention relates to a system for predicting the activity of a protein-binding compound based on artificial neural network models. According to the present invention, artificial neural network models that predict whether a compound is active against a target protein are combined, and learning and prediction are then performed. The system includes: a data management module that stores activity data of compounds with respect to proteins and three-dimensional shape or structure data of the proteins and the compounds; an individual neural network module that is provided with its own individual neural network model, extracts a descriptor from the three-dimensional shape or structure data of a protein or compound, trains the individual neural network model with the extracted descriptor and the activity data, and stores the result data for secondary training output by applying data to the trained individual neural network model; an integrated neural network module that is provided with its own integrated neural network model, generates secondary training data by integrating the result data for secondary training generated by each individual neural network module, and trains the integrated neural network model with the generated secondary training data; and an activity prediction module that predicts activity for a query protein and a query compound by using the individual neural network model and the integrated neural network model. Because the system learns and predicts protein-compound binding information using a plurality of artificial neural network models, data-dependent bias can be reduced and accuracy can be improved even for a difficult prediction problem.

Description

{A system of predicting biological activity for compound with target protein using geometry images and artificial neural network}

The present invention relates to a system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models. To handle hit discovery, an early stage of new drug development, quickly and efficiently, the system uses artificial intelligence to learn binding information between proteins and compounds, and then determines and predicts whether a given protein and compound can bind and show activity.

The present invention also relates to such a system in which, in order to combine a plurality of artificial neural network models that predict whether a compound is active against a target protein, each individual artificial neural network model outputs probability distribution data on the degree of reaction between a protein and a compound, and a separate integrated artificial neural network is constructed that uses this output as training data.

Computer-aided drug development is widely used in the drug discovery stage, and two main approaches are used to discover hit compounds. The first is compound-based drug design, which finds or creates compounds with similar chemical characteristics, based on the fact that a certain compound binds to a protein of a specific structure. The second is protein structure-based drug design, which uses prior knowledge of the three-dimensional structure of a protein to find active compounds among compounds whose shape and size are similar to that structure.

Protein structure-based drug design can itself be divided into three broad categories. The first uses a docking program to quickly find the compound whose shape is most similar to the shape of a binding site or binding pocket [Patent Document 1]. The second assembles atoms or molecular fragments of a compound to fit the size of the binding pocket. The third optimizes a known conformation among the compounds that show activity in the binding cavity. Among these, methods in the form of virtual screening are being actively applied thanks to advances in computer technology [Patent Document 2].

On the technical side of virtual screening, the field is moving from the conventional, mechanical search for compounds active against a target protein in libraries of hundreds to tens of millions of compounds toward methods that use artificial neural networks (deep learning). In 2015, Atomwise introduced AtomNet, presented as the world's first technology to predict the binding affinity of molecules and the binding structure of a new drug target (protein).

However, when a difficult prediction problem is addressed with an artificial neural network built as only a single model, bias often appears depending on the data, and accuracy drops. Combining various artificial neural networks for prediction could raise the accuracy.

One way to combine various artificial neural network models is to build an environment in which an integrated artificial neural network model can control a plurality of individual artificial neural network models, and then perform training. When such an integrated model is applied in practice, however, the following problems arise.

First, when building each single artificial neural network is itself complex, a separate network is built for each type, and the development and maintenance costs of managing the different individual networks in a unified way increase sharply.

In addition, when the individual artificial neural networks are integrated and trained simultaneously, the artificial intelligence server may be overloaded, or training itself may become impossible for lack of main memory.

In addition, when the integrated artificial neural network is actually trained after each individual artificial neural network has been completed, the training datasets fed to the individual networks must all be randomly shuffled in the same order. Random shuffling is needed because, without it, an artificial neural network can make predictions biased toward the data it learned last.
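The "same random order" requirement can be illustrated with a minimal Python sketch. It assumes each individual network has its own input list aligned row by row with the same protein-compound pairs; the names `shared_permutation`, `reorder`, and the sample values are hypothetical and do not come from the patent.

```python
import random


def shared_permutation(n, seed):
    """Return one random ordering of the indices 0..n-1, reproducible from the seed."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def reorder(rows, order):
    """Apply the same index ordering to any per-sample list."""
    return [rows[i] for i in order]


# Hypothetical inputs: row k of every list describes the same protein-compound pair.
pairs = ["P1:C1", "P1:C2", "P2:C1", "P2:C2"]
fourier_inputs = ["f0", "f1", "f2", "f3"]   # inputs of one individual network
homology_inputs = ["h0", "h1", "h2", "h3"]  # inputs of another individual network

# One permutation is drawn once and reused, so all datasets stay aligned.
order = shared_permutation(len(pairs), seed=7)
shuffled_pairs = reorder(pairs, order)
shuffled_fourier = reorder(fourier_inputs, order)
shuffled_homology = reorder(homology_inputs, order)
```

Drawing the permutation once and applying it everywhere is what keeps row k of every dataset tied to the same pair; shuffling each dataset independently would break that alignment.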

In addition, the developer implementing the integrated artificial neural network must understand the composition and data format of the input data (training dataset) of each individual artificial neural network, and must also obtain control over the data itself.

Therefore, a technology is needed that overcomes these difficulties, reduces the enormous amount of time required, and simplifies the entire process.

Korean Patent Application Publication No. 10-2018-0058648 (published June 1, 2018)
Korean Patent Application Publication No. 10-2019-0000167 (published January 2, 2019)

http://dude.docking.org/
Connolly, M. L., "Analytical molecular surface calculation.", J. Appl. Cryst. 1983, 16, 548-558

An object of the present invention is to solve the problems described above by providing a system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models. To combine a plurality of artificial neural network models that predict whether a compound is active against a target protein, the system has each individual artificial neural network model output probability distribution data on the degree of reaction between a protein and a compound, and constructs a separate integrated artificial neural network that uses this output as training data.

To achieve the above object, the present invention relates to a system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models, the system comprising: a data management module that stores activity data of compounds with respect to proteins, and three-dimensional shape data or three-dimensional structure data of the proteins and the compounds; an individual neural network module that is provided with its own individual neural network model, extracts a descriptor from the three-dimensional shape data or three-dimensional structure data of a protein or compound, trains the individual neural network model with the extracted descriptor and the activity data, and stores the result data for secondary training output by applying data to the trained individual neural network model; an integrated neural network module that is provided with its own integrated neural network model, generates secondary training data by integrating the result data for secondary training generated by each individual neural network module, and trains the integrated neural network model with the generated secondary training data; and an activity prediction module that predicts activity for a query protein and a query compound using the individual neural network model and the integrated neural network model, wherein at least two individual neural network modules are provided and form one individual neural network module group.

In addition, in the system of the present invention, the individual neural network module generates primary training data from the extracted descriptor and the activity data, trains the individual neural network model with one part of the primary training data, and generates the result data for secondary training by applying another part of the primary training data to the trained individual neural network model.
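The two-part use of the primary training data can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a scalar stands in for the descriptor, a hypothetical `CentroidModel` stands in for the individual neural network model, and the 50/50 split and record field names are chosen only for the example.

```python
def split_primary(records, train_fraction=0.5):
    """Split primary training data into a training part and a held-out part."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


class CentroidModel:
    """Hypothetical stand-in for an individual model; it is not a neural network."""

    def fit(self, records):
        active = [r["descriptor"] for r in records if r["label"] == 1]
        inactive = [r["descriptor"] for r in records if r["label"] == 0]
        self.c_active = sum(active) / len(active)
        self.c_inactive = sum(inactive) / len(inactive)
        return self

    def predict_proba(self, descriptor):
        """Return (p_inactive, p_active); the closer centroid gets more mass."""
        d_active = abs(descriptor - self.c_active)
        d_inactive = abs(descriptor - self.c_inactive)
        total = d_active + d_inactive
        p_active = 0.5 if total == 0 else d_inactive / total
        return (1.0 - p_active, p_active)


# Made-up primary training data: identifiers, a scalar descriptor, and a label.
primary = [
    {"protein": "P1", "compound": "C1", "descriptor": 0.9, "label": 1},
    {"protein": "P1", "compound": "C2", "descriptor": 0.1, "label": 0},
    {"protein": "P1", "compound": "C3", "descriptor": 0.8, "label": 1},
    {"protein": "P1", "compound": "C4", "descriptor": 0.2, "label": 0},
    {"protein": "P2", "compound": "C1", "descriptor": 0.7, "label": 1},
    {"protein": "P2", "compound": "C2", "descriptor": 0.3, "label": 0},
]

# One part trains the individual model; the other part is only passed through it.
train_part, held_out = split_primary(primary, train_fraction=0.5)
model = CentroidModel().fit(train_part)

# Result data for secondary training: identifiers, predicted probabilities, label.
secondary_results = [
    {"protein": r["protein"], "compound": r["compound"],
     "probs": model.predict_proba(r["descriptor"]), "label": r["label"]}
    for r in held_out
]
```

Because the held-out records were never seen during training, the probabilities passed on to the integrated model reflect how the individual model behaves on unseen pairs.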

In addition, in the system of the present invention, the primary training data includes identification information of the protein and the compound, descriptors for the protein and the compound, and a label value determined from the activity data of the compound with respect to the protein. For each record of the primary training data, the individual neural network module applies the descriptor of that record to the individual neural network model to obtain prediction result data, assigns the label of that record to the prediction result data to generate the result data for secondary training, and includes the identification information of the protein and the compound in the result data for secondary training.

In addition, in the system of the present invention, the output data of the individual neural network model is output as probability values of the label values, and the integrated neural network module integrates the result data for secondary training by joining it on protein-compound pairs: where the protein-compound pair matches, the result data for secondary training of all individual neural network modules is combined into a single record to generate the secondary training data.
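The pair-based integration described above resembles a join on the (protein, compound) key. The sketch below is one possible reading, with assumptions labeled: the function name `integrate`, the record layout, the module names, and the choice to drop pairs missing from any module are all hypothetical and not specified in the source.

```python
def integrate(results_by_module):
    """Join per-module result records on (protein, compound) into one row per pair.

    results_by_module maps a module name to a list of records shaped like
    {"protein", "compound", "probs", "label"} (hypothetical layout).
    """
    module_names = sorted(results_by_module)
    by_pair = {}
    for name in module_names:
        for rec in results_by_module[name]:
            key = (rec["protein"], rec["compound"])
            entry = by_pair.setdefault(key, {"label": rec["label"], "probs": {}})
            entry["probs"][name] = rec["probs"]

    rows = []
    for key in sorted(by_pair):
        entry = by_pair[key]
        # Assumption: a pair that some module did not score is left out.
        if len(entry["probs"]) != len(module_names):
            continue
        features = [p for name in module_names for p in entry["probs"][name]]
        rows.append({"protein": key[0], "compound": key[1],
                     "features": features, "label": entry["label"]})
    return rows


# Made-up outputs of two individual modules, as (p_inactive, p_active) tuples.
results = {
    "fourier": [
        {"protein": "P1", "compound": "C1", "probs": (0.2, 0.8), "label": 1},
        {"protein": "P1", "compound": "C2", "probs": (0.7, 0.3), "label": 0},
    ],
    "homology": [
        {"protein": "P1", "compound": "C2", "probs": (0.6, 0.4), "label": 0},
        {"protein": "P1", "compound": "C1", "probs": (0.1, 0.9), "label": 1},
        {"protein": "P2", "compound": "C9", "probs": (0.5, 0.5), "label": 0},
    ],
}
secondary_training_data = integrate(results)
```

Joining on the identifiers rather than on row position means the individual modules do not need to emit their results in the same order.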

In addition, in the system of the present invention, the individual neural network module group includes at least one individual neural network module that generates a descriptor from three-dimensional shape data and at least one individual neural network module that generates a descriptor from three-dimensional structure data.

In addition, in the system of the present invention, the individual neural network module generates a descriptor by extracting one-dimensional data from the three-dimensional shape data using a Fourier transform, by extracting a two-dimensional image from the three-dimensional structure data using homology, or by projecting the three-dimensional shape data onto a two-dimensional image.

In addition, in the system of the present invention, the three-dimensional shape data is closed surface data, that is, three-dimensional data describing a closed mesh surface, and the three-dimensional structure data is three-dimensional data on the atom positions in the chemical structure of a protein or compound, composed of three-dimensional points.
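The distinction between the two data types can be made concrete with a small Python sketch. The class names `StructureData` and `ShapeData` are hypothetical, and the sketch assumes a triangle mesh; the closedness check uses the standard property that in a closed (watertight) triangle mesh every edge is shared by exactly two triangles.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StructureData:
    """Three-dimensional structure data: atom positions as 3D points."""
    points: list  # list of (x, y, z) tuples


@dataclass
class ShapeData:
    """Three-dimensional shape data: a surface mesh of vertices and triangles."""
    vertices: list   # list of (x, y, z) tuples
    triangles: list  # list of (i, j, k) vertex-index triples

    def is_closed(self):
        """True if every edge is shared by exactly two triangles."""
        edge_count = Counter()
        for a, b, c in self.triangles:
            for u, v in ((a, b), (b, c), (c, a)):
                edge_count[frozenset((u, v))] += 1
        return bool(edge_count) and all(n == 2 for n in edge_count.values())


# A tetrahedron is the smallest closed triangle mesh.
corners = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tetrahedron = ShapeData(
    vertices=corners,
    triangles=[(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)],
)
atoms = StructureData(points=corners)  # the same coordinates viewed as a point set
open_surface = ShapeData(vertices=corners, triangles=[(0, 1, 2), (0, 1, 3), (0, 2, 3)])
```

Here `tetrahedron.is_closed()` holds while `open_surface.is_closed()` does not, since removing one face leaves three edges with only one adjacent triangle.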

As described above, the system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models according to the present invention learns and predicts protein-compound binding information using a plurality of artificial neural network models, which reduces data-dependent bias and correspondingly raises accuracy even for difficult prediction problems. That is, regions that the individual artificial neural networks cannot predict can be distinguished with improved discriminative power.

In addition, according to the present invention, training each artificial neural network model independently allows the available resources of the entire system to be distributed appropriately and significantly reduces development and maintenance costs.

According to the system of the present invention, by using artificial intelligence to learn protein-compound binding information and predict activity, a large number of active compound candidates can be extracted in a short time from the structural information of compounds bound to the protein. This makes it possible to quickly select the compounds to be tested experimentally and to greatly reduce the time and cost of hit discovery.

FIG. 1 is a configuration diagram of the overall system for implementing the present invention.
FIG. 2 is a block diagram of the configuration of the system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models according to the present invention.
FIG. 3 is a block diagram of the detailed configuration of the system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models according to the present invention.
FIG. 4 illustrates extraction of the three-dimensional surface shape of a compound and of the substrate-binding site of a protein according to an embodiment of the present invention, showing (a) the compound and (b) the substrate-binding site of the protein.
FIG. 5 is an exemplary diagram of calculating Fourier coefficient vectors by Fourier transform of a protein and a compound according to an embodiment of the present invention.
FIG. 6 is an exemplary graph of a persistence diagram according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram of a homology image according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a process of generating a descriptor according to an embodiment of the present invention.
FIG. 9 illustrates a process of generating geometry images by applying spherical parametrization to a three-dimensional shape according to an embodiment of the present invention.
FIG. 10 is a table illustrating integrated data for secondary training data according to an embodiment of the present invention.
FIG. 11 is a table showing the performance evaluation (based on softmax values) of the homology-based individual neural network model in an experiment of the present invention.
FIG. 12 is a table showing the performance evaluation (based on softmax values) of the Fourier-based individual neural network model in an experiment of the present invention.
FIG. 13 is a table showing the performance evaluation of the integrated neural network model in an experiment of the present invention.

Hereinafter, specific details for implementing the present invention are described with reference to the drawings.

In describing the present invention, identical parts are given identical reference numerals, and repeated descriptions are omitted.

First, examples of the configuration of the overall system for implementing the present invention are described with reference to FIG. 1.

As shown in FIGS. 1(a) and 1(b), the system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models according to the present invention may be implemented as a server system on a network or as a program system on a computer terminal.

As shown in FIG. 1(a), one example of the overall system for implementing the present invention consists of an analysis terminal 10 and an activity prediction system 30, which are connected to each other through a network 20. A database 40 for storing the necessary data may also be provided.

The analysis terminal 10 is an ordinary computing terminal, such as a PC, notebook, netbook, PDA, or mobile device, used by a user such as a drug development researcher. Through the analysis terminal 10, the user sends data such as the three-dimensional structures of the target protein and the compound to the activity prediction system 30, or receives the activity prediction result from the activity prediction system 30.

The activity prediction system 30 is an ordinary server connected to the network 20, and provides a service that supports prediction of compound activity against a target protein using artificial neural networks. The activity prediction system 30 may be implemented as a web server or web application server that provides each service as a web page on the Internet. It may also be implemented as a cloud system that performs the training and analysis functions in the cloud and provides the activity prediction service.

The database 40 is an ordinary storage medium that stores the data required by the activity prediction system 30, including data on the three-dimensional structure of target proteins or compounds and activity data such as binding between target proteins and compounds.

The database 40 may be built by importing data from existing libraries of natural products or chemically synthesized compounds.

Specifically, the database 40 may consist of an individual neural network model 41, an integrated neural network model 42, a secondary training data storage 43 that stores the result data for secondary training, and so on. This configuration of the database 40 is only a preferred embodiment; when a specific system is developed, the database may be given a different structure according to database design theory, in consideration of ease and efficiency of access and search.

The activity prediction system 30 may also be configured as a server-client system consisting of a server and a client. That is, descriptor generation and the main training and analysis functions of the activity prediction system 30 may be built on the server, while the user interface and simple preprocessing for analysis may be built as a client module on the analysis terminal 10. The division of work between server and client can be implemented in various forms according to ordinary server-client design theory.

In addition, the training and prediction functions of the activity prediction system 30 may be built as an engine module, and a client service module installed on the analysis terminal 10 may use the engine module to train the artificial intelligence model with previously collected data and to provide a service that predicts the activity of a compound against a target protein through the trained model. In this case, the analysis terminal 10 may act as another server.

As shown in FIG. 1(b), another example of the overall system for implementing the present invention consists of an activity prediction system 30 in the form of a program installed on a computer terminal 13. That is, each function of the activity prediction system 30 may be implemented as a computer program, installed on the computer terminal 10, and run as a program system on the computer terminal 10. The programs installed on the computer terminal 10 can operate as a single program system 30. The data required by the activity prediction system 30 is stored in and read from storage on the computer terminal 10, such as a hard disk.

As another embodiment, the system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models may be implemented not only as a program running on a general-purpose computer but also as a single electronic circuit such as an ASIC (application-specific integrated circuit). Alternatively, it may be developed as a dedicated computer terminal 10 that handles only compound activity prediction. This is referred to as the activity prediction system 30. Other possible forms may also be implemented.

Next, the configuration of the system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models according to an embodiment of the present invention is described with reference to FIG. 2.

As shown in FIG. 2, the system 30 for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models according to the present invention consists of: a data management module 31 that manages the activity data of compounds with respect to proteins and the three-dimensional data of proteins and compounds; individual neural network modules 33, each of which generates a descriptor from the three-dimensional data, generates primary training data from the descriptor and the activity data, trains an individual neural network model 41, and generates data for secondary training; an individual neural network module group 32 composed of a plurality of the individual neural network modules 33; an integrated neural network module 34 that trains the integrated neural network model 42 with the secondary training data; and an activity prediction module 35 that predicts activity for a protein under test (query protein) and a compound under test (query compound) using the individual neural network models 41 and the integrated neural network model 42.
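How the activity prediction module 35 might chain the two levels of models can be sketched as below. This is an assumption-laden illustration: the source states only that both the individual models and the integrated model are used, so feeding the concatenated individual outputs into the integrated model is inferred from how the secondary training data is built. The function `predict_activity`, the lambda stand-ins, and the averaging integrator are hypothetical and are not neural networks.

```python
def predict_activity(query, individual_modules, integrated_model):
    """Run every individual module on the query, then hand the joined outputs
    to the integrated model (assumed chaining, see lead-in)."""
    features = []
    for extract_descriptor, individual_model in individual_modules:
        descriptor = extract_descriptor(query)
        features.extend(individual_model(descriptor))
    return integrated_model(features)


# Hypothetical stand-ins: each module is a (descriptor extractor, model) pair,
# and each model returns (p_inactive, p_active).
shape_module = (lambda q: q["shape_score"], lambda d: (1.0 - d, d))
structure_module = (lambda q: q["structure_score"], lambda d: (1.0 - d, d))


def averaging_integrator(features):
    """Stand-in integrated model: average the p_active entries and threshold."""
    p_active_values = features[1::2]
    p_active = sum(p_active_values) / len(p_active_values)
    return "active" if p_active >= 0.5 else "inactive"


query = {"shape_score": 0.9, "structure_score": 0.4}
result = predict_activity(query, [shape_module, structure_module], averaging_integrator)
```

In this layout the integrated model never sees raw descriptors, only the probability outputs of the individual models, which mirrors the form of the secondary training data.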

First, the data management module 31 collects the activity data of compounds with respect to proteins, and builds and manages the three-dimensional shape data or three-dimensional structure data of the proteins and compounds.

The activity data indicates whether a specific compound is active or inactive against a specific target protein. In other words, it is data for which the activity of the protein-compound binding is already known.

단백질 또는 화합물에 대한 3차원 형상 데이터 또는 3차원 구조 데이터는 단백질 또는 화합물의 3차원 형상이나 구조를 나타내는 3차원 데이터이다. 이때, 단백질 또는 화합물은 앞서 수집된 활성 데이터에 속하는 타겟 단백질 또는 화합물들이다.The three-dimensional shape data or three-dimensional structure data for a protein or compound is three-dimensional data representing the three-dimensional shape or structure of a protein or compound. In this case, the protein or compound is a target protein or compound belonging to the previously collected activity data.

Next, each individual neural network module (33) has its own neural network model, namely the individual neural network model (41). The module extracts a descriptor from the three-dimensional shape data or three-dimensional structure data of a protein or compound, generates primary training data from the extracted descriptor and the activity data, trains the individual neural network model (41) with one part of the primary training data, and applies another part of the primary training data to the trained individual neural network model (41) to generate result data for secondary training.

In addition, a plurality of individual neural network modules (33) are provided and together form one group, namely the individual neural network module group (32). In particular, a plurality of individual neural network modules (33) may be provided according to the method of extracting the descriptor. That is, the descriptor extraction method is unique to each individual neural network module (33).

Preferably, the individual neural network module group (32) comprises at least one individual neural network module (33) that generates a descriptor from three-dimensional shape data and at least one individual neural network module (33) that generates a descriptor from three-dimensional structure data. By reflecting both the three-dimensional shape and the three-dimensional structure of the proteins and compounds, activity can be predicted more accurately.

Preferably, the descriptor consists of one-dimensional data or two-dimensional image data. That is, the individual neural network module (33) extracts one-dimensional data or two-dimensional image data from the three-dimensional shape data or three-dimensional structure data and uses them as the descriptor.

In addition, the individual neural network module (33) generates the primary training data from the descriptors of the protein and the compound and from the activity data of the compound for the protein. That is, the primary training data are generated by labeling the descriptor of a protein-compound pair with the activity data of that compound for that protein. In other words, the activity data are used as the label.

The primary training data are divided into two groups: one group is used to train the module's own individual neural network model (41), and the other group is used to generate the result data for secondary training.

That is, the individual neural network module (33) trains its own individual neural network model (41) with one part (the first group) of the primary training data. The first group may itself be further divided into training data and validation data (or test data).

In addition, the individual neural network module (33) applies the second group to the trained individual neural network model (41) to generate the result data for secondary training. That is, for each item of primary training data in the second group, the descriptor of that item is applied to the individual neural network model (41) to obtain prediction result data, and the label of that item (the label of its descriptor) is attached to the prediction result data to generate the result data for secondary training.

Information (or identification information) on the protein and the compound of that item of primary training data is also included in the result data for secondary training.

That is, the result data for secondary training consist of information on the protein and the compound (identification information such as IDs), the prediction result data, and the label.
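
The generation of result data for secondary training described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the names `PrimarySample`, `SecondaryResult`, `split_primary`, and `make_secondary_results` are hypothetical, the 50/50 split ratio is an assumption (the description does not state one), and the trained individual model is represented by any callable that maps a descriptor to per-label probabilities.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class PrimarySample:
    protein_id: str
    compound_id: str
    descriptor: Sequence[float]  # descriptor of the protein-compound pair
    label: int                   # 1 = active, 0 = inactive (activity data used as label)


@dataclass
class SecondaryResult:
    protein_id: str              # identification information carried over
    compound_id: str
    prediction: Sequence[float]  # per-label probabilities from the individual model
    label: int                   # label copied from the primary training sample


def split_primary(
    samples: Sequence[PrimarySample], first_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[PrimarySample], List[PrimarySample]]:
    """Split primary training data into the first group (for training the
    individual model) and the second group (for secondary result data).
    The fraction is an assumed value, not taken from the description."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


def make_secondary_results(
    model: Callable[[Sequence[float]], Sequence[float]],
    second_group: Sequence[PrimarySample],
) -> List[SecondaryResult]:
    """Apply the trained individual model to each second-group descriptor and
    attach the identification information and the original label."""
    return [
        SecondaryResult(s.protein_id, s.compound_id, model(s.descriptor), s.label)
        for s in second_group
    ]
```

Keeping the second group out of training means the secondary data reflect how the individual model behaves on pairs it has not seen.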

Next, the individual neural network model (41) is the neural network model provided in the individual neural network module (33); a deep neural network (DNN) or other deep learning model, a recurrent neural network (RNN), a convolutional neural network (CNN), or the like may be applied. When the descriptor is a two-dimensional image, a convolutional neural network (CNN) is preferably used.

The input data of the individual neural network model (41) are the descriptor of the protein and the descriptor of the compound, and the output data are probability values for the label values (category values). That is, the output gives a probability for each label value. The number of prediction values output by each individual neural network model (41), that is, the number of label probabilities, equals the number of category values (labels). Preferably, the label values (or category values) of the output data are active and inactive.

The individual neural network model (41) may take a regression form, which predicts a numerical value, or a classification form, which predicts a specific category. Preferably, the output of the individual neural network model (41) is of the classification form.

In particular, the individual neural network model (41) computes and outputs softmax values, which express the predicted likelihood of each datum (or of each label value representing a class). The softmax function used to obtain the softmax values is as follows.

[Equation 1]

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{k+1} e^{x_j}}$$

Here, x_i is the final output value, and k+1 is the number of categories. e is Euler's number (the base of the natural logarithm).

That is, the result of the softmax function is a one-dimensional vector whose size equals the number of categories (or label values) to be predicted, and it is a kind of probability distribution whose values sum to 1.0.
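
As an illustration of the softmax output described above, the following sketch computes softmax values for a vector of final output values. The function name `softmax` and the example values are illustrative; subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption and does not change the result.

```python
import math
from typing import List, Sequence


def softmax(x: Sequence[float]) -> List[float]:
    """Return e^{x_i} / sum_j e^{x_j} for every final output value x_i."""
    m = max(x)  # shift by the maximum to avoid overflow; the ratios are unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


# Two categories (for example active and inactive): one probability per category.
probs = softmax([2.0, 0.5])
```

The returned list has one entry per category and its entries sum to 1.0, so it can be read as a probability distribution over the label values.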

Next, the integrated neural network module (34) has its own neural network model, namely the integrated neural network model (42). It generates secondary training data by integrating the result data for secondary training generated by each individual neural network module (33), and trains the integrated neural network model (42) with the generated secondary training data.

That is, the integrated neural network module (34) combines the result data for secondary training generated by the individual neural network modules (33), using the protein-compound pair as the key. If the protein-compound pair is the same, the result data for secondary training from the different individual neural network modules are combined into a single item. In particular, for each protein-compound pair, the result data for secondary training from all individual neural network modules are combined to generate the secondary training data.

As described above, the result data of the individual neural network model (41) are output as the result values of the softmax function. Accordingly, the integrated neural network module (34) collects the softmax result values output by each individual neural network model (41) and combines them to form the secondary training data.

The secondary training data also include the identification information of the protein and the compound. The secondary training data therefore consist of the identification information of the protein and the compound, the prediction result values of all individual neural network models (41) for that protein and compound, and the activity data (or label value) of that protein and compound.
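
The pair-wise integration described above can be sketched as a join on the (protein, compound) key. This is an illustrative sketch only: the record layout, the name `merge_by_pair`, and the choice to drop pairs that are missing from any module are assumptions, the last one following from the statement that the results of all individual modules are combined for each pair.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]                         # (protein_id, compound_id)
Record = Tuple[str, str, Sequence[float], int]  # ids, softmax output, label


def merge_by_pair(per_module: List[List[Record]]) -> Dict[Pair, Tuple[List[float], int]]:
    """Combine the secondary result data of every individual module into one
    feature vector per protein-compound pair, in module order."""
    n_modules = len(per_module)
    parts: Dict[Pair, List[Optional[List[float]]]] = {}
    labels: Dict[Pair, int] = {}
    for module_index, records in enumerate(per_module):
        for protein_id, compound_id, probs, label in records:
            pair = (protein_id, compound_id)
            parts.setdefault(pair, [None] * n_modules)[module_index] = list(probs)
            labels[pair] = label
    merged: Dict[Pair, Tuple[List[float], int]] = {}
    for pair, outputs in parts.items():
        # Assumption: keep only pairs for which every module produced a result.
        if all(o is not None for o in outputs):
            features = [v for o in outputs for v in o]
            merged[pair] = (features, labels[pair])
    return merged
```

With this layout, each merged feature vector holds one softmax output per module, so its length is the number of individual modules times the number of labels.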

In addition, the integrated neural network module (34) trains the integrated neural network model (42) using the secondary training data generated by the integration. The secondary training data may themselves be divided into training data and validation data (or test data).

Next, the integrated neural network model (42) is the neural network model provided in the integrated neural network module (34); a deep neural network (DNN) or other deep learning model, a recurrent neural network (RNN), a convolutional neural network (CNN), or the like may be applied. Its descriptor, or input data, is one-dimensional data.

Preferably, the integrated neural network model (42) is constructed using a fully-connected neural network (FNN). The number of elements (number of features) in each item of data input to the integrated neural network model (42) can be obtained as follows.

Number of input data features = (number of individual neural networks) × (number of labels)

Next, the activity prediction module (35) predicts activity for a test protein (or query protein) and a test compound (or query compound) using the individual neural network models (41) and the integrated neural network model (42).

That is, the activity prediction module (35) requests the three-dimensional data of the query protein and the query compound from the data management module (31) and passes the data, or causes the data to be passed, to the individual neural network modules (33).

In addition, the activity prediction module (35) requests the individual neural network module (33) to apply the three-dimensional data of the query protein and the query compound to the trained individual neural network model (41) and make a primary activity prediction. The individual neural network module (33) then generates descriptors from the three-dimensional data and applies the generated descriptors to the individual neural network model (41) to compute the primary prediction result.

Here, the activity prediction module (35) requests a prediction for the query protein and the query compound from all individual neural network modules (33), and passes all prediction results, or causes them to be passed, to the integrated neural network module (34).

In addition, the activity prediction module (35) requests a secondary prediction by applying the integrated neural network model (42) to the primary prediction results for the query protein and the query compound. Once the integrated neural network module (34) has obtained the primary prediction results of all individual neural network models (41), it integrates the primary prediction results and applies the integrated data to the integrated neural network model (42) to output the final secondary prediction.
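
The two-stage prediction flow described above can be sketched as follows. The function `predict_activity` and its argument layout are hypothetical; descriptor extractors and trained models are represented by plain callables, which is an assumption made only to keep the sketch self-contained.

```python
from typing import Callable, List, Sequence, Tuple

# (protein 3D data, compound 3D data) -> descriptor
Extractor = Callable[[object, object], Sequence[float]]
# input features -> per-label probabilities
Model = Callable[[Sequence[float]], List[float]]


def predict_activity(
    protein_3d: object,
    compound_3d: object,
    modules: List[Tuple[Extractor, Model]],
    integrated_model: Model,
) -> List[float]:
    """Run every individual module on the query pair, concatenate the primary
    predictions in module order, and apply the integrated model."""
    first_stage: List[float] = []
    for extract, individual_model in modules:
        descriptor = extract(protein_3d, compound_3d)     # module-specific descriptor
        first_stage.extend(individual_model(descriptor))  # primary prediction
    return integrated_model(first_stage)                  # final secondary prediction
```

The concatenation order must match the order used when the secondary training data were built, otherwise the integrated model would receive its features in the wrong positions.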

Meanwhile, the individual neural network modules (33) and the integrated neural network module (34) may be built as systems independent of one another. In that case, they preferably exchange data through a database (40), such as a data storage (43) for secondary training.

Next, the configuration of the data management module (31) according to an embodiment of the present invention is described in more detail with reference to FIG. 3.

As shown in FIG. 3, the data management module (31) comprises an activity data collection unit (311) that collects activity data and a 3D data forming unit (312) that generates three-dimensional shape data or three-dimensional structure data.

First, the activity data collection unit (311) collects the activity data of each compound for each target protein.

The activity data indicate whether a specific compound C_j is an active compound or an inactive compound with respect to a specific target protein P_i. That is, the activity data consist of { (P_i, C_j, A_ij) }, where A_ij takes an active or inactive value (a binary value).

Alternatively, the activity data may be expressed as an activity value indicating the degree of activity of the compound with respect to the target protein. In this case, the compound may be classified as active or inactive with reference to a predetermined reference value (or threshold).

Active means that the specific compound C_j binds to the target protein P_i, and inactive means that it does not.
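
The conversion of a continuous activity value into the binary value A_ij described above can be sketched as follows. The names `label_from_activity` and `activity_triples` are hypothetical, and both the threshold and the direction of comparison are assumptions, since the description only states that a predetermined reference value is used.

```python
from typing import Dict, List, Tuple


def label_from_activity(value: float, threshold: float, higher_is_active: bool = True) -> int:
    """Return 1 (active) or 0 (inactive) by comparing an activity value with a
    predetermined threshold. The comparison direction depends on the measure
    (an assumption here): for some measures a lower value means more activity."""
    if higher_is_active:
        return 1 if value >= threshold else 0
    return 1 if value <= threshold else 0


def activity_triples(
    measurements: Dict[Tuple[str, str], float], threshold: float, higher_is_active: bool = True
) -> List[Tuple[str, str, int]]:
    """Build the set { (P_i, C_j, A_ij) } from measured activity values."""
    return [
        (protein, compound, label_from_activity(value, threshold, higher_is_active))
        for (protein, compound), value in measurements.items()
    ]
```

For example, `activity_triples({("P1", "C1"): 7.2, ("P1", "C2"): 5.0}, threshold=6.0)` labels the first pair active and the second inactive under the assumed higher-is-active convention.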

Preferably, the activity data collection unit (311) may collect activity data from a previously constructed activity dataset.

As an example, the dataset provided by DUD-E (A Database of Useful Decoys: Enhanced) [Non-Patent Document 1] is used. The DUD-E dataset provides a total of 22,146 active compounds for a total of 102 target proteins (an average of 217 active compounds per target protein) and, in place of inactive compounds, decoy compounds, of which about 5 to 60 were generated for each active compound. Four of the target proteins were removed for reasons such as unsuitability, and the data for the remaining 98 proteins were used in the experiments of the present invention. The DUD-E dataset is intended for use as a benchmark dataset.

A decoy compound is a compound whose structure makes it theoretically likely to be inactive; its use reflects the practical difficulty of collecting data on inactive compounds. That is, decoys are compound data constructed by the designers of the standard dataset so that they can be contrasted with and distinguished from active compounds. In practice, decoy compounds are used in the same way as inactive compounds.

Next, the 3D data forming unit (312) forms three-dimensional shape data or three-dimensional structure data for a protein or compound. Here, the proteins and compounds are the target proteins and compounds that appear in the previously collected activity data.

First, the method of forming three-dimensional shape data is described. The three-dimensional shape data are closed surface data, preferably three-dimensional data of a closed mesh surface.

Preferably, the 3D data forming unit (312) forms the three-dimensional shape of the target protein or compound using the Connolly surface [Non-Patent Document 2]. Specifically, the Connolly surface of the protein or compound is obtained, and three-dimensional shape data are generated from it. That is, as shown in FIG. 4, the three-dimensional shape of each of the protein and the compound is extracted from their three-dimensional data through the Connolly surface generation method. Preferably, what the Connolly surface generation method extracts is the three-dimensional position information of the surface, and this position information is expressed as the three-dimensional coordinates of the vertices of the triangles of a mesh structure. That is, the set of vertices of the mesh structure represents the three-dimensional shape.

The Connolly surface is a surface that represents the range accessible to a solvent, based on the van der Waals radius of each atom constituting the molecule. In other words, the Connolly surface represents the shape of the space occupied by the protein or compound.

Preferably, in the case of a protein, the 3D data forming unit (312) extracts only the surface of the substrate-binding site of the protein and extracts the three-dimensional shape of that surface. Because what is needed is not the overall shape of the protein but only the shape of the substrate-binding site complementary to the compound, the surface of the substrate-binding site alone may be extracted.

That is, with the protein and the compound (ligand) in the bound state, the binding site of the protein that falls within a certain range (a predetermined distance) of the surface element coordinates of the compound is extracted. Three-dimensional shape data are then generated from the extracted binding site while keeping the surface closed.
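
The distance-based selection of the binding site described above can be sketched as follows. This is a minimal illustration under stated assumptions: the name `binding_site_points` and the brute-force distance test are illustrative, the cutoff value is left to the caller because the description does not give one, and the subsequent step of closing the extracted surface is not covered.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def binding_site_points(
    protein_points: Sequence[Point], ligand_points: Sequence[Point], cutoff: float
) -> List[Point]:
    """Keep the protein surface points that lie within `cutoff` of at least one
    ligand surface point (both given in the same bound-state coordinate frame)."""
    return [
        p
        for p in protein_points
        if any(math.dist(p, q) <= cutoff for q in ligand_points)
    ]
```

The brute-force test is quadratic in the number of points; for large surfaces a spatial index would be the usual replacement, but that choice is outside what the description specifies.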

Preferably, the 3D data forming unit (312) converts the extracted surface into a triangular mesh and integrates and converts the face and vertex information to generate the three-dimensional shape.

Next, the method of forming three-dimensional structure data is described. The three-dimensional structure data are three-dimensional data on the positions of the atoms in the chemical structure of a protein or compound, and consist of three-dimensional points.

Preferably, the 3D data forming unit (312) forms the three-dimensional structure data of the chemical bond structure from the MOL file of the protein or compound. Each point of the three-dimensional structure data represents the position of an atom in three dimensions. Also preferably, the atom positions with the protein and the compound in the bound state are extracted as three-dimensional point data.

The MOL file either uses data provided (prepared) in advance or uses data generated through virtual binding. That is, when a crystal structure of the compound or protein has been determined by actual experiment, the position information of each atom of the compound or protein is provided in the form of a MOL file. In this case, the provided MOL file is used.

Alternatively, a chemical binding simulation tool (for example, an autodocking program) is used to virtually bind the compound and the protein, and three-dimensional point data are formed from the resulting virtual three-dimensional chemical structure. In particular, after the virtual binding, the optimal positioning (the optimal binding state) is obtained through a scoring function, and the position information at that point is saved as a MOL file.

In particular, the 3D data forming unit (312) extracts three-dimensional structure data (or three-dimensional point data) for the protein and for the compound, respectively, from the three-dimensional bound structure of the protein and the compound. Preferably, separate three-dimensional structure data are extracted for each type of atom in the protein or compound. More preferably, three-dimensional structure data may be extracted per atom type only for the major atoms, together with three-dimensional point data covering the major atoms and all remaining atoms.

As an example, a protein is composed of four types of atoms: C (carbon), N (nitrogen), O (oxygen), and S (sulfur). Accordingly, four sets of three-dimensional structure data are extracted from the protein, one per atom type. A compound may contain the major atoms C (carbon), N (nitrogen), and O (oxygen) as well as various other atoms. Accordingly, for the compound, three sets of three-dimensional point data are extracted, one for each of the major atoms C, N, and O, and one further set of three-dimensional structure data is extracted covering the major atoms and all other atoms. In total, therefore, eight sets of three-dimensional structure data are extracted.
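
The per-atom-type grouping in the example above can be sketched as follows. The atom tuple layout, the function names, and the key `"ALL"` for the set covering every compound atom are assumptions made for illustration; parsing of the MOL file itself is not shown.

```python
from typing import Dict, List, Sequence, Tuple

Atom = Tuple[str, float, float, float]  # (element symbol, x, y, z)
Point = Tuple[float, float, float]

PROTEIN_ELEMENTS = ("C", "N", "O", "S")  # four atom types named for proteins
COMPOUND_ELEMENTS = ("C", "N", "O")      # major atom types named for compounds


def _points_of(atoms: Sequence[Atom], element: str) -> List[Point]:
    return [(x, y, z) for el, x, y, z in atoms if el == element]


def protein_point_sets(atoms: Sequence[Atom]) -> Dict[str, List[Point]]:
    """One point set per protein atom type: 4 sets."""
    return {el: _points_of(atoms, el) for el in PROTEIN_ELEMENTS}


def compound_point_sets(atoms: Sequence[Atom]) -> Dict[str, List[Point]]:
    """One point set per major atom type plus one set with every atom: 3 + 1 sets."""
    sets = {el: _points_of(atoms, el) for el in COMPOUND_ELEMENTS}
    sets["ALL"] = [(x, y, z) for _, x, y, z in atoms]  # major and all other atoms
    return sets
```

Applied to one protein and one compound, the two functions together return 4 + 4 = 8 point sets, matching the count given in the example.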

Also preferably, when extracting the three-dimensional structure data of a protein, the 3D data forming unit (312) extracts only the positions of atoms near the site of the protein that binds to the compound. That is, the three-dimensional structure data are formed by extracting only the positions of atoms located within a predetermined distance of the site of binding with the compound.

Next, the detailed configuration of the individual neural network module (33) according to an embodiment of the present invention is described more specifically with reference to FIG. 3.

As shown in FIG. 3, the individual neural network module (33) comprises a descriptor extraction unit (331) that extracts a descriptor from the three-dimensional data, an individual model training unit (332) that trains the individual neural network model (41), and a result data generation unit (333) that generates the result data for secondary training.

First, the descriptor extraction unit (331) generates one-dimensional data or two-dimensional image data from the three-dimensional data of a protein or compound. The method of computing the descriptor differs for each individual neural network module (33).

Specifically, there are three methods: a method using the Fourier transform, a method using homology, and a method using geometry image conversion. The Fourier transform method extracts a one-dimensional descriptor from three-dimensional shape data. The homology method generates a two-dimensional image descriptor from three-dimensional structure data. The geometry image conversion method extracts a two-dimensional image descriptor from three-dimensional shape data.

First, the Fourier transform method is described.

The Fourier transform (FT) is a method of expressing a function or signal as a sum of the frequency components that constitute it. The transformed function is a complex-valued function of frequency, and its absolute value represents the amount of each frequency component in the original function.

First, the descriptor extraction unit (331) converts each coordinate of the three-dimensional shape data extracted for the protein or compound into the spherical coordinate system, for use with spherical harmonics.

That is, the coordinates (x,y,z) of the three-dimensional shape data are converted into the coordinates (θ,φ,r) of the spherical coordinate system.
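
The coordinate conversion described above can be sketched as follows. The function name `to_spherical`, the optional `center` argument, and the angle convention (θ as the polar angle measured from the z-axis, φ as the azimuth in the x-y plane) are assumptions; the description does not state which convention or origin is used.

```python
import math
from typing import Tuple


def to_spherical(
    x: float, y: float, z: float, center: Tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> Tuple[float, float, float]:
    """Convert Cartesian (x, y, z) to spherical (theta, phi, r) about `center`.

    theta: polar angle in [0, pi], measured from the +z axis (assumed convention)
    phi:   azimuth in [0, 2*pi), measured from the +x axis in the x-y plane
    r:     distance from the center
    """
    dx, dy, dz = x - center[0], y - center[1], z - center[2]
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the center; return zeros
    cos_theta = max(-1.0, min(1.0, dz / r))  # clamp against rounding error
    theta = math.acos(cos_theta)
    phi = math.atan2(dy, dx) % (2.0 * math.pi)
    return theta, phi, r
```

A surface vertex at (0, 0, 2), for example, maps to θ = 0 and r = 2, and a vertex on the positive y-axis maps to θ = π/2 and φ = π/2 under this convention.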

Next, the descriptor extraction unit (331) applies the Fourier transform to the coordinate data (θ,φ,r) of the spherical coordinate system, whereby the data can be expressed as a sum of basis functions weighted by their coefficients, as in the following equation.

[Equation 2]

$$r = f(\theta,\phi) \approx \sum_{l=0}^{L} \sum_{k=-l}^{l} c_{l,k}\, Y_{l,k}(\theta,\phi)$$

Here, Y_l,k denotes a basis function, and c_l,k denotes the Fourier coefficient for the basis function Y_l,k. l is the degree, k is the order, and L is the maximum degree. The larger L is, the smaller the approximation error.

That is, every spherical coordinate (θ,φ,r) is described by r = f(θ,φ), and when the Fourier transform is applied, r = f(θ,φ) can be expressed as a weighted sum of the basis functions Y_l,k(θ,φ), with the Fourier coefficients as weights.

Preferably, as shown in FIG. 5, the Fourier coefficients c_l,k obtained through the Fourier transform are arranged into the one-dimensional vector (c_0,0, c_1,-1, c_1,0, c_1,1, c_2,-2, c_2,-1, c_2,0, c_2,1, c_2,2, c_3,-3, ..., c_L,L), and this one-dimensional vector is used as the descriptor. Through this conversion, the three-dimensional shape of the protein or compound is converted into a one-dimensional vector of Fourier coefficients.

More preferably, the Fourier coefficients are listed in ascending order of degree (l) first, and then in ascending order of order (k).
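
The ordering of the Fourier coefficients described above can be sketched as follows. The names `coefficient_order` and `flatten_coefficients`, and the dictionary keyed by (l, k), are illustrative assumptions; computing the coefficients themselves is not shown.

```python
from typing import Dict, List, Tuple


def coefficient_order(max_degree: int) -> List[Tuple[int, int]]:
    """Index pairs (l, k) sorted by degree l first, then by order k,
    for l = 0..L and k = -l..l."""
    return [(l, k) for l in range(max_degree + 1) for k in range(-l, l + 1)]


def flatten_coefficients(coeffs: Dict[Tuple[int, int], complex], max_degree: int) -> List[complex]:
    """Arrange the coefficients c_{l,k} into the one-dimensional descriptor vector."""
    return [coeffs[index] for index in coefficient_order(max_degree)]
```

Because each degree l contributes 2l + 1 orders, the resulting vector has (L + 1)^2 entries. The coefficients are in general complex; whether magnitudes or real and imaginary parts are fed to the network is not stated in the description.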

Next, the descriptor extraction method using homology is described.

The homology-based method computes a descriptor by applying homology to the three-dimensional structure data of a protein or compound.

The three-dimensional structure data is three-dimensional data on the atomic or molecular structure of a protein or compound. That is, it is a finite set of points in three-dimensional space, where each point represents the position of an atom.

That is, homology is applied to the three-dimensional structure data to extract a two-dimensional persistence diagram, and the persistence diagram is converted into a two-dimensional image that is used as the descriptor.

Specifically, persistence information is first extracted by applying homology to the three-dimensional structure data. A real-valued variable r is introduced and increased step by step from 0, and the connectivity of the point set changes as r grows.

Whether two points are connected is determined as follows. A three-dimensional sphere of radius r is drawn centered on each point, and the following is evaluated for each value of r.

If the spheres around two points intersect, the two points are considered connected; otherwise they are considered unconnected. When two points are connected, an edge joining them is created.

If three edges are connected to one another and form a triangle, a triangle having those edges as its sides is created.

If four triangles are connected to one another and form a tetrahedron, a solid tetrahedron having those triangles as its faces is created.

Through this process, for each r, a simplicial complex consisting of 0-dimensional simplices (points), 1-dimensional simplices (edges), 2-dimensional simplices (triangles), and 3-dimensional simplices (tetrahedra) is obtained from the given point data.
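The construction above can be illustrated with a brute-force sketch for a single value of r. The function name and the returned dictionary layout are assumptions made for illustration, and spheres that merely touch are treated as intersecting, which the text does not specify. A practical implementation would likely rely on a dedicated topological data analysis library instead of enumerating all vertex subsets.

```python
from itertools import combinations
from math import dist


def build_complex(points, r):
    """Build the simplices (up to tetrahedra) of the complex at radius r.

    Two points are connected when the spheres of radius r around them
    intersect, i.e. when their distance is at most 2 * r.
    """
    n = len(points)
    edges = [
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= 2 * r
    ]
    edge_set = set(edges)

    def all_connected(vertices):
        # True when every pair of the given vertices is joined by an edge.
        return all(pair in edge_set for pair in combinations(vertices, 2))

    triangles = [t for t in combinations(range(n), 3) if all_connected(t)]
    tetrahedra = [t for t in combinations(range(n), 4) if all_connected(t)]
    return {
        0: [(i,) for i in range(n)],
        1: edges,
        2: triangles,
        3: tetrahedra,
    }


# Four hypothetical atom positions.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
small = build_complex(atoms, 0.6)   # only the three edges of length 1
large = build_complex(atoms, 0.75)  # every pair connected
```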

That is, the homology H_r(M) of the complex M at the given r is computed.

Next, a homology group is generated for each dimension. That is, a 0-dimensional homology group, a 1-dimensional homology group, and a 2-dimensional homology group are generated.

Then the number of generators of the homology group of each dimension is obtained. A homology group is a group generated by a finite number of generators. Generators of a homology group are created or destroyed depending on how the simplices of the corresponding dimension are arranged.

A homology group consists of the simplices that form cycles in the corresponding dimension and therefore have no boundary. In dimension 0, each point forms a cycle, so the 0-dimensional homology group is made up of the points. The 1-dimensional homology group is made up of 1-dimensional simplices (edges) that form a ring. The 2-dimensional homology group is made up of 2-dimensional simplices (triangles) joined into a closed surface enclosing an interior space. Each generator corresponds to one cycle in its dimension, and the number of generators is the number of cycles in that dimension.

Meanwhile, the 1-dimensional (lower-dimensional) simplices (edges) that bound a 2-dimensional (higher-dimensional) simplex (triangle) also form a ring. However, a homology group is formed as a quotient group by the set of lower-dimensional simplices that make up higher-dimensional simplices, so those lower-dimensional simplices are identified with one another (made congruent). Therefore, even though the edges bounding a triangle form a ring, that ring is excluded from the 1-dimensional homology group.

For each new generator, the value of r at which it appears is recorded as its birth time (r_S), and the value of r at which it disappears is recorded as its death time (r_E).

Collecting all of these records and plotting them gives the persistence diagram. The data needed to draw the persistence diagram (the persistence information) is expressed as follows.

Persistence information = { (d, (r_S, r_E)) }

Here, d is the dimension of the generator, and r_S and r_E are its birth time and death time, respectively. That is, as r is increased step by step, the generator is created when r = r_S and destroyed when r = r_E.

As an example, the persistence information may be obtained as follows.

Persistence information = {(0, (0, 0.8)), …, (1, (0.4, 0.9))}

That is, the entry (0, (0, 0.8)) indicates that the generator is 0-dimensional, was created at r = 0, and was destroyed at r = 0.8.
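As a minimal sketch of how such entries might arise, the 0-dimensional part of the persistence information can be computed with a union-find structure: every point is born at r = 0, and one connected component disappears each time two components merge. The function name is hypothetical, the merge radius is taken as half the distance between the points (the radius at which their spheres first touch), and the component that never merges is given an infinite death time; these are assumptions, and higher dimensions are not covered.

```python
from itertools import combinations
from math import dist, inf


def zero_dim_persistence(points):
    """Return entries (0, (birth, death)) for the connected components."""
    if not points:
        return []
    parent = list(range(len(points)))

    def find(i):
        # Find the representative of i, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate merge events: spheres of radius r touch when distance = 2r.
    events = sorted(
        (dist(p, q) / 2, i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
    )
    info = []
    for r, i, j in events:
        a, b = find(i), find(j)
        if a != b:
            parent[a] = b
            info.append((0, (0.0, r)))   # one component dies at radius r
    info.append((0, (0.0, inf)))         # the last component never dies
    return info


# Three hypothetical atoms on a line.
info = zero_dim_persistence([(0.0, 0.0, 0.0), (1.6, 0.0, 0.0), (5.0, 0.0, 0.0)])
```

For these three points the first two merge at r = 0.8, which reproduces an entry of the same form as (0, (0, 0.8)) in the example above.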

Next, the persistence information is plotted as a persistence diagram.

The x-axis and y-axis of the persistence diagram represent the birth time and the death time, respectively. Each entry of the persistence information is plotted as a point on the diagram. The position (x,y) of the point is (r_S,r_E), and the value of the point is the dimension.

Preferably, the dimension, which is the value of the point, is shown by color. For example, red denotes a 0-dimensional generator, green a 1-dimensional generator, and blue a 2-dimensional generator.

An example of a persistence diagram is shown in FIG. 6.

Next, a two-dimensional homology image is generated from the persistence diagram and used as the descriptor.

Preferably, Gaussian mapping is applied to the persistence diagram so that each location receives a value that varies with the density of points around it.
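One simple way to realize such a Gaussian mapping is to place a Gaussian bump on every (birth, death) point and sample the sum on a pixel grid. The sketch below assumes a square grid, a common axis range [0, r_max], and an isotropic Gaussian of width sigma; the function name and all parameter values are illustrative choices, not values given in the text.

```python
from math import exp


def persistence_image(pairs, size=8, r_max=1.0, sigma=0.1):
    """Rasterize (birth, death) pairs into a size x size density image.

    Columns follow the birth axis and rows follow the death axis; each
    pixel holds the sum of Gaussian contributions from all points.
    """
    step = r_max / size
    image = []
    for row in range(size):
        death_center = (row + 0.5) * step
        line = []
        for col in range(size):
            birth_center = (col + 0.5) * step
            value = sum(
                exp(-((birth_center - b) ** 2 + (death_center - d) ** 2)
                    / (2 * sigma ** 2))
                for b, d in pairs
            )
            line.append(value)
        image.append(line)
    return image


single = persistence_image([(0.4, 0.9)])
double = persistence_image([(0.4, 0.9), (0.4, 0.9)])
```

Regions where many points cluster accumulate larger values, so the resulting image encodes point density. One image per homology dimension could be produced by filtering the pairs by dimension first.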

An example of a homology image is shown in FIG. 7.

Next, the descriptor extraction method using geometric image transformation is described.

The geometric image transformation method generates a two-dimensional geometric image from the three-dimensional shape data of a protein or compound and extracts the two-dimensional geometric image as the descriptor.

As shown in FIG. 8, the descriptor extraction unit (331) performs a spherical parameterization step (S31) of mapping the three-dimensional shape data onto a spherical surface; an authalic parameterization step (S32) of correcting the shape data on the sphere through authalic mapping; an octahedron parameterization step (S33) of mapping the corrected shape data on the sphere onto the surface of an octahedron; a square parameterization step (S34) of mapping the shape data on the octahedron onto a two-dimensional rectangle; and a step (S35) of generating a two-dimensional geometric descriptor, thereby producing a two-dimensional geometric image as the descriptor.

FIG. 9 illustrates how the final 2D geometric image is produced using the vertex information mapped onto the spherical mesh.

First, the three-dimensional shape data is mapped onto a spherical surface using spherical parameterization (S31).

Here, the three-dimensional shape data is three-dimensional data on the shape or surface of the protein or compound, generated by the 3D data generation unit (312). Specifically, it is surface data represented as a triangular mesh and consists of the vertex data of each triangle. That is, the three-dimensional shape data consists of the three-dimensional coordinates of the vertices.

Accordingly, each vertex of the three-dimensional shape data is mapped onto the spherical surface.

In the example of FIG. 9, this corresponds to mapping from (a) the three-dimensional shape data to (b) the data on the sphere.

Next, the shape data on the sphere is corrected through authalic parameterization (S32). That is, the data on the sphere is corrected so that the area of each mesh element in the three-dimensional shape data is preserved in the area of the corresponding mesh element on the spherical surface. In other words, the correction minimizes the area distortion ratio.

That is, the vertex information of the original mesh of the three-dimensional shape data mapped onto the sphere is remapped onto the spherical mesh so that the area distortion ratio is minimized.

Next, the corrected shape data on the sphere is mapped onto the surface of a regular octahedron (S33). That is, through octahedron parameterization, the shape data (or mesh) on the sphere is converted into the octahedral domain.

Preferably, the octahedron is a regular octahedron.

In the example of FIG. 9, this corresponds to mapping from (b) the data on the sphere to (c) the data on the octahedron.

Next, the shape data on the octahedron is mapped onto a two-dimensional rectangle (S34).

That is, the converted octahedron is flattened and mapped onto a two-dimensional rectangle. Preferably, the two-dimensional rectangle is an N×N square. For example, N is set to 128.

Meanwhile, the eight triangular faces of the octahedron are combined to form the two-dimensional geometric image. Preferably, the eight triangular faces are converted into right isosceles triangles, and these right isosceles triangles are combined to form a rectangle or square.

In particular, to prevent discontinuities at the edges and corners of the 2D (rectangular) image, a vertically rotated copy of each triangle of the octahedral domain is attached to it, so that each forms a square.

The example of FIG. 9 shows the mapping from (c) the data on the octahedron to (d) the data on the square.
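Steps S33 and S34 can be illustrated with the widely used octahedral unfolding, in which a point on the unit sphere is projected onto the octahedron |x| + |y| + |z| = 1 and the lower four faces are folded outward into the corners of a square. This is a sketch of one common layout, not necessarily the one used in the described system; the function names are hypothetical, and the rotated copies used to avoid edge discontinuities are omitted.

```python
from math import copysign


def octahedral_uv(x, y, z):
    """Map a direction (x, y, z) to (u, v) in [-1, 1] x [-1, 1].

    The upper half (z >= 0) fills the inner diamond of the square and
    the lower half (z < 0) is folded out into the four corners.
    """
    norm = abs(x) + abs(y) + abs(z)
    px, py = x / norm, y / norm          # projection onto the octahedron
    if z < 0:
        # Fold each lower face over the diamond edge into its corner.
        px, py = ((1 - abs(py)) * copysign(1.0, px),
                  (1 - abs(px)) * copysign(1.0, py))
    return px, py


def uv_to_pixel(u, v, n=128):
    """Convert (u, v) in [-1, 1]^2 to (row, col) in an n x n image."""
    col = min(n - 1, int((u + 1) / 2 * n))
    row = min(n - 1, int((v + 1) / 2 * n))
    return row, col


north = octahedral_uv(0.0, 0.0, 1.0)
south = octahedral_uv(0.0, 0.0, -1.0)
equator = octahedral_uv(1.0, 0.0, 0.0)
```

The eight faces of the octahedron become eight right isosceles triangles that tile the square, which matches the arrangement described above for N = 128.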

Next, a descriptor in the form of a two-dimensional geometric image is generated for the mapped two-dimensional rectangle (S35).

Through the preceding steps, the three-dimensional shape data, the mapped data on the sphere, the corrected mapped data on the sphere, the mapped data on the octahedron, and the mapped data on the two-dimensional rectangle are in one-to-one correspondence with one another. Accordingly, the three-dimensional shape data and the two-dimensional rectangle are mapped to each other, and in particular each vertex of the three-dimensional shape data is mapped to a corresponding vertex in the two-dimensional rectangle.

The value of each pixel of the two-dimensional image (or rectangle) is set to the feature value (geometric feature value) of the three-dimensional shape at the position in the three-dimensional shape data to which that pixel maps. That is, a pixel position (x',y') in the two-dimensional image (or two-dimensional rectangle) corresponds to a point (x,y,z) of the three-dimensional shape data. For pixels that do not coincide with a vertex, the corresponding position can be found by interpolation or a similar method.

Preferably, the geometric feature values of the three-dimensional shape data are one or more of the x-coordinate, the y-coordinate, the z-coordinate, and the principal curvatures (maximum curvature and minimum curvature). Preferably, all five geometric feature values are used.

Meanwhile, each geometric feature value may be normalized to the numerical range of a pixel (0 to 255).

Accordingly, a two-dimensional geometric image whose pixel values are a given geometric feature value is generated for each geometric feature value. Preferably, K (for example, five) two-dimensional geometric images are generated in total.
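The normalization and the per-feature images can be sketched as follows. Min-max scaling is an assumption, since the text only states that values are normalized to the range 0 to 255, and the function names and feature keys are hypothetical.

```python
def normalize_image(feature_map):
    """Scale a 2-D grid of feature values linearly to integers 0..255."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0   # constant maps become 0
    return [[round((v - lo) * scale) for v in row] for row in feature_map]


def build_geometric_images(feature_maps):
    """Return one normalized image per geometric feature (K images)."""
    return {name: normalize_image(fmap) for name, fmap in feature_maps.items()}


# Tiny 2 x 2 stand-ins for the N x N feature grids (values are made up).
images = build_geometric_images({
    "x": [[0.0, 1.0], [4.0, 3.0]],
    "max_curvature": [[2.0, 2.0], [2.0, 2.0]],
})
```

With the five features named in the text (x, y, z, maximum curvature, minimum curvature), the dictionary would hold K = 5 images of size N × N.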

Next, the individual model learning unit (332) generates primary training data from the extracted descriptors of the protein and the compound and the activity data of the compound for the protein, and trains its own individual neural network model (41) with part of the primary training data.

In particular, when the descriptor of the protein and the descriptor of the compound are constructed separately, the individual model learning unit (332) combines, for each protein-compound pair, the descriptor of the protein and the descriptor of the compound to form a full descriptor.

For example, in the Fourier method, the full descriptor is a one-dimensional vector with 2n elements obtained by concatenating two vectors of n elements each, one for the target protein and one for the corresponding compound.

In the homology method, the full descriptor is an image combining the homology image of the target protein and the homology image of the corresponding compound.

In the geometric transformation method, the full descriptor consists of K two-dimensional geometric images whose pixel values are the (normalized) geometric feature values of the three-dimensional shape data, with the images of the protein and of the compound combined. K is the number of kinds of geometric feature values.

Hereinafter, for convenience of description, the terms full descriptor and descriptor are used interchangeably.

The individual model learning unit (332) also generates the primary training data by labeling the descriptor (or full descriptor) of each protein-compound pair with the activity data of that compound for that protein. That is, the activity data is used as the label.

The individual model learning unit (332) divides the primary training data into two groups. One group (the first group) is used to train its own individual neural network model (41), and the other group (the second group) is used to generate the result data for secondary learning.

The individual model learning unit (332) trains its own individual neural network model (41) with part of the primary training data (the first group). The first group may itself be divided into training data, validation data (or test data), and so on.
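The two-group split might look like the sketch below. The split ratio, the random shuffling, and the function name are assumptions; the text only states that the primary training data is divided into two groups.

```python
import random


def split_primary_data(records, first_fraction=0.5, seed=0):
    """Split the primary training data into two disjoint groups.

    The first group trains the individual model; the second group is
    later passed through the trained model to produce the result data
    for secondary learning.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


first_group, second_group = split_primary_data(list(range(10)))
```

Keeping the second group out of training means the predictions fed to the integrated model come from data the individual model has not seen.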

Next, the result data generation unit (333) applies the second group to the trained individual neural network model (41) to generate the result data for secondary learning. That is, the descriptors in the second group of the primary training data are applied to the individual neural network model (41) to obtain prediction result data, and each prediction result is assigned the label of the corresponding primary training data (the label of that descriptor) to generate the result data for secondary learning.

That is, the result data for secondary learning consists of information on the protein and the compound (identification information such as IDs), the prediction result data, and the label. To identify each prediction result, each record must carry a protein ID and a compound ID that serve as its data identification number.

In particular, the individual neural network model (41) computes and outputs softmax values, which express the predicted likelihood of each class (label value) for each data item. The prediction result data therefore consists of probability values (softmax values) for active and inactive.
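For reference, the softmax function converts the raw scores of the final layer into probabilities that sum to 1. The sketch below is a generic implementation; the input scores are made up, and which index corresponds to active or inactive is not specified in the text.

```python
from math import exp


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the max for stability
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 0.5])   # two classes, e.g. the two label values
```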

Next, the detailed configuration of the integrated neural network module (34) according to an embodiment of the present invention is described in more detail with reference to FIG. 3.

As shown in FIG. 3, the integrated neural network module (34) comprises a data integration unit (341) that generates secondary training data by integrating the result data for secondary learning generated by each individual neural network module (33), and an integrated model learning unit (342) that trains the integrated neural network model (42) with the secondary training data.

First, the data integration unit (341) combines the generated result data for secondary learning on the basis of protein-compound pairs. That is, when the protein-compound pair is the same, the result data for secondary learning from the different individual neural network modules are combined into a single record. In particular, for each protein-compound pair, the result data for secondary learning of all individual neural network modules are combined to generate the secondary training data. Because the records are joined by protein-compound pair (or its identification information), the order of the data output by each individual neural network module (33) does not need to match.

Preferably, the data integration unit (341) collects the softmax outputs of each individual neural network model (41) and combines them to form the secondary training data. That is, once the softmax values output by each individual neural network model (41) have been collected, they are merged on the data identification number (protein ID, compound ID, and so on).

The secondary training data also includes the identification information of the protein and of the compound.

The secondary training data therefore consists of the identification information of the protein and the compound, the prediction results of all individual neural network models (41) for that protein and compound, and the activity data (or label value) of that protein and compound. In particular, the prediction results are softmax outputs.
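The integration step amounts to a join on the (protein ID, compound ID) pair. The sketch below assumes each module emits a list of records with the hypothetical keys protein_id, compound_id, probs, and label, and keeps only the pairs for which every module produced a result; that last choice is one interpretation of combining the results of all individual modules.

```python
def integrate_results(per_module_results):
    """Join the per-module softmax outputs on (protein_id, compound_id)."""
    merged = {}
    for module_index, results in enumerate(per_module_results):
        for rec in results:
            key = (rec["protein_id"], rec["compound_id"])
            entry = merged.setdefault(key, {"probs": {}, "label": rec["label"]})
            entry["probs"][module_index] = rec["probs"]

    n_modules = len(per_module_results)
    rows = []
    for (protein_id, compound_id), entry in sorted(merged.items()):
        if len(entry["probs"]) != n_modules:
            continue   # skip pairs that some module did not predict
        features = [p for m in range(n_modules) for p in entry["probs"][m]]
        rows.append({
            "protein_id": protein_id,
            "compound_id": compound_id,
            "features": features,   # concatenated softmax values
            "label": entry["label"],
        })
    return rows


# Made-up outputs from two modules, deliberately listed in different orders.
module_a = [
    {"protein_id": "P1", "compound_id": "C1", "probs": [0.9, 0.1], "label": 1},
    {"protein_id": "P1", "compound_id": "C2", "probs": [0.2, 0.8], "label": 0},
]
module_b = [
    {"protein_id": "P1", "compound_id": "C2", "probs": [0.4, 0.6], "label": 0},
    {"protein_id": "P1", "compound_id": "C1", "probs": [0.7, 0.3], "label": 1},
]
secondary = integrate_results([module_a, module_b])
```

Because the join is keyed on the identifiers, the modules can write their outputs in any order and at different times, which is the property the description relies on.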

An example of the secondary training data is shown in the table of FIG. 10. FIG. 10 illustrates a case with two label categories and three individual neural network models (41). The number of label categories and the number of individual neural network models (41) may be varied.

Next, the integrated model learning unit (342) trains the integrated neural network model (42) using the secondary training data generated by the integration. The secondary training data may itself be divided into training data, validation data (or test data), and so on.

Next, the effects of the present invention are described with reference to FIGS. 11 to 13.

As described above, the integrated neural network module (34) only needs to collect the softmax values output by the individual neural network modules (33) independently and then integrate the collected data. Therefore, even if the development schedules of the individual neural network modules (33) differ, the integrated neural network model (42) can be trained as each individual neural network module (33) produces its data. In addition, once the individual data has been obtained, the integrated neural network model (42) can be trained separately from the individual neural network models (41) or individual neural network modules (33).

The activity prediction system according to the present invention provides the following effects.

That is, the overall development period and the training time of the individual artificial neural networks can be shortened.

In addition, the artificial intelligence server system can be used more efficiently. That is, the individual artificial neural networks need not be trained simultaneously and can instead be trained whenever the system has time available. Also, since the individual artificial neural networks need not be loaded into memory at the same time, system availability improves.

In addition, as shown in FIGS. 11 to 13, it was confirmed that the ability to discriminate the inactive class can be improved (the example shown applies two individual artificial intelligence models).

The present invention implements a method that improves predictive performance by receiving the outputs of a plurality of trained individual artificial neural networks, integrating the output files, and building a separate integrated artificial neural network (ensemble). The present invention is not limited to descriptor techniques that represent the three-dimensional structure of proteins and compounds, and is applicable to any field in which classification-type artificial neural networks are integrated.

The invention made by the present inventors has been described above in detail with reference to embodiments, but the present invention is not limited to these embodiments and may of course be modified in various ways without departing from its gist.

10: analysis terminal 20: network
30: activity prediction system 31: data management module
311: activity data collection unit 312: 3D data generation unit
32: individual neural network module group 33: individual neural network module
331: descriptor extraction unit 332: individual model learning unit
333: result data generation unit
34: integrated neural network module 341: data integration unit
342: integrated model learning unit 35: activity prediction module
40: database 41: individual neural network model
42: integrated neural network model 43: data storage for secondary learning

Claims

A system for predicting the activity of a protein-binding compound based on a plurality of artificial neural network models, the system comprising:
a data management module that stores activity data of a compound for a protein, and three-dimensional shape data or three-dimensional structure data of the protein and the compound;
an individual neural network module that has its own individual neural network model, extracts a descriptor from the three-dimensional shape data or three-dimensional structure data of the protein or the compound, trains the individual neural network model with the extracted descriptor and the activity data, and stores result data for secondary learning output by applying data to the trained individual neural network model;
an integrated neural network module that has its own integrated neural network model, generates secondary training data by integrating the result data for secondary learning generated by each individual neural network module, and trains the integrated neural network model with the generated secondary training data; and
an activity prediction module that predicts activity for a query protein and a query compound using the individual neural network model and the integrated neural network model,
wherein at least two individual neural network modules are provided and together form one individual neural network module group.

The system of claim 1,
wherein the individual neural network module generates primary training data from the extracted descriptor and the activity data, trains the individual neural network model with part of the primary training data, and generates the result data for secondary learning by applying another part of the primary training data to the trained individual neural network model.

The system of claim 1,
wherein the primary training data includes identification information of the protein and the compound, descriptors of the protein and the compound, and a label value determined from the activity data of the compound for the protein, and
the individual neural network module, for each data item of the primary training data, obtains prediction result data by applying the descriptor of that data item to the individual neural network model, generates the result data for secondary learning by assigning the label of that data item to the prediction result data, and includes the identification information of the protein and the compound in the result data for secondary learning.

The system of claim 1,
wherein the output data of the individual neural network model is output as probability values of the label values, and
the integrated neural network module combines the result data for secondary learning on the basis of protein-compound pairs such that, when the protein-compound pair matches, the result data for secondary learning of all individual neural network modules are combined into one record to generate the secondary training data.

The system of claim 1,
wherein the individual neural network module group comprises at least one individual neural network module that generates a descriptor from three-dimensional shape data and at least one individual neural network module that generates a descriptor from three-dimensional structure data.

The system of claim 1,
wherein the individual neural network module generates a descriptor by extracting one-dimensional data from the three-dimensional shape data using a Fourier transform, generates a descriptor by extracting a two-dimensional image from the three-dimensional structure data using homology, or generates a descriptor by projecting the three-dimensional shape data onto a two-dimensional image.

The system of claim 1,
wherein the three-dimensional shape data is closed-surface data consisting of three-dimensional data on a closed mesh surface, and the three-dimensional structure data consists of three-dimensional data on atomic positions in the chemical structure of the protein or the compound.