KR102576033B1

KR102576033B1 - Protein-ligand binding affinity prediction using ensemble of 3d convolutional neural network and system therefor

Info

Publication number: KR102576033B1
Application number: KR1020200089089A
Authority: KR
Inventors: 이주용; 신웅희; 권용범; 고준수
Original assignee: 주식회사 아론티어
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2023-09-11
Also published as: KR20220010327A

Abstract

Accurately predicting the binding affinity of protein-ligand complexes is essential for the efficiency and success of rational drug design. The embodiments provide a new neural network model for predicting the binding affinity of protein-ligand complexes. The model predicts the binding affinity of a complex by ensembling multiple independently trained networks, each composed of multiple channels of 3D convolutional neural network layers.

Description

Protein-ligand binding affinity prediction method using an ensemble of 3D-convolutional neural networks and a system for the same {PROTEIN-LIGAND BINDING AFFINITY PREDICTION USING ENSEMBLE OF 3D CONVOLUTIONAL NEURAL NETWORK AND SYSTEM THEREFOR}

The present invention provides a method for accurately predicting the binding affinity of a protein-ligand complex.

Predicting the binding affinity of protein-ligand complexes plays a very important role in drug design and drug discovery. In general, a molecule must bind strongly to its target protein to become a lead molecule for drug discovery. However, experimental measurement of protein-ligand binding affinity is difficult and time-consuming, and it is one of the major bottlenecks of the drug discovery process. If the affinity between a target protein and a specific ligand could be predicted quickly and accurately, the efficiency of virtual screening could be improved significantly. Various computational methods for predicting binding affinity are therefore being studied to optimize the drug discovery process.

In general, computational methods for predicting binding affinity fall into three categories: 1) physics-based, 2) empirical (based on experimental data), and 3) knowledge-based (based on structural data) methods. The first category consists of theoretically rigorous binding free energy calculations, mainly based on force field models. The main advantage of physics-based methods is that, with state-of-the-art force field models, the binding free energy of arbitrary small-molecule ligands to a protein can be predicted with an average error of approximately 1 kcal/mol. However, rigorous free energy calculations require a significant amount of computational resources. With state-of-the-art molecular dynamics (MD) code running on a graphics processing unit, the binding free energies of one or two ligands can be calculated, depending on the sizes of the ligand and the protein. This computational burden is an obstacle to the use of MD, and in particular to free energy calculations for drug-like molecules, which require high-throughput screening.

Empirical scoring functions have been widely used in many protein-ligand docking programs and virtual screening processes. They approximate the protein-ligand interaction with an equation composed of various physics-based terms, such as van der Waals interactions, solvation free energy, and electrostatic interactions. The parameters of these terms are generally fitted to experimental data so as to reproduce measured binding affinity values. Because of their computational simplicity and their close relationship to physics-based interactions, empirical scoring functions are still being actively developed.

Recently, deep learning methods have enabled more accurate data-driven predictions in various scientific fields, and many deep learning-based methods have been proposed for predicting protein-ligand binding affinity. For example, [Ragoza, M.; Hochuli, J.; Idrobo, E.; Sunseri, J.; Koes, D. R. Protein-Ligand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model. 2017, 57 (4), pp 942-957] proposed a small network consisting of three sequential 3D convolutional neural network (3D-CNN) layers with pooling layers. Similarly, [Stepniewska-Dziubinska, M. M.; Zielenkiewicz, P.; Siedlecki, P. Development and Evaluation of a Deep Learning Model for Protein-ligand Binding Affinity Prediction. Bioinformatics. 2018, pp 3666-3674] developed a binding affinity prediction model consisting of three consecutive 3D-CNN layers followed by three dense layers. [Jimenez, J.; Skalic, M.; Martínez-Rosell, G.; De Fabritiis, G. KDEEP: Protein-Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. Journal of Chemical Information and Modeling. 2018, pp 287-296] developed a binding affinity prediction model, K_deep, based on the SqueezeNet architecture originally designed for image classification. The K_deep model consists of multiple 3D-CNNs with approximately 1.3 million parameters. [Zhang, H.; Liao, L.; Saravanan, K. M.; Yin, P.; Wei, Y. DeepBindRG: A Deep Learning Based Method for Estimating Effective Protein-Ligand Affinity. PeerJ 2019, 7, e7362] developed the DeepBindRG model, which uses a 2D representation of the protein-ligand interface and the ResNet architecture. Similarly, [Zheng, L.; Fan, J.; Mu, Y. OnionNet: A Multiple-Layer Intermolecular-Contact-Based Convolutional Neural Network for Protein-Ligand Binding Affinity Prediction. ACS Omega 2019, 4 (14), 15956-15965] converted the protein-ligand binding structure into a single-channel 2D tensor and processed it with three 2D-CNN layers and four dense layers.

Predicting protein-ligand binding affinity requires a model with high accuracy and low computational cost. To this end, deep learning techniques from various fields may be applied, and in particular a network architecture optimized for predicting the binding affinity of protein-ligand complexes is required. In addition, a new scoring function is needed to evaluate new network architectures, as is a method for identifying which physical properties matter most in determining binding affinity.

One embodiment of the present invention provides a method of predicting protein-ligand binding affinity through a prediction system including a control unit and a memory unit, the method comprising: storing, in the memory unit, the structure of a protein-ligand complex as three-dimensional information in which the center of mass of the ligand in the protein-ligand complex is set as the origin and the surrounding atomic environment is embedded in a 3D grid; storing, in the memory unit, a protein-ligand binding affinity value corresponding to the three-dimensional information; training a prediction model through the control unit, wherein the prediction model takes the stored three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as an output value, and the prediction model includes one or more independently processed 3D convolutional neural networks; and generating a predicted protein-ligand binding affinity value using the prediction model, wherein each of the 3D convolutional neural networks generates a predicted protein-ligand binding affinity value for a new protein-ligand complex, and a final predicted value is generated based on the respective predicted values.

Additionally, the final predicted value generated by the prediction model may be the average of the predicted protein-ligand binding affinity values generated by the respective 3D convolutional neural networks.

Additionally, a protein-ligand binding affinity prediction method is provided in which the three-dimensional information includes a pattern of protein-ligand interactions calculated through the density function n(r) given as Equation 1.

Here, n(r) is the atomic number density, r_VDW is the van der Waals radius of the atom, and r is the distance from the atom to the center of the grid box.

Additionally, an embodiment of the present invention may further include classifying the three-dimensional information into a plurality of classes and representing them as different channels of the input value.

Additionally, an embodiment of the present invention may further include adding input values by rotating the three-dimensional information through 24 rotation operations.

Additionally, in one embodiment of the present invention, the 3D convolutional neural network may include one or more stacked ensemble-based residual block layers, and each residual block layer may include one or more stacked convolutional layers combined with batch normalization and ReLU activation layers.

Additionally, one or more 3D convolutional neural network layers processed in parallel may be included in the middle of the one or more stacked convolutional layers.

Additionally, a protein-ligand binding affinity prediction method is provided in which the outputs of the one or more 3D convolutional neural network layers processed in parallel are concatenated and input to the residual block layer.

Additionally, in one embodiment of the present invention, the number of the one or more 3D convolutional neural networks may be 5 or more and 25 or less.

Another embodiment of the present invention provides a protein-ligand binding affinity prediction system comprising: a communication unit capable of communicating with an external server; an input/output interface unit that controls an input unit and a display unit; a memory unit for storing data; and a control unit for executing a protein-ligand binding affinity prediction model, wherein the memory unit contains three-dimensional information, in which the structure of a protein-ligand complex is represented by setting the center of mass of the ligand in the protein-ligand complex as the origin and embedding the surrounding atomic environment in a 3D grid, and a protein-ligand binding affinity value corresponding to the three-dimensional information; the control unit trains the prediction model by independently processing one or more 3D convolutional neural networks, taking the three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as an output value; and the control unit generates, through each of the 3D convolutional neural networks, a predicted protein-ligand binding affinity value for a new protein-ligand complex and generates a final predicted value based on the respective predicted values.

Additionally, a protein-ligand binding affinity prediction system is provided in which the three-dimensional information includes a pattern of protein-ligand interactions calculated through the density function n(r) given as Equation 1.

Additionally, the control unit may classify the three-dimensional information into a plurality of classes and represent them as different channels of the input value.

Additionally, the control unit may add input values by rotating the three-dimensional information through 24 rotation operations.

Additionally, the 3D convolutional neural network may include one or more stacked ensemble-based residual block layers, and each residual block layer may include one or more stacked convolutional layers combined with batch normalization and ReLU activation layers.

Additionally, a protein-ligand binding affinity prediction system is provided in which the outputs of the one or more 3D convolutional neural network layers processed in parallel are concatenated and input to the residual block layer.

Additionally, in another embodiment of the present invention, the number of the 3D convolutional neural networks may be 5 or more and 25 or less.

One of the various embodiments of the present specification discloses a new protein-ligand binding affinity prediction model derived from the ResNeXt architecture, which was originally developed for accurate image classification. Beyond using a network architecture that is new compared with existing models, prediction quality can be improved significantly through an ensemble-based approach that uses multiple predictors instead of a single one, for example by taking the average of the values produced by the multiple predictors. The ensemble approach requires no additional modification of the network architecture and can easily be applied to most existing models. Benchmark results on the CASF-2016 data set show that the performance of the present invention is comparable to that of the best existing scoring functions. In addition, relative feature importance can be analyzed to identify the physical properties that are most essential in determining binding affinity.

Figure 1 is a diagram schematically showing a prediction system according to an embodiment of the present invention.
Figure 2 is a schematic diagram of the overall network structure of an embodiment of the present invention.
Figure 3 is a schematic diagram showing the structure of each residual block.
Figure 4 shows prediction quality versus the number of networks.
Figure 5 shows benchmark results for existing protein-ligand binding affinity scoring functions and AK-score.
Figure 6 compares experimental binding affinities with the predicted values obtained from AK-score-ensemble, X-score, and AutoDock Vina.
Figure 7 shows the results of the feature importance calculation.

The present invention will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used herein, "comprises" or "comprising" does not exclude the presence or addition of one or more components or steps other than those mentioned. Terms such as first and second may be used to describe various components, but the components should not be limited by these terms, which are used only to distinguish one component from another.

Hereinafter, various embodiments of the present invention are described by way of example with reference to the drawings.

Figure 1 is a block diagram showing an example of the configuration of a prediction system according to an embodiment, conceptually showing the parts related to the embodiment. All components may be provided in a single device that performs the processing on its own, but the invention is not limited to this; the components may also be connected through a network and run on separate devices.

The external server 20 may be connected to the prediction system 10 through a network and may provide information such as protein structure information, ligand structure information, protein-ligand complex structure information, binding affinity information for protein-ligand complex binding sites, genetic information, intermolecular interaction information, and/or protein structure similarity information. For example, the external server 20 may be a database used for the prediction processing of the prediction system 10, or a server that provides such a database.

The prediction system 10 may include a control unit 11, a communication unit 12, an input/output interface unit 13, and a memory unit 14.

The control unit 11 controls the prediction system 10 as a whole and may include, for example, a processing unit such as a CPU or GPU. The control unit 11 can train the models described below using the information stored in the memory unit 14, and can also compute predicted values for new inputs with the trained models. Specifically, the control unit 11 may control a model that produces predictions of protein-ligand binding affinity. To this end, the control unit 11 may include an internal memory for storing control programs such as an operating system (OS), programs defining various processing procedures, and data. The control unit 11 can perform the information processing needed to execute various processes using these programs.

The communication unit 12 may include an interface that can be connected to a communication device, such as a router, connected to a communication line, and can control communication between the prediction system 10 and the external server 20.

The input/output interface unit 13 may be an interface connected to the input unit 15 and/or the display unit 16. The user can interact with the prediction system 10 through the input/output interface unit 13. For example, the display unit 16 may be a display means (for example, a liquid crystal or organic EL display, a monitor, or a touch panel) that shows the display screen of an application or the like. The input unit 15 may be, for example, a key input unit, a touch panel, a control pad (for example, a touch pad or a game pad), a mouse, a keyboard, or a microphone.

The memory unit 14 may be a device that stores various databases, tables, and the like. For example, the memory unit 14 may contain information such as protein structure information, ligand structure information, protein-ligand complex structure information, binding affinity information for protein-ligand complex binding sites, genetic information, intermolecular interaction information, and/or protein structure similarity information.

Data set

Protein-ligand binding affinity data from the PDBBind database were used for training and testing the network. As of August 2018, the refined set of the database contains experimental binding affinities for 3767 protein-ligand complexes. Of these, a manually curated core set containing 290 high-resolution entries was used for testing. The remaining 4055 complexes, grouped in the refined set, were used as the training set.

Convolutional Neural Network

To take advantage of convolutional neural networks, the structure of each protein-ligand complex was represented as a three-dimensional (3D) grid. For each protein-ligand complex, the center of mass of the ligand was set as the origin, and the surrounding atomic environment was embedded in a 3D grid with an edge length of 30 Å. Along each of the X, Y, and Z axes, 30 grid boxes were created at 1.0 Å spacing. To capture the pattern of protein-ligand interactions in each grid box, the atomic number density was calculated with the density function below (Equation 1).

Equation 1

Here, r_VDW is the van der Waals radius of the atom, and r is the distance from the atom to the center of the grid box.
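The gridding step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. The formula of Equation 1 is not reproduced in this text, so the sketch assumes the form n(r) = 1 - exp(-(r_VDW / r)^12), which is the density function used in the K_deep work cited earlier. The function names, the symmetric placement of box centers around the origin, and the summation of contributions from several atoms are also assumptions.

```python
import math

GRID_BOXES = 30   # boxes per axis (from the text)
SPACING = 1.0     # box spacing in angstroms (from the text)

def box_centers(n=GRID_BOXES, spacing=SPACING):
    """Box-center coordinates along one axis, symmetric about 0 (assumption)."""
    half = (n - 1) / 2.0
    return [(i - half) * spacing for i in range(n)]

def density(r, r_vdw):
    """Assumed form of Equation 1: n(r) = 1 - exp(-(r_vdw / r)**12)."""
    if r <= 0.0:
        return 1.0                  # limit of the assumed form as r -> 0
    ratio = r_vdw / r
    if ratio > 10.0:
        return 1.0                  # exp(-ratio**12) is already 0 in floating point
    return 1.0 - math.exp(-(ratio ** 12))

def voxelize(atoms, origin):
    """Return grid[i][j][k] holding the summed density in each box.

    atoms  : iterable of (x, y, z, r_vdw) in angstroms
    origin : (x, y, z) of the ligand center of mass
    Summing over atoms is an assumption; the text only says "aggregated".
    """
    axis = box_centers()
    shifted = [(x - origin[0], y - origin[1], z - origin[2], rv)
               for x, y, z, rv in atoms]
    return [[[sum(density(math.dist((gx, gy, gz), (x, y, z)), rv)
                  for x, y, z, rv in shifted)
              for gz in axis] for gy in axis] for gx in axis]
```

Under these assumptions, a box whose center lies well inside an atom's van der Waals radius gets a density close to 1, and the density falls off steeply beyond that radius.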

Atoms are classified into eight classes, which are represented as different channels of the input data. In the embodiment, the atoms of the protein and those of the ligand were treated separately, resulting in 16 real-valued channels representing the aggregated number density of each protein-ligand complex. Table 1 below describes the atom types used in the embodiment.

Table 1. Atom types used to classify the atoms that form the protein-ligand binding site

Atom type | Definition
Hydrophobic | Aliphatic or aromatic C
Aromatic | Aromatic C
Hydrogen bond donor | Donor 1 H-bond or Donor S Spherical H with NA, NS, OA, OS, SA
Hydrogen bond acceptor | Acceptor 1 H-bond or S Spherical N; Acceptor 2 H-bonds or S Spherical O; Acceptor 2 H-bonds S
Positive ionizable | Gasteiger positive charge
Negative ionizable | Gasteiger negative charge
Metallic | Mg, Zn, Mn, Ca, or Fe
Excluded volume | All atom types
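The channel layout implied by Table 1 (eight atom classes, kept separate for protein and ligand atoms) can be sketched as a simple index mapping. The identifier names and the ordering of the channels, with protein channels first, are hypothetical; the text fixes only the totals of 8 classes and 16 channels.

```python
# Eight atom classes from Table 1; the identifier spellings are illustrative.
ATOM_CLASSES = (
    "hydrophobic", "aromatic", "hbond_donor", "hbond_acceptor",
    "positive_ionizable", "negative_ionizable", "metallic", "excluded_volume",
)
MOLECULES = ("protein", "ligand")   # ordering is an assumption
N_CHANNELS = len(MOLECULES) * len(ATOM_CLASSES)

def channel_index(molecule, atom_class):
    """Map (molecule, atom class) to one of the 16 input channels."""
    return MOLECULES.index(molecule) * len(ATOM_CLASSES) + ATOM_CLASSES.index(atom_class)
```

Note that one atom can contribute to more than one channel: an aromatic carbon, for example, matches the hydrophobic, aromatic, and excluded-volume definitions in Table 1.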

To reduce the orientation dependence of the complex structure, the embodiment augmented the data set by rotating each grid through all 24 possible rotation operations.
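A cubic grid can be rotated onto itself in exactly 24 ways, which are the proper rotations of a cube. The sketch below enumerates them as signed permutation matrices with determinant +1 and applies one to a cubic grid. The function names are illustrative, and the text does not say how the rotations were implemented.

```python
from itertools import permutations, product

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def rotation_matrices():
    """All signed permutation matrices with determinant +1 (24 in total)."""
    mats = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = [[0] * 3 for _ in range(3)]
            for row, (col, sign) in enumerate(zip(perm, signs)):
                m[row][col] = sign
            if det3(m) == 1:
                mats.append(m)
    return mats

def rotate_grid(grid, m):
    """Rotate a cubic grid[i][j][k] about its center by the matrix m."""
    n = len(grid)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i, j, k in product(range(n), repeat=3):
        # Doubled, centered coordinates stay integral for even and odd n.
        v = (2 * i - (n - 1), 2 * j - (n - 1), 2 * k - (n - 1))
        w = [sum(m[a][b] * v[b] for b in range(3)) for a in range(3)]
        ni, nj, nk = ((c + n - 1) // 2 for c in w)
        out[ni][nj][nk] = grid[i][j][k]
    return out
```

Applying every matrix from `rotation_matrices()` to each channel of a complex would yield 24 training samples per complex, one of which is the unrotated original.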

네트워크 아키텍처(Network architecture)Network architecture

실시예에서 제공하는 네트워크의 주요 구성은 이미지 인식을 위한 ResNext 모델에서 사용되는 앙상블 기반의 잔여 네트워크(residual network)이다. 이하, 도 1 및 도 2에서 이를 상세하게 설명한다. The main configuration of the network provided in the embodiment is an ensemble-based residual network used in the ResNext model for image recognition. Hereinafter, this will be described in detail in FIGS. 1 and 2.

도 2는 실시예의 네트워크 구조적 전체적인 개략도이다. Figure 2 is a schematic diagram of the overall network structure of the embodiment.

전체 네트워크는 주로 앙상블 기반의 잔여 블럭(residual block)이 14개 적층된 레이어로 구성된다.The entire network mainly consists of 14 stacked layers of ensemble-based residual blocks.

도 3은 각각의 잔여 블록의 구조를 나타낸다. 잔여 블록은 배치 정규화(batch normalization)와 ReLU(Rectified Linear Units) 활성화 레이어들이 결합된 세 개의 적층 컨벌루션 레이어들로 구성된다. 블록의 중간에는, 각각의 채널이 16개의 3D 컨벌루션 레이어들로 분산되어 병렬로 처리된다. Figure 3 shows the structure of each remaining block. The remaining block consists of three stacked convolutional layers combining batch normalization and ReLU (Rectified Linear Units) activation layers. In the middle of the block, each channel is distributed into 16 3D convolutional layers and processed in parallel.

도 2 및 도 3에 있어서 Conv3D는 3D 컨벌루션 뉴럴 네트워크 레이어를 의미하고, BN(batch normalization)은 배치 정규화, RL(residual layer)은 잔여 레이어를 의미한다. 병렬로 배치되는 잔여 네트워크의 개수는 또한 카디널리티(cardinality)로 불린다. In Figures 2 and 3, Conv3D refers to a 3D convolutional neural network layer, BN (batch normalization) refers to batch normalization, and RL (residual layer) refers to a residual layer. The number of remaining networks deployed in parallel is also called cardinality.

In the embodiment, 16 Conv3D layers were used for each residual block. In this specification, the network architecture of the embodiment is referred to as AK-score (Arontier-Kangwon docking scoring function).
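The data flow of one such block (a first layer, a split into 16 parallel branches, concatenation, a final layer, and a skip connection) can be sketched in plain NumPy. This is a deliberately simplified illustration: every convolution is replaced by a pointwise (1x1x1) projection, batch normalization is omitted, and the weight shapes, branch width, and function names are assumptions made for the example, not the configuration of the embodiment.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w_in, w_branches, w_out):
    """Simplified split-transform-merge residual block.

    x          : (D, H, W, C) feature grid
    w_in       : (C, G * d) weights of the first layer
    w_branches : (G, d, d) one weight matrix per parallel branch (G = cardinality)
    w_out      : (G * d, C) weights of the last layer
    Pointwise projections stand in for the real 3D convolutions.
    """
    h = relu(x @ w_in)                                   # first stacked layer
    groups = np.split(h, len(w_branches), axis=-1)       # distribute channels over G branches
    branches = [relu(g @ w) for g, w in zip(groups, w_branches)]
    merged = np.concatenate(branches, axis=-1)           # concatenate branch outputs
    return relu(x + merged @ w_out)                      # skip connection, then activation

# Usage with cardinality 16 and an assumed branch width of 2.
rng = np.random.default_rng(0)
C, G, d = 8, 16, 2
x = rng.normal(size=(4, 4, 4, C))
y = residual_block(
    x,
    rng.normal(size=(C, G * d)) * 0.1,
    rng.normal(size=(G, d, d)) * 0.1,
    rng.normal(size=(G * d, C)) * 0.1,
)
print(y.shape)
```

Because the block output has the same shape as its input, blocks of this kind can be stacked repeatedly, which is what the 14-layer description above relies on.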

The embodiment uses the ReLU activation function. All weight parameters are initialized with the He_normal initialization scheme. The model loss is the mean absolute error (MAE) between the experimental and predicted binding affinity values, in units of kcal/mol.
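As a minimal sketch of these two ingredients, the following NumPy code draws He-normal weights and evaluates the MAE loss. The function names are illustrative. The initializer here samples an ordinary normal distribution with standard deviation sqrt(2 / fan_in); the Keras `he_normal` initializer additionally truncates the tails, which this sketch does not reproduce.

```python
import numpy as np

def he_normal(shape, fan_in, rng):
    """Sample weights from N(0, 2 / fan_in) (untruncated simplification)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

def mean_absolute_error(y_true, y_pred):
    """MAE between experimental and predicted affinities (same units as the inputs)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Usage with made-up affinity values.
rng = np.random.default_rng(0)
weights = he_normal((3, 3, 3, 8, 16), fan_in=3 * 3 * 3 * 8, rng=rng)
print(weights.std(), mean_absolute_error([6.2, 7.9, 4.5], [6.0, 8.4, 5.1]))
```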

For parameter optimization, the Adam optimizer is used with beta-1 = 0.99 and beta-2 = 0.999. To evaluate the effect of the learning rate on the final prediction quality, the model was trained with several learning rates: 0.0001, 0.0005, 0.0007, and 0.0010. During training, the entire data set was randomly shuffled.
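For reference, a single Adam update with the stated beta values can be written out as follows. This is a sketch of the standard Adam rule, not code from the embodiment; the function name, the default learning rate of 0.0007 (one of the evaluated values), and the epsilon of 1e-7 are assumptions chosen for the example.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.0007, beta1=0.99, beta2=0.999, eps=1e-7):
    """Apply one Adam update and return the new parameter and moment estimates.

    t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage: minimize f(p) = p**2 for a few steps.
p, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for step in range(1, 6):
    p, m, v = adam_step(p, 2.0 * p, m, v, step)
print(p)
```

On the first step the bias-corrected ratio equals the sign of the gradient, so the parameter moves by almost exactly one learning rate, which is why the learning rate is the main quantity tuned in the text above.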

Ensemble prediction

To improve prediction accuracy, the embodiment adopted an ensemble prediction scheme in which the final prediction is the average of the predictions of multiple independently trained models. In many machine learning tasks, the parameters of each prediction model are optimized starting from random initial values. When the number of parameters is large, the final parameter sets obtained from different initializations generally do not converge to the same values. To reduce the effect of this tendency, the embodiment trained multiple networks independently and checked whether averaging their predictions yields a better prediction. Averaging is used here for illustration only; the present invention is not limited to it, and other values such as the median may be used.
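The combination step described above amounts to reducing a matrix of per-model predictions along the model axis. A minimal sketch, assuming the predictions are collected in an array of shape (number of models, number of complexes); the function name and the `reduce` argument are illustrative:

```python
import numpy as np

def ensemble_predict(predictions, reduce="mean"):
    """Combine per-model predictions into one value per complex.

    predictions: array-like of shape (n_models, n_complexes).
    reduce: "mean" (as in the embodiment) or "median" (the stated alternative).
    """
    preds = np.asarray(predictions, dtype=float)
    if reduce == "mean":
        return preds.mean(axis=0)
    if reduce == "median":
        return np.median(preds, axis=0)
    raise ValueError(f"unknown reduction: {reduce}")

# Usage: three hypothetical models, two complexes.
per_model = [[6.0, 4.0], [7.0, 5.0], [8.0, 9.0]]
print(ensemble_predict(per_model), ensemble_predict(per_model, reduce="median"))
```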

Performance evaluation

To evaluate the performance of the embodiment, its model was compared with a previously proposed 3D-CNN-based binding affinity prediction model. The K_deep model, which is based on the SqueezeNet architecture used for image classification, served as the comparison model. The models were trained with the same parameters. All models disclosed in this specification were implemented with Keras 2.2.4 using TensorFlow 1.13.1 as the backend.

The embodiment compared the performance of the models from several perspectives. The CASF-2016 benchmark set has been used as a reference point for comparing various docking models. The accuracy of protein-ligand docking prediction is generally measured from three perspectives: 1) scoring: how well do the predicted binding affinity values correlate with the experimental values? 2) ranking: how well are the related binding properties predicted, that is, how well are the ligands ordered by affinity? 3) docking: is the correct docking pose identified as having a lower energy than the decoys?

The Pearson correlation coefficient was calculated to evaluate the scoring power of the model. The Spearman, Kendall tau, and predictive index (PI) values were used to evaluate ranking power. To evaluate docking power, it was checked whether the actual bound ligand pose was included in the top 1, top 2, and top 3 percent of the possible decoys.
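The scoring and ranking measures named above can be computed with a few lines of NumPy. This is a sketch under stated assumptions: the Spearman helper assumes no tied values, the Kendall coefficient is the simple tau-a variant, and the predictive index follows the commonly used affinity-weighted pair-agreement definition of Pearlman and Charifson, which the text itself does not spell out. All function names are illustrative.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Pearson correlation of the ranks (assumes no ties)."""
    rank = lambda a: np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs divided by all pairs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)
    agreement = np.sign(x[j] - x[i]) * np.sign(y[j] - y[i])
    return float(agreement.sum() / len(i))

def predictive_index(experimental, predicted):
    """Pair agreement weighted by the experimental affinity difference."""
    e = np.asarray(experimental, dtype=float)
    p = np.asarray(predicted, dtype=float)
    i, j = np.triu_indices(len(e), k=1)
    weight = np.abs(e[j] - e[i])
    agreement = np.sign(e[j] - e[i]) * np.sign(p[j] - p[i])
    return float((weight * agreement).sum() / weight.sum())

# Usage with made-up affinities for ligands of one target.
exp = [4.1, 5.3, 6.8, 7.2, 9.0]
pred = [4.9, 5.0, 6.1, 7.9, 8.2]
print(pearson(exp, pred), spearman(exp, pred), kendall_tau(exp, pred), predictive_index(exp, pred))
```

Pearson measures agreement of the values themselves, whereas the other three depend only on ordering, which is why the text assigns them to scoring and ranking power respectively.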

Results and analysis

Accuracy of binding affinity prediction

The embodiment performed three different types of experiments based on the AK-score architecture: AK-score-single, AK-score-small, and AK-score-ensemble. AK-score-single used a single prediction network as shown in Figures 1 and 2. AK-score-small used only 10 features, excluding those that are relatively rare compared with the others, for example positive, negative, and metallic in Table 1. AK-score-ensemble used the average of 20 independently trained networks as the final prediction. Table 2 below gives the mean absolute error (MAE) and root mean squared error (RMSE) between the predicted and experimental values for the models trained with different learning rates. For comparison, the results of K_deep, another 3D-CNN-based deep learning model evaluated on the same data groups, are also given.

The benchmark results show that the AK-score-ensemble model gives the most accurate predictions. Among the evaluated models, AK-score-ensemble has the lowest error values, with an MAE of 1.01 kcal/mol and an RMSE of 1.29 kcal/mol. Compared with the single model, the mean error of the ensemble model is about 0.1 kcal/mol lower. Compared with the K_deep model, the mean error of AK-score-ensemble is 0.2 kcal/mol lower. The results also show that choosing the optimal learning rate can improve the MAE by about 0.05 kcal/mol and the RMSE by 0.10 kcal/mol.

Table 2. Prediction accuracy of ResNext-ensemble, ResNext, and K_deep evaluated on the PDBbind-2016 data set using the mean absolute error (MAE) and root mean squared error (RMSE) measures, in kcal/mol.

Model               Learning rate   MAE     RMSE
K_deep              0.0001          1.131   1.462
K_deep              0.0005          1.200   1.519
K_deep              0.0006          1.164   1.534
K_deep              0.0010          1.219   1.536
AK-score single     0.0001          1.159   1.511
AK-score single     0.0005          1.101   1.415
AK-score single     0.0007          1.130   1.425
AK-score single     0.0010          1.110   1.406
AK-score small      0.0006          1.128   1.462
AK-score ensemble   0.0007          1.014   1.293

The embodiment also evaluated model performance on the CASF-2016 data set using three criteria: scoring, ranking, and docking power. Table 3 below shows the details. The AK-score-ensemble model performs best on all three criteria. Scoring power is expressed as the Pearson correlation coefficient between predicted and experimental values. Overall, the AK-score models gave higher correlation values than the K_deep model. Of all the evaluated models, only AK-score-ensemble gave a correlation coefficient above 0.8. In ranking power, the AK-score models outperformed the K_deep model on average, and AK-score-ensemble gave the highest rank correlation coefficients. In docking power, the difference between the K_deep model and the AK-score-single model was less pronounced than for the other criteria. However, the predictions of the AK-score-ensemble model were clearly better than those of the K_deep model.

Table 3. Comparison of prediction accuracy on the CASF-2016 data set. Scoring: Pearson (R). Ranking: Spearman (SP), Kendall (tau), Predictive index (PI). Docking: Top1, Top2, Top3 (%).

Model               Learning rate   R       SP      tau     PI      Top1 (%)   Top2 (%)   Top3 (%)
K_deep              0.0001          0.738   0.539   0.435   0.559   24.8       38.5       52.2
K_deep              0.0005          0.709   0.486   0.389   0.535   29.1       39.9       49.6
K_deep              0.0006          0.701   0.528   0.439   0.558   29.1       39.9       49.6
K_deep              0.0010          0.715   0.479   0.400   0.492   24.8       36.3       44.6
AK-score-single     0.0001          0.719   0.572   0.456   0.600   34.9       48.6       56.1
AK-score-single     0.0005          0.755   0.596   0.512   0.616   29.9       43.2       54.0
AK-score-single     0.0007          0.759   0.616   0.526   0.640   31.3       47.1       57.9
AK-score-single     0.0010          0.760   0.598   0.505   0.627   26.4       43.9       54.0
AK-score-small      0.0006          0.741   0.567   0.495   0.586   28.4       46.8       56.5
AK-score-ensemble   0.0007          0.812   0.670   0.589   0.698   36.0       51.4       59.7

Ensembles of networks improve prediction quality

Figure 4 shows how prediction quality changes with the number of networks.

Figure 4(a) shows the scoring power, measured by the Pearson correlation coefficient between the experimental and predicted binding affinity values. Figure 4(b) shows the ranking power, measured by three rank correlation coefficients: Spearman (SP), Kendall tau (tau), and the predictive index (PI).

The results of the embodiment show that prediction accuracy improves overall as more networks are used. For all three quality measures (scoring, ranking, and docking power), accuracy increased sharply when going from one network to five networks. For scoring power, the Pearson correlation coefficient (R) between experimental and predicted values increased until the number of networks reached 25, as shown in Figure 4(a). With a single network the correlation coefficient is 0.74, whereas the average of five networks gives a value above 0.80. Beyond 10 networks the improvement is gradual, but it continues until 25 networks are used. Similarly, ranking power increases until 25 networks are used. All three ranking measures, SP, tau, and PI, increase consistently when ensemble averaging is used, as shown in Figure 4(b). These results clearly show that using an ensemble of prediction networks considerably improves prediction quality. It is a simple and clear way to improve prediction accuracy without designing new network architectures.
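The kind of curve described for Figure 4(a) can be produced by averaging the first k models for k = 1, 2, ... and computing the Pearson coefficient each time. The sketch below uses synthetic data (true values plus independent noise per model) purely to illustrate the procedure; the data, the noise level, and the function name are assumptions and do not reproduce the reported numbers.

```python
import numpy as np

def pearson_vs_ensemble_size(predictions, y_true):
    """Return Pearson R of the running ensemble mean for sizes 1..n_models.

    predictions: array-like of shape (n_models, n_complexes).
    """
    preds = np.asarray(predictions, dtype=float)
    sizes = np.arange(1, len(preds) + 1)[:, None]
    running_mean = np.cumsum(preds, axis=0) / sizes   # mean of the first k models
    return [float(np.corrcoef(row, y_true)[0, 1]) for row in running_mean]

# Usage on synthetic data: 25 noisy "models" of the same target values.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
noisy_models = y + rng.normal(size=(25, 2000))
curve = pearson_vs_ensemble_size(noisy_models, y)
print(round(curve[0], 3), round(curve[4], 3), round(curve[-1], 3))
```

With independent errors the gain is largest for the first few models and then flattens, which matches the qualitative behavior reported above.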

Comparison with existing scoring functions

Figure 5 shows the benchmark results of existing protein-ligand binding affinity scoring functions and AK-score. Figure 5(a) shows that AK-score has the highest scoring power among the existing scoring functions. Figure 5(b) shows that, in ranking power, AK-score ranks immediately after vina-RF₂₀, the best-performing function currently known.

The benchmark results show that the prediction accuracy of AK-score is comparable to that of the best existing scoring functions on the CASF-2016 data set. The CASF-2016 data set provides precomputed prediction results for known scoring functions, which allows a fair comparison between the model of the embodiment and existing scoring functions on exactly the same test set. For scoring power, the highest correlation coefficient obtained by AK-score is 0.828, which is higher than 0.816, the best value obtained by vina-RF₂₀. The ranking power of AK-score is 0.736, slightly lower than the best value of 0.761 for vina-RF₂₀ and second among the scoring functions tested.

Figure 6 compares the experimental binding affinities with the values predicted by AK-score-ensemble, X-score, and Autodock vina. Figure 6(a) is a scatter plot of the AK-score-ensemble predictions against the experimental binding affinities. Figures 6(b) and 6(c) are the corresponding scatter plots for the Autodock vina predictions and the X-score predictions, respectively.

Overall, the AK-score-ensemble results correlate strongly with the experimental values, with a correlation coefficient of 0.827 over all tested complexes. In contrast, the X-score and Autodock results are clearly biased. On average, X-score considerably underestimates the absolute binding affinity, which appears as a small slope of the regression line. Autodock vina correlates better than X-score, but it also underestimates the experimental values to some extent. In summary, the results clearly show that AK-score-ensemble outperforms commonly used empirical scoring functions in predicting absolute binding affinity.

Evaluation of feature importance

To gain chemical and biological insight from the trained network, it is necessary to identify which atomic features play an important role in determining the binding affinity of a protein-ligand complex. To this end, the embodiment ran two additional experiments in which predictions were made after 1) setting all values of a specific channel to zero, or 2) randomly shuffling the values of a channel. The rationale for the second experiment is the conjecture that zeroing an entire channel loses too much information, and that preserving the mean and variance of the channel values may be important for making reasonable predictions.
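Both ablation experiments can be sketched as one helper that perturbs a single channel and reports the resulting MAE. The sketch assumes inputs of shape (samples, D, H, W, channels) and a `predict` callable that maps such an array to one affinity per sample; these conventions, the function name, and the choice to shuffle voxel values within each sample are assumptions made for illustration. The toy model at the end exists only to make the example runnable.

```python
import numpy as np

def mae_with_perturbed_channel(predict, grids, y_true, channel, mode, rng=None):
    """MAE after zeroing or shuffling one input channel.

    mode="zero":    every value of the channel is set to 0.
    mode="shuffle": voxel values of the channel are permuted within each sample,
                    which preserves the channel's mean and variance.
    """
    x = np.array(grids, dtype=float)  # work on a copy
    if mode == "zero":
        x[..., channel] = 0.0
    elif mode == "shuffle":
        if rng is None:
            rng = np.random.default_rng()
        for sample in x:
            values = sample[..., channel].ravel()
            sample[..., channel] = rng.permutation(values).reshape(sample.shape[:-1])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return float(np.mean(np.abs(predict(x) - np.asarray(y_true, dtype=float))))

# Usage with a toy "model" that only looks at channel 0.
rng = np.random.default_rng(0)
grids = rng.random((5, 3, 3, 3, 2))
toy_predict = lambda g: g[..., 0].sum(axis=(1, 2, 3))
y = toy_predict(grids)
for ch in (0, 1):
    print(ch, mae_with_perturbed_channel(toy_predict, grids, y, ch, "zero"))
```

The increase in MAE relative to the unperturbed input is then read as the importance of that channel, which is the quantity plotted per feature in the figure discussed next.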

Overall, the two experiments confirmed that the excluded volumes of the ligand and of the binding site are the most important features in determining the binding affinity of a protein-ligand complex.

Figure 7 shows these results as feature importance measured by the loss of prediction accuracy. The mean absolute error (MAE) of the prediction is shown on the Y-axis in kcal/mol. In Figure 7(a), the channel corresponding to the feature on the X-axis is filled with zeros; in Figure 7(b), the values of the channel corresponding to the feature are randomly shuffled.

In other words, shape complementarity between the binding site and the ligand is the most important factor in determining the binding affinity of the complex. As shown in Figure 7(a), when the excluded volume information of the ligand is removed, the average prediction accuracy worsens by 1.4 kcal/mol. After the excluded volume information, the hydrophobic atom information of the ligand and of the binding site is the second most important factor. For the binding site, the hydrogen bond acceptor atoms are the third most important; interestingly, for the ligand, the aromatic atoms are the third most important. As Figure 7(b) shows, the overall trend in the shuffling experiment is similar to that of the zeroing experiment, but the average decrease in prediction accuracy is smaller. The most notable difference for the binding site is that the relative importance of the hydrogen bond acceptor atoms becomes greater than that of the hydrophobic atoms.

Conclusion

The embodiment provides AK-score, a new binding affinity prediction model that combines a multi-branch deep learning network architecture with an ensemble prediction approach. The embodiment predicts the binding affinity of protein-ligand complexes with considerably high accuracy, giving results comparable to the best existing scoring functions in scoring and ranking power. The ensemble-based approach of the embodiment, which averages multiple independently trained models, is very simple yet powerful. The embodiment is not limited to a specific network model, and various existing machine-learning-based models may be applied. The embodiment also provides insight into the relative importance of atoms according to their chemical properties. As the feature importance experiments show, for the ligand, the excluded volume of the atoms and the spatial distributions of hydrophobic and aromatic atoms play a very important role in determining the binding affinity. For the protein, the excluded volume of the atoms and the distributions of hydrophobic atoms and hydrogen bond acceptors were identified as important factors.

The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

In addition, the embodiments of the present invention described above may be a set of program instructions that can be executed by various computer components, together with the user application that executes them. Specifically, they may be a program that can be downloaded from a server or from a storage medium and installed on a client computer.

The present invention has been described above with reference to specific details such as particular components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention; the present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set out below but also all equal or equivalent modifications of those claims fall within the scope of the spirit of the present invention.

10: Prediction system 20: External server
11: Control unit 12: Communication unit
13: Input/output interface unit 14: Memory unit
15: Input unit 16: Display unit

Claims

A method of predicting protein-ligand binding affinity through a prediction system including a control unit and a memory unit,
Storing the structure of the protein-ligand complex in a memory unit as three-dimensional information with the center of mass of the ligand in the protein-ligand complex as the origin and embedding the surrounding atomic environment in a 3D grid;
storing a protein-ligand binding affinity value corresponding to the three-dimensional information in the memory unit;
Training a prediction model through the control unit, wherein the prediction model takes the stored three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as an output value, and comprises one or more independently processed 3D convolutional neural networks; and
Generating a predicted protein-ligand binding affinity value using the prediction model, wherein a predicted protein-ligand binding affinity value for a new protein-ligand complex is generated through each of the one or more 3D convolutional neural networks, and a final predicted value is generated based on the respective predicted protein-ligand binding affinity values,
The three-dimensional information includes the pattern of protein-ligand interaction calculated through the density function below,

Here, n(r) is the atomic number density, r_VDW is the van der Waals radius of the atom, and r is the distance from the atom to the center of the grid. Protein-ligand binding affinity prediction method.

According to claim 1,
The final prediction value generated by the prediction model is an average of each protein-ligand binding affinity prediction value generated through the one or more 3D convolutional neural networks.

A method of learning a prediction model that predicts protein-ligand binding affinity through a prediction system including a control unit and a memory unit,
Storing the structure of the protein-ligand complex in a memory unit as three-dimensional information with the center of mass of the ligand in the protein-ligand complex as the origin and embedding the surrounding atomic environment in a 3D grid;
storing a protein-ligand binding affinity value corresponding to the three-dimensional information in the memory unit; and
Training a prediction model through the control unit, wherein the prediction model takes the stored three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as an output value, and comprises one or more independently processed 3D convolutional neural networks,
The three-dimensional information includes the pattern of protein-ligand interaction calculated through the density function below,

Here, n(r) is the atomic number density, r_VDW is the van der Waals radius of the atom, and r is the distance from the atom to the center of the grid. Method of learning a prediction model to predict protein-ligand binding affinity.

A method of predicting protein-ligand binding affinity through a prediction system including a control unit and a memory unit,
Storing the structure of the protein-ligand complex in a memory unit as three-dimensional information with the center of mass of the ligand in the protein-ligand complex as the origin and embedding the surrounding atomic environment in a 3D grid;
storing a protein-ligand binding affinity value corresponding to the three-dimensional information in the memory unit;
Training a prediction model through the control unit, wherein the prediction model takes the stored three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as an output value, and comprises one or more independently processed 3D convolutional neural networks; and
Generating a predicted protein-ligand binding affinity value using the prediction model, wherein a predicted protein-ligand binding affinity value for a new protein-ligand complex is generated through each of the one or more 3D convolutional neural networks, and a final predicted value is generated based on the respective predicted protein-ligand binding affinity values,
A method for predicting protein-ligand binding affinity, further comprising classifying the three-dimensional information into a plurality of classes and expressing them in different channels of the input value.

A method of predicting protein-ligand binding affinity through a prediction system including a control unit and a memory unit,
Storing the structure of the protein-ligand complex in a memory unit as three-dimensional information with the center of mass of the ligand in the protein-ligand complex as the origin and embedding the surrounding atomic environment in a 3D grid;
storing a protein-ligand binding affinity value corresponding to the three-dimensional information in the memory unit;
Training a prediction model through the control unit, wherein the prediction model takes the stored three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as an output value, and comprises one or more independently processed 3D convolutional neural networks; and
Generating a predicted protein-ligand binding affinity value using the prediction model, wherein a predicted protein-ligand binding affinity value for a new protein-ligand complex is generated through each of the one or more 3D convolutional neural networks, and a final predicted value is generated based on the respective predicted protein-ligand binding affinity values,
A method for predicting protein-ligand binding affinity, further comprising the step of rotating the three-dimensional information using 24 rotation operations and adding input values.

According to claim 2,
The 3D convolutional neural network includes one or more stacked ensemble-based residual block layers,
A protein-ligand binding affinity prediction method, wherein each residual block layer includes one or more stacked convolutional layers combined with a batch normalization and ReLU activation layer.

The method according to claim 6,
wherein one or more parallel-processed 3D convolutional neural network layers are included in the middle of the one or more stacked convolutional layers.

The method according to claim 7,
wherein the results of the one or more parallel-processed 3D convolutional neural network layers are concatenated and input to the remaining block layer.
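The block structure recited in the three preceding claims (stacked convolutions with batch normalization and ReLU, parallel convolution branches in the middle, and concatenation of the branch outputs) can be sketched in plain NumPy as follows. This is an illustrative forward pass only, not the claimed network: the kernel sizes, the number of branches, the 1x1x1 merging convolution, and the per-channel standardization used as a stand-in for batch normalization on a single sample are all assumptions, and every function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_same(x, w):
    """Zero-padded 3D cross-correlation.

    x: (C, X, Y, Z) input; w: (O, C, k, k, k) weights with odd k.
    Returns an array of shape (O, X, Y, Z).
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    # windows has shape (C, X, Y, Z, k, k, k)
    windows = sliding_window_view(xp, (k, k, k), axis=(1, 2, 3))
    return np.einsum("cxyzijk,ocijk->oxyz", windows, w)

def channel_norm(x, eps=1e-5):
    """Stand-in for batch normalization: standardize each channel (assumption)."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w_in, w_branches, w_out):
    """Conv -> parallel conv branches -> concatenate -> conv, plus skip connection."""
    h = relu(channel_norm(conv3d_same(x, w_in)))
    # Parallel branches each see the same intermediate tensor.
    branches = [relu(channel_norm(conv3d_same(h, w))) for w in w_branches]
    h = np.concatenate(branches, axis=0)  # concatenate along the channel axis
    h = channel_norm(conv3d_same(h, w_out))  # restore the input channel count
    return relu(h + x)  # residual (skip) connection
```

The skip connection requires `w_out` to map the concatenated branch channels back to the channel count of `x`; in a trained model the weights would be learned rather than supplied by hand.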

The method according to claim 1,
wherein the number of the one or more 3D convolutional neural networks is 5 or more and 25 or less.

A protein-ligand binding affinity prediction system comprising:
a communication unit capable of communicating with an external server;
an input/output interface unit that controls an input unit and a display unit;
a memory unit for storing data; and
a control unit for executing a protein-ligand binding affinity prediction model,
wherein the memory unit contains three-dimensional information, in which the structure of a protein-ligand complex is represented with the center of mass of the ligand in the protein-ligand complex as the origin and the surrounding atomic environment embedded in a 3D grid, and protein-ligand binding affinity values corresponding to the three-dimensional information,
the control unit trains the prediction model by independently processing one or more 3D convolutional neural networks, using the three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as a calculation value,
the control unit generates a predicted protein-ligand binding affinity value for a new protein-ligand complex through each of the one or more 3D convolutional neural networks, and generates a final predicted value based on the respective predicted protein-ligand binding affinity values, and
the three-dimensional information includes a pattern of protein-ligand interaction calculated through the density function below,

where n(r) is the atomic number density, r_VDW is the van der Waals radius of the atom, and r is the distance from the atom to the center of the grid.
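The grid embedding recited above can be sketched as follows. The density function itself is not reproduced in this text, so the sketch assumes the form n(r) = 1 - exp(-(r_VDW / r)^12), a smooth occupancy function that is consistent with the symbol definitions in the claim; this form is an assumption and not a quotation of the claimed equation. The grid size, the voxel spacing, the reading of "center of the grid" as each voxel center, the use of a per-channel maximum for overlapping atoms, and all function names are likewise hypothetical.

```python
import numpy as np

def center_of_mass(xyz, mass):
    """Mass-weighted mean position of the ligand atoms; used as the grid origin."""
    return (xyz * mass[:, None]).sum(axis=0) / mass.sum()

def density(r, r_vdw):
    """ASSUMED form n(r) = 1 - exp(-(r_vdw / r)^12); not taken from the claim text."""
    r = np.maximum(r, 1e-6)  # avoid division by zero at an atom's own position
    return 1.0 - np.exp(-(r_vdw / r) ** 12)

def embed(atom_xyz, atom_rvdw, atom_class, origin, n_classes, size=9, spacing=1.0):
    """Return a (n_classes, size, size, size) grid centered on `origin`."""
    ticks = (np.arange(size) - (size - 1) / 2.0) * spacing
    gx, gy, gz = np.meshgrid(ticks, ticks, ticks, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1)  # voxel centers, (size, size, size, 3)
    grid = np.zeros((n_classes, size, size, size))
    for xyz, r_vdw, cls in zip(atom_xyz - origin, atom_rvdw, atom_class):
        r = np.linalg.norm(centers - xyz, axis=-1)
        # Each atom class writes to its own channel; where atoms overlap,
        # the larger density is kept (an assumption of this sketch).
        grid[cls] = np.maximum(grid[cls], density(r, r_vdw))
    return grid
```

With this assumed function the density is close to 1 inside the van der Waals radius and decays rapidly to 0 beyond it, so each channel approximates the volume occupied by one class of atoms.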

The system according to claim 10,
wherein the final predicted value generated by the prediction model is an average of the predicted protein-ligand binding affinity values generated through each of the one or more 3D convolutional neural networks.
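The averaging step recited above can be sketched as follows, assuming each independently trained network is available as a callable that maps an input grid to a scalar affinity. The name `ensemble_predict` and the callable interface are hypothetical; the standard deviation is returned only as an optional indication of how much the networks disagree and is not part of the claim.

```python
import numpy as np

def ensemble_predict(models, grid):
    """Average the predictions of independently trained models.

    `models` is a sequence of callables mapping a grid to a scalar affinity.
    Returns (mean prediction, standard deviation across models).
    """
    if len(models) == 0:
        raise ValueError("at least one model is required")
    preds = np.array([float(model(grid)) for model in models])
    return preds.mean(), preds.std()
```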

A protein-ligand binding affinity prediction system comprising:
a communication unit;
an input/output interface unit;
a memory unit for storing data; and
a control unit for executing a protein-ligand binding affinity prediction model,
wherein the memory unit contains three-dimensional information, in which the structure of a protein-ligand complex is represented with the center of mass of the ligand in the protein-ligand complex as the origin and the surrounding atomic environment embedded in a 3D grid, and protein-ligand binding affinity values corresponding to the three-dimensional information,
the control unit trains the prediction model by independently processing one or more 3D convolutional neural networks, using the three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as a calculation value,
the control unit generates a predicted protein-ligand binding affinity value using the prediction model, and
the three-dimensional information includes a pattern of protein-ligand interaction calculated through the density function below,

where n(r) is the atomic number density, r_VDW is the van der Waals radius of the atom, and r is the distance from the atom to the center of the grid.

A protein-ligand binding affinity prediction system comprising:
a communication unit;
an input/output interface unit;
a memory unit for storing data; and
a control unit for executing a protein-ligand binding affinity prediction model,
wherein the memory unit contains three-dimensional information, in which the structure of a protein-ligand complex is represented with the center of mass of the ligand in the protein-ligand complex as the origin and the surrounding atomic environment embedded in a 3D grid, and protein-ligand binding affinity values corresponding to the three-dimensional information,
the control unit trains the prediction model by independently processing one or more 3D convolutional neural networks, using the three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as a calculation value,
the control unit generates a predicted protein-ligand binding affinity value for a new protein-ligand complex through each of the one or more 3D convolutional neural networks, and generates a final predicted value based on the respective predicted protein-ligand binding affinity values, and
the control unit classifies the three-dimensional information into a plurality of classes and expresses the classes in different channels of the input value.

A protein-ligand binding affinity prediction system comprising:
a communication unit;
an input/output interface unit;
a memory unit for storing data; and
a control unit for executing a protein-ligand binding affinity prediction model,
wherein the memory unit contains three-dimensional information, in which the structure of a protein-ligand complex is represented with the center of mass of the ligand in the protein-ligand complex as the origin and the surrounding atomic environment embedded in a 3D grid, and protein-ligand binding affinity values corresponding to the three-dimensional information,
the control unit trains the prediction model by independently processing one or more 3D convolutional neural networks, using the three-dimensional information as an input value and the corresponding protein-ligand binding affinity value as a calculation value,
the control unit generates a predicted protein-ligand binding affinity value for a new protein-ligand complex through each of the one or more 3D convolutional neural networks, and generates a final predicted value based on the respective predicted protein-ligand binding affinity values, and
the control unit rotates the three-dimensional information using 24 rotation operations and adds the rotated information to the input values.

The system according to claim 11,
wherein the 3D convolutional neural network includes one or more stacked ensemble-based residual block layers, and
each residual block layer includes one or more stacked convolutional layers combined with batch normalization and ReLU activation layers.

The system according to claim 15,
wherein one or more parallel-processed 3D convolutional neural network layers are included in the middle of the one or more stacked convolutional layers.

The system according to claim 16,
wherein the results of the one or more parallel-processed 3D convolutional neural network layers are concatenated and input to the remaining block layer.

The system according to claim 10,
wherein the number of the one or more 3D convolutional neural networks is 5 or more and 25 or less.