KR20230073366A

KR20230073366A - Method for new drug development using AI

Info

Publication number: KR20230073366A
Application number: KR1020210155287A
Authority: KR
Inventors: 장지환
Original assignee: 주식회사 유케어트론
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2023-05-25

Abstract

According to one embodiment of the present invention, a method for developing a new drug using AI is provided that allows new drugs to be developed more efficiently. The method comprises: selecting a ligand that binds to a target protein; decomposing the ligand into molecular descriptors (MDs) that represent the ligand's intrinsic material properties; obtaining the per-ligand distribution of each decomposed MD in a bioassay; removing outliers from the decomposed MDs; and performing regression analysis with a machine learning or AI technique using the MDs that remain after the outliers are removed.

Description

New drug development method using artificial intelligence {METHOD FOR NEW DRUG DEVELOPMENT USING AI}

The present invention relates to a method for new drug development using artificial intelligence and, more specifically, to a method that enables new drugs to be developed more efficiently by decomposing ligands for a target protein into molecular descriptors, removing outliers from them, and then performing regression analysis.

A common approach to new drug design is to take bioassay data for ligands that bind to a target protein from a database such as ChEMBL, treat a target value (Y) such as IC50 as the dependent variable, use the ligand structure or SMILES code as the independent variable (X), and perform regression analysis with a method such as machine learning. However, because the bioassay data in databases such as ChEMBL come from experiments by different groups, they lack consistency and often follow different experimental protocols, so the accuracy or precision of the independent and dependent variables is reduced.

Prior art document: Korean Laid-open Patent Publication No. 10-2022-0169886

The prior art document relates to an 'apparatus and method for discovering hit molecules for new drug development'. It discloses a storage unit that stores a program for generating molecules and discovering hit molecules for new drug development, and a control unit that generates molecules and discovers hit molecules by executing the program. The control unit generates fragment-based molecules by attaching fragments to an arbitrary initial molecule using a molecule generation algorithm, and discovers hit molecules by applying an exploratory reinforcement learning algorithm based on the experience information generated while the fragment-based molecules are produced.

One embodiment of the present invention can provide a method that enables new drugs to be developed more efficiently by decomposing ligands for a target protein into molecular descriptors, removing outliers from them, and then performing regression analysis.

One embodiment of the present invention can provide a method for new drug development using artificial intelligence, comprising: selecting a ligand that binds to a target protein; decomposing the ligand into molecular descriptors (MDs) that represent the ligand's intrinsic material properties; obtaining the per-ligand distribution of each decomposed molecular descriptor in a bioassay; removing outliers from the decomposed molecular descriptors (MDs); and performing regression analysis with a machine learning or artificial intelligence technique using the molecular descriptors (MDs) that remain after the outliers are removed.

The method for new drug development using artificial intelligence according to this embodiment may further include a dimension reduction step applied to the molecular descriptor data before the step of removing the outliers.

The step of removing the outliers may be characterized by removing feature outliers and molecular outliers.

According to one embodiment of the present invention, a method is obtained that enables new drugs to be developed more efficiently by decomposing ligands for a target protein into molecular descriptors, removing outliers from them, and then performing regression analysis.

FIGS. 1 to 7 are diagrams illustrating, as one embodiment of the present invention, a method of carrying out new drug development using artificial intelligence for the BRD4 protein.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1(a) is a diagram showing the structure of the BRD4 protein, and FIG. 1(b) shows inhibitors of the BRD4 protein. In this embodiment, the BRD4 protein was selected as the target protein.

The BET protein family consists of four members: BRD2, BRD3, BRD4, and BRDT. Each member contains two N-terminal tandem bromodomains (BD1 and BD2), an extra-terminal domain (ET), several conserved regions (A, B, and SEED regions), and a C-terminal motif (CTM). Among them, BRD4 is the most extensively studied member. The occurrence of hematologic tumors, including lymphoma (e.g., acute myelolymphoma), leukemia (e.g., acute lymphoblastic leukemia), and myeloma (e.g., multiple myeloma), and of solid tumors such as neurocytoma, glioma, breast cancer (e.g., triple-negative breast cancer), gastrointestinal tumors (e.g., colorectal cancer), and prostate cancer is associated with overexpression of BRD4.

In this embodiment, a step of selecting a ligand that binds to a target protein can be performed. FIG. 2 is a diagram showing the bioassay data for the BRD4 protein available in the ChEMBL database. ChEMBL is a manually curated chemical database of bioactive molecules with drug-like properties. It is maintained by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL), located on the Wellcome Genome Campus in Hinxton, UK. Referring to FIG. 2, the ChEMBL database contains about 1600 ligands for the BRD4 protein when IC50 is taken as the target value (Y). IC50, the half maximal inhibitory concentration, is a measure of the potency of a substance in inhibiting a specific biological or biochemical function. It is a quantitative measure of the amount of a particular inhibitor required to inhibit a given biological process or biological component by 50% in vitro. In this step, such a database can be used to select ligands for the target protein.
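The definition of IC50 as the concentration giving 50% inhibition can be illustrated with a minimal sketch. The function name `estimate_ic50` and the dose-response values are hypothetical and are not taken from the specification. The sketch interpolates on a log-concentration scale between the two measurements that bracket 50% inhibition, which is a simplification; a fitted dose-response curve is commonly used in practice.

```python
import math


def estimate_ic50(concentrations, inhibitions):
    """Interpolate the concentration at which inhibition reaches 50%.

    concentrations: ascending concentrations (e.g. in nM)
    inhibitions: percent inhibition measured at each concentration
    """
    points = list(zip(concentrations, inhibitions))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        # Find the first pair of measurements that brackets 50% inhibition.
        if i_lo <= 50.0 <= i_hi and i_hi > i_lo:
            frac = (50.0 - i_lo) / (i_hi - i_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    raise ValueError("50% inhibition is not bracketed by the measurements")


# Hypothetical dose-response data, for illustration only.
conc_nm = [1.0, 10.0, 100.0, 1000.0]
inhib_pct = [5.0, 30.0, 70.0, 95.0]
ic50_nm = estimate_ic50(conc_nm, inhib_pct)
```

With these illustrative values, 50% inhibition falls halfway between 10 nM and 100 nM on the log scale, so the estimate is about 31.6 nM.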

In this embodiment, the selected ligands can be decomposed into molecular descriptors (MDs) that represent the ligand's intrinsic material properties. FIG. 3 shows an example in which each of the 1600 ligands is decomposed into molecular descriptors. Here, each ligand can be decomposed into its intrinsic properties, such as molecular weight, solubility, and charge. In this embodiment, the approximately 1600 ligands can be decomposed into 4,129 molecular descriptors.
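The decomposition of ligands into descriptors can be pictured as building a table with one row per ligand and one column per descriptor. The following is a minimal sketch under stated assumptions: the three toy descriptors, the `descriptors` function, and the ligands `ligand_A` and `ligand_B` are hypothetical, and the specification does not state which software or descriptor set produced its 4,129 descriptors.

```python
# Standard atomic masses for the few elements used in this toy example.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def descriptors(atom_counts):
    """Compute a few toy descriptors from an element-count dictionary."""
    heavy = sum(n for el, n in atom_counts.items() if el != "H")
    hetero = sum(n for el, n in atom_counts.items() if el not in ("C", "H"))
    return {
        "mol_weight": sum(ATOMIC_MASS[el] * n for el, n in atom_counts.items()),
        "heavy_atoms": float(heavy),
        "hetero_fraction": hetero / heavy if heavy else 0.0,
    }


# Hypothetical ligands given as element counts, for illustration only.
ligands = {
    "ligand_A": {"C": 6, "H": 6},
    "ligand_B": {"C": 2, "H": 6, "O": 1},
}

names = ["mol_weight", "heavy_atoms", "hetero_fraction"]
# Descriptor matrix X: one row per ligand, one column per descriptor.
X = [[descriptors(formula)[name] for name in names] for formula in ligands.values()]
```

In the embodiment the corresponding matrix would have about 1600 rows and 4,129 columns; a cheminformatics toolkit would normally compute the descriptors from structures or SMILES strings rather than from element counts.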

In this embodiment, the per-ligand distribution of each decomposed molecular descriptor in the bioassay can be obtained. The approximately 4,129 molecular descriptors decomposed in this embodiment can serve as the independent variables (X).
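One possible reading of this step is a per-descriptor summary of how the values are spread across the ligands. The sketch below assumes that reading; the function `describe_columns`, the descriptor names, and the data are hypothetical and are not specified in the source.

```python
import statistics


def describe_columns(X, names):
    """Summarize the distribution of each descriptor column across ligands."""
    summary = {}
    for j, name in enumerate(names):
        column = [row[j] for row in X]
        summary[name] = {
            "mean": statistics.mean(column),
            "stdev": statistics.pstdev(column),  # population standard deviation
            "min": min(column),
            "max": max(column),
        }
    return summary


# Hypothetical descriptor matrix: rows are ligands, columns are descriptors.
names = ["mol_weight", "n_rings"]
X = [[300.0, 2.0], [350.0, 2.0], [400.0, 2.0]]
summary = describe_columns(X, names)
```

A summary of this kind makes constant columns (here `n_rings`, with zero spread) and unusually wide columns easy to spot before any filtering is applied.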

In this embodiment, a step of removing outliers from the decomposed molecular descriptors (MDs) can be performed. FIG. 5 shows an example of removing outliers, that is, abnormal ligand data, in this embodiment. An outlier is a value that lies abnormally far outside the distribution of a variable. This includes values that are abnormally extreme within a variable's distribution, data that lack validity, and unrealistic variable values. Outliers exist not only in univariate distributions but also in multivariate distributions. In this embodiment, the outlier removal step may remove feature outliers and molecular outliers. In this embodiment, 1566 molecules with 217 molecular descriptors may remain after the abnormal ligand data are removed.
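The specification does not state the criteria used to identify feature outliers and molecular outliers. The following sketch therefore rests on two assumptions that are not from the source: a feature outlier is taken to be a descriptor column that is constant across all ligands, and a molecular outlier is taken to be a ligand with any descriptor more than `z_max` population standard deviations from the column mean. All names and data are hypothetical.

```python
import statistics


def drop_feature_outliers(X, names):
    """Drop descriptor columns that are constant (assumed criterion)."""
    keep = [j for j in range(len(names))
            if max(row[j] for row in X) > min(row[j] for row in X)]
    return [[row[j] for j in keep] for row in X], [names[j] for j in keep]


def drop_molecular_outliers(X, y, z_max=2.5):
    """Drop ligands with any descriptor z-score above z_max (assumed criterion)."""
    columns = list(zip(*X))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    kept = [
        i for i, row in enumerate(X)
        # A zero-spread column cannot flag any ligand, so it is skipped.
        if all(s == 0 or abs(v - m) / s <= z_max for v, m, s in zip(row, means, sds))
    ]
    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical data: the last ligand has an extreme mol_weight,
# and the "const" descriptor carries no information.
names = ["mol_weight", "logp", "const"]
X = [
    [300.0, 2.0, 1.0], [310.0, 2.1, 1.0], [320.0, 1.9, 1.0], [330.0, 2.2, 1.0],
    [340.0, 2.0, 1.0], [350.0, 2.1, 1.0], [360.0, 1.8, 1.0], [2000.0, 2.0, 1.0],
]
y = [50.0, 45.0, 60.0, 40.0, 55.0, 42.0, 70.0, 48.0]  # hypothetical IC50 values

X_feat, names_clean = drop_feature_outliers(X, names)
X_clean, y_clean = drop_molecular_outliers(X_feat, y)
```

Here the constant column is dropped first and the ligand with the extreme molecular weight is removed afterwards, leaving seven ligands with two descriptors. This mirrors, on a small scale, the reduction to 1566 molecules with 217 descriptors reported for the embodiment.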

In this embodiment, a dimension reduction step applied to the molecular descriptor data may further be included before the step of removing the outliers. Dimension reduction refers to various methods of reducing the amount of data. FIG. 4 is a diagram showing dimension reduction carried out through data curation in this embodiment. In this embodiment, dimension reduction may be carried out by the RDR (reduced design region) approach.
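The specification names the RDR approach but does not describe its procedure. The sketch below is therefore not the RDR approach; it substitutes a generic correlation filter as a stand-in to show what reducing the number of descriptor columns can look like. The function names, the threshold `r_max`, and the data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_filter(X, names, r_max=0.95):
    """Keep a descriptor only if it is not highly correlated with one already kept."""
    columns = [list(c) for c in zip(*X)]
    kept = []
    for j in range(len(columns)):
        if all(abs(pearson(columns[j], columns[k])) < r_max for k in kept):
            kept.append(j)
    return [[row[j] for j in kept] for row in X], [names[j] for j in kept]


# Hypothetical data: md_2 is exactly twice md_1, so it adds no information.
names = ["md_1", "md_2", "md_3"]
X = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 6.0],
    [4.0, 8.0, 2.0],
]
X_reduced, names_reduced = correlation_filter(X, names)
```

The filter assumes that constant columns have already been removed, since the correlation is undefined for them.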

In this embodiment, regression analysis can be performed with a machine learning or artificial intelligence technique using the molecular descriptors (MDs) that remain after the outliers are removed. FIG. 6 is a graph of the regression analysis performed using, as the training dataset, the 1566 molecules with 217 molecular descriptors that remained after dimension reduction and outlier removal. The IC50 value of each molecule can be predicted from this graph. FIG. 7 is a table summarizing the actual IC50 values of known BRD4 inhibitors alongside the IC50 values predicted by the regression model. It can be seen that the error between the measured and predicted values is not large.
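The specification does not name the regression model used. As a minimal sketch, ordinary least squares is used below as a stand-in for the machine learning or artificial intelligence technique, together with a root-mean-square error to compare measured and predicted values. All function names and data are hypothetical; the target values are constructed so that an exact linear fit exists.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_linear(X, y):
    """Ordinary least squares via the normal equations; returns [intercept, w1, ...]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, X):
    return [coef[0] + sum(w * v for w, v in zip(coef[1:], row)) for row in X]


def rmse(y_true, y_pred):
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


# Hypothetical training data: two descriptors per molecule, and targets
# generated as y = 1 + 2*x1 - x2 so the fit can be checked by eye.
X_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y_train = [1.0, 4.0, 3.0, 6.0, 6.0]

coef = fit_linear(X_train, y_train)
error = rmse(y_train, predict(coef, X_train))
```

In the embodiment the same roles are played by the 217 remaining descriptors as X and the IC50 values as Y. A comparison such as the one in FIG. 7 would normally be made on inhibitors held out from training, which this sketch does not show.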

Claims

A method for new drug development using artificial intelligence, comprising:
selecting a ligand that binds to a target protein;
decomposing the ligand into molecular descriptors (MDs) that represent the ligand's intrinsic material properties;
obtaining the per-ligand distribution of each decomposed molecular descriptor in a bioassay;
removing outliers from the decomposed molecular descriptors (MDs); and
performing regression analysis with a machine learning or artificial intelligence technique using the molecular descriptors (MDs) that remain after the outliers are removed.

The method for new drug development using artificial intelligence according to claim 1, further comprising a dimension reduction step applied to the data of the molecular descriptors before the step of removing the outliers.

The method for new drug development using artificial intelligence according to claim 1, wherein the step of removing the outliers removes feature outliers and molecular outliers.