KR20220094179A

KR20220094179A - Method and apparatus for predicting of mechanism of actions of novel compounds based on transcriptome phenotype

Info

Publication number: KR20220094179A
Application number: KR1020210190080A
Authority: KR
Inventors: 박성준; 강재우; 장광훈; 이상훈
Original assignee: 고려대학교 산학협력단
Priority date: 2020-12-28
Filing date: 2021-12-28
Publication date: 2022-07-05

Abstract

According to an embodiment of the present invention, a transcriptome phenotype-based apparatus for predicting a drug action mechanism comprises: a communication module; a memory in which a drug action mechanism prediction program is stored; and a processor for executing the drug action mechanism prediction program. The program inputs an embedding vector for a target substance into a learning model built so that the embedding vector for a compound and the embedding vector for the transcriptome phenotype feature information induced by each compound lie in the same vector space, and outputs a ranking of drug action mechanisms matching at least one piece of transcriptome phenotype feature information output by the learning model.

Description

Transcriptome phenotype-based drug mechanism prediction apparatus and method

The present invention relates to a method and apparatus for predicting the drug action mechanism of a target chemical substance based on a transcriptome phenotype.

New drugs are developed through a process consisting of a drug discovery stage and a drug development stage. The discovery stage includes target identification, candidate substance design, efficacy measurement, and drug candidate selection. The development stage includes safety evaluation and clinical trials of drug candidates. Commercializing a drug through these two stages takes considerable time and money, yet the success rate is known to be low.

In the drug development pipeline, it is very important to identify a target protein suited to a disease and to find molecules that bind to that target. Once a target for a disease is identified, compounds capable of binding to it are found through high-throughput screening, and structural analogues of target-binding drugs are also selected as drug candidates. Roughly 5,000 to 10,000 or more drug candidates are selected in this way, but the success rate from experiment and validation through to market is less than 0.02%, so developing a new drug requires a great deal of cost and time.

The traditional approach to finding new drug candidates in the discovery stage is target-centric. It proceeds through target protein discovery, which identifies the key factors involved in disease onset; hit discovery, which searches for compounds that can physically bind to the target protein and inhibit its function; and lead optimization, which structurally optimizes the hits found earlier. From the drug candidates discovered for the target protein, drugs that are effective at the cell, tissue, and organism levels are finally selected through the development stage. If, besides the target protein, there are proteins that are structurally similar to it or that have a similar drug-binding site (binding pocket), the developed drug may unintentionally bind to those proteins and produce an unintended effect (off-target effect), causing drug toxicity and side effects. If the proteins and pathways that a drug affects, that is, its mechanism of action, can be predicted, the drug structure can be designed during optimization to minimize such unintended effects.

However, the target-centric drug development process has the following limitations: 1) a target protein hypothesis is indispensable; 2) even when a target protein hypothesis is found, drug development may be difficult if the target protein has a structure to which compounds cannot bind (an undruggable target); and 3) even when the target protein has a structure to which compounds can bind, it is difficult to verify the enormous number of compounds experimentally, and deriving candidate substances takes a considerable time of about 5.5 years or more.

In contrast, drug development based on the transcriptome phenotype first finds drugs that can suppress a disease at the level of the transcriptome phenotype. It therefore has the advantage that drug development can begin even for diseases without a hypothesis, such as rare and intractable diseases. In this process, once a hit substance has been found, its mechanism must be elucidated, namely which proteins and pathways the substance targeted in order to suppress the disease.

Accordingly, the present invention proposes a method for predicting drug action mechanisms that can be used to identify the protein and pathway targets of novel compounds.

Korean Patent Laid-Open Publication No. 10-2018-0058648 (Title of the invention: Method and apparatus for discovering new drug candidates targeting a non-structural-structural transition site)

To solve the problems described above, an object of the present invention is to provide, according to an embodiment, an apparatus and method for predicting a drug action mechanism that can output the drug action mechanism of a target compound through a learning model built on information about compound structures and transcriptome phenotype feature information.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for solving the above technical problem, an apparatus for predicting a drug action mechanism according to an embodiment of the present invention includes: a communication module; a memory storing a drug action mechanism prediction program; and a processor that executes the drug action mechanism prediction program. The program inputs an embedding vector for a target substance into a learning model built so that the embedding vector for a compound and the embedding vector for the transcriptome phenotype feature information induced by each compound lie in the same vector space, and outputs a ranking of drug action mechanisms matching at least one piece of transcriptome phenotype feature information output by the learning model.

In addition, a transcriptome phenotype-based method for predicting a drug action mechanism according to another embodiment of the present invention includes: providing a learning model built so that the embedding vector for a compound and the embedding vector for the transcriptome phenotype feature information induced by each compound lie in the same vector space; when an embedding vector for a target substance is input to the learning model, outputting, by the learning model, at least one piece of transcriptome phenotype feature information having high similarity to the input embedding vector of the target substance; and outputting a ranking of drug action mechanisms matching the transcriptome phenotype feature information.

According to the above-described means for solving the problem, the time and cost consumed in the discovery stage of finding new drug candidates can be greatly reduced.

In addition, drug development is possible even when no target protein hypothesis exists, or when the target protein is known but a substance that actually binds to it cannot be made.

In addition, the protein and pathway targets of the discovered drug candidates can be identified, and the identified targets can be used to carry out lead optimization in the transcriptome phenotype-based drug development process.

FIG. 1 is a diagram illustrating the configuration of an apparatus for predicting a drug action mechanism according to an embodiment of the present invention.
FIG. 2 is a flowchart for explaining the operation of a method for predicting a drug action mechanism according to an embodiment of the present invention.
FIG. 3 shows the configuration of a learning model for predicting a drug action mechanism according to an embodiment of the present invention.
FIG. 4 shows the configuration of an encoder for building a learning model according to an embodiment of the present invention.
FIG. 5 shows the configuration of a GP feature used when building a learning model according to an embodiment of the present invention.
FIG. 6 shows a drug action mechanism ranking output process according to an embodiment of the present invention.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them.

Throughout this specification, when a member is said to be located "on" another member, this includes not only the case where the member is in contact with the other member but also the case where another member is present between the two.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the configuration of an apparatus for predicting a drug action mechanism according to an embodiment of the present invention.

As shown, the drug action mechanism prediction apparatus 100 may include a communication module 110, a memory 120, a processor 130, and a database 140. The apparatus 100 is built on a computing device and further includes a power supply unit, various input and output devices, and the like, which are not shown.

The communication module 110 may exchange, with an external computing device, data on the chemical structure of a compound and the matching data on the transcript abundance feature information induced by that compound. The communication module 110 may be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, over a wired or wireless connection with other network devices.

The memory 120 stores the drug action mechanism prediction program. The program provides a learning model built so that the embedding vector for the chemical structure of a compound and the embedding vector for the transcriptome phenotype feature information induced by each compound lie in the same vector space, inputs an embedding vector for the chemical structure of a target substance into the learning model, and outputs a ranking of drug action mechanisms matching at least one piece of transcriptome phenotype feature information output by the learning model. The query input to the program may be data on the chemical structure of a novel substance. The program may additionally perform the logic for building the learning model, or a model update process that runs a new round of training on an already built learning model.

The memory 120 also stores an operating system for running the drug action mechanism prediction apparatus 100 and the various kinds of data generated while the drug action mechanism prediction program executes.

Here, the memory 120 collectively refers to non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices, which require power to retain stored information.

The memory 120 may also temporarily or permanently store data processed by the processor 130. In addition to volatile storage that requires power to retain stored information, the memory 120 may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto.

The processor 130 executes the program stored in the memory 120 and controls the overall process involved in executing the drug action mechanism prediction program. Each operation performed by the processor 130 is described in more detail below.

The processor 130 may include any kind of device capable of processing data. For example, it may refer to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program. Examples of such hardware-embedded data processing devices include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

The database 140 stores or provides the data required by the drug action mechanism prediction apparatus under the control of the processor 130. The database 140 may be included as a component separate from the memory 120, or may be built in a partial area of the memory 120.

FIG. 2 is a flowchart for explaining the operation of a method for predicting a drug action mechanism according to an embodiment of the present invention.

First, the drug action mechanism prediction apparatus 100 provides a learning model built so that the embedding vector for the chemical structure of a compound and the embedding vector for the transcriptome phenotype feature information induced by each compound lie in the same vector space (S210).

Such a learning model may consist of a multi-layer artificial neural network that learns the chemical structure information of compounds and a multi-layer artificial neural network that learns the transcript abundance feature information.

FIG. 3 shows the configuration of a learning model for predicting a drug action mechanism according to an embodiment of the present invention, FIG. 4 shows the configuration of an encoder for building the learning model, and FIG. 5 shows the configuration of a GP feature used when building the learning model.

As shown, the technical feature of the learning model is that it places the embedding vector for the chemical structure of a compound and the embedding vector for the transcript abundance feature information produced when that compound is administered as a drug in the same vector space.

First, 2048-bit ECFP (extended-connectivity fingerprint) data may be used to represent the chemical structure of a compound. This is binary data representing the chemical structure of an individual drug or compound, in which each bit indicates the presence or absence of a chemical substructure. ECFP data of this kind is publicly available.
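As a rough illustration of the 2048-bit representation, the sketch below folds a set of integer substructure identifiers into a fixed-length bit vector by taking each identifier modulo the vector length. The identifiers and the helper name `fold_to_bits` are hypothetical and exist only for this example; a real ECFP would be derived from the molecular graph by a cheminformatics toolkit, which is not shown here.

```python
N_BITS = 2048  # fingerprint length stated in the description


def fold_to_bits(substructure_ids, n_bits=N_BITS):
    """Fold integer substructure identifiers into a binary vector.

    Each identifier sets the bit at position (identifier mod n_bits),
    so two different identifiers may collide on the same bit.
    """
    bits = [0] * n_bits
    for ident in substructure_ids:
        bits[ident % n_bits] = 1
    return bits


# Hypothetical identifiers standing in for hashed circular substructures.
example_ids = [98513984, 2245384272, 864662311, 2048, 4096]
x_str = fold_to_bits(example_ids)

# Indices of the substructure bits that are switched on.
on_bits = [i for i, bit in enumerate(x_str) if bit]
```

The resulting list `x_str` has the same shape as the encoder input described in this section: one 0/1 entry per bit position.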

The transcriptome phenotype feature information is feature information representing the total RNA expressed in response to a compound, and it can be collected from several known datasets. For example, the LINCS (Library of Integrated Network-based Cellular Signatures) L1000 dataset may be used, and in particular the feature information of its 978 landmark genes.

An encoder artificial neural network is then applied, as shown in FIG. 4, to the data representing the chemical structure of the compound and to the data representing the transcriptome phenotype feature information, producing an embedding vector for each.

Each encoder artificial neural network may be built from a plurality of multilayer perceptron (MLP) layers. For example, the first encoder artificial neural network (f_str), which embeds the chemical structure of a compound, may include four MLP layers that embed the 2048 input values (X_str) through 512-256-256 dimensions, producing an embedding vector (Z_str) that represents the chemical structure of the compound.

Likewise, the second encoder artificial neural network (f_sig), which embeds the transcriptome phenotype feature information, may include four MLP layers that embed the 978 input values (X_sig) through 512-512-256-256 dimensions, producing an embedding vector (Z_sig) that represents the transcriptome phenotype feature information.
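The two encoders can be sketched as plain feed-forward networks. The following is a minimal, untrained illustration using only the Python standard library; the layer widths are taken from the text as listed, while the ReLU activation, the uniform weight initialization, and the names `init_mlp` and `encode` are assumptions introduced for this example.

```python
import random

STR_LAYERS = [2048, 512, 256, 256]      # f_str: widths as listed in the text
SIG_LAYERS = [978, 512, 512, 256, 256]  # f_sig: widths as listed in the text


def init_mlp(sizes, seed):
    """Create one (weights, biases) pair per consecutive pair of widths."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = (1.0 / n_in) ** 0.5
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def encode(layers, x):
    """Forward pass: ReLU after hidden layers, linear output (assumed)."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


f_str = init_mlp(STR_LAYERS, seed=0)
f_sig = init_mlp(SIG_LAYERS, seed=1)

rng = random.Random(2)
x_str = [rng.randint(0, 1) for _ in range(2048)]  # stand-in fingerprint
x_sig = [rng.gauss(0.0, 1.0) for _ in range(978)]  # stand-in signature

z_str = encode(f_str, x_str)  # compound embedding
z_sig = encode(f_sig, x_sig)  # signature embedding
```

Both outputs have 256 dimensions, which is what allows the two kinds of embedding to be compared in a single vector space.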

Meanwhile, as shown in FIG. 5, gene perturbation information may additionally be included in the transcriptome phenotype feature information and embedded. The gene perturbation information includes information obtained by removing a specific gene (knock-out) or suppressing its expression (knock-down). In this way, the gene perturbation information induced by a compound can be produced as an embedding vector.

For example, if an embedding vector for an EGFR inhibitor is input, the gene perturbation information it induces is learned as an embedding vector, and the two are placed close to each other in the same vector space. The gene perturbation information is also prepared as 978-bit data and can be used to generate an embedding vector.

To build the learning model, the embedding vector for the chemical structure of a compound and the embedding vector for the transcript abundance feature information produced when that compound is administered as a drug are placed in the same vector space, and a triplet loss function is used as the loss function. In this loss function, the embedding vector representing the transcriptome phenotype feature information is set as the anchor vector (Z_sig), the embedding vector of a compound that induces the transcriptome phenotype feature corresponding to the anchor is set as the positive vector (Z_str^p), and the embedding vector of a compound that does not induce that feature is set as the negative vector (Z_str^n). The distance between the anchor vector and the positive vector is minimized, and the distance between the anchor vector and the negative vector is maximized. The specific form of the function is given in Equation 1.

[Equation 1]
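A triplet loss is commonly written in the following form. It is offered for orientation only and should not be read as the exact Equation 1 of this document: the margin α and the conversion of cosine similarity into a distance d are assumptions.

```latex
\mathcal{L} = \max\bigl(0,\; d(Z_{sig}, Z_{str}^{p}) - d(Z_{sig}, Z_{str}^{n}) + \alpha\bigr),
\qquad
d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```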

Here, the distance between vectors is computed using cosine similarity.
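The loss described above can be sketched in a few lines. This is an illustrative version under stated assumptions: the distance is taken as one minus the cosine similarity, and the margin value of 0.2 and the function names are hypothetical rather than taken from the document.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance derived from cosine similarity (assumed to be 1 - cos)."""
    return 1.0 - cosine_similarity(u, v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative
    by at least `margin`; positive otherwise."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


z_sig = [1.0, 0.0]      # anchor: signature embedding
z_str_pos = [2.0, 0.0]  # compound that induces the signature
z_str_neg = [0.0, 1.0]  # compound that does not

loss_separated = triplet_loss(z_sig, z_str_pos, z_str_neg)
loss_swapped = triplet_loss(z_sig, z_str_neg, z_str_pos)
```

With these toy vectors the correctly ordered triplet incurs no loss, while swapping the positive and negative compounds produces a positive loss, which is the signal that training would use to pull matching pairs together.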

With this configuration, a learning model can be built in which the embedding vector for a compound and the embedding vector for the transcriptome phenotype feature information induced by that compound lie in the same vector space. Accordingly, when information on the structure of a specific compound is input, the transcriptome phenotype feature information it induces can be predicted or inferred.

Referring again to FIG. 2, an embedding vector for a target substance is input to the learning model built through the above process, and the learning model outputs at least one piece of transcriptome phenotype feature information having high similarity to the embedding vector of the target substance (S220).

As described above, in the learning model of the present invention the embedding vectors for compounds and the embedding vectors for transcriptome phenotype feature information are built in a single space. Therefore, when an embedding vector for a drug or other compound is supplied as input, at least one piece of the most closely related transcriptome phenotype feature information can be output. The output transcriptome phenotype feature information may include the gene perturbation information described above.

A connectivity score may be used to calculate the similarity, and it can be computed with the following equation.

[Equation 2]
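Judging from the description that follows, the connectivity score appears to be the cosine similarity between the two embedding vectors. A plausible form, stated here as an assumption rather than as the document's exact Equation 2, is:

```latex
\mathrm{connectivity}(c, g) = \cos(Z_c, Z_g) = \frac{Z_c \cdot Z_g}{\lVert Z_c \rVert \, \lVert Z_g \rVert}
```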

Here, the embedding vector Z_c represents the embedding vector of a compound, and the embedding vector Z_g represents the transcriptome phenotype feature information. The larger their cosine similarity, the higher the connectivity score is taken to be. For example, if an embedding vector for an EGFR inhibitor is input, the EGFR KD (knock-down) gene perturbation, which has high similarity to it, is output as the transcriptome phenotype feature information.

Next, a ranking of drug action mechanisms matching the transcriptome phenotype feature information output by the learning model is output (S230).

The transcriptome phenotype feature information output in the preceding step (S220) is sorted in descending order of similarity, and pathway analysis is performed on each item.
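The retrieval and sorting of steps S220 and S230 can be sketched as follows. The example assumes the connectivity score is the cosine similarity between embeddings; the three-dimensional vectors, the signature labels, and the function names are hypothetical placeholders, not data from the document.

```python
import math


def connectivity(z_c, z_g):
    """Cosine similarity between a compound and a signature embedding."""
    dot = sum(a * b for a, b in zip(z_c, z_g))
    norm_c = math.sqrt(sum(a * a for a in z_c))
    norm_g = math.sqrt(sum(b * b for b in z_g))
    return dot / (norm_c * norm_g)


def rank_signatures(z_c, signature_embeddings, top_k=None):
    """Return (name, score) pairs sorted by descending connectivity."""
    scored = [(name, connectivity(z_c, z_g))
              for name, z_g in signature_embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy signature embeddings keyed by hypothetical perturbation labels.
signatures = {
    "TP53_KD": [0.0, 1.0, 0.0],
    "EGFR_KD": [0.9, 0.1, 0.0],
    "MYC_KD": [-1.0, 0.0, 0.2],
}
query = [1.0, 0.0, 0.0]  # stand-in embedding of the target compound

ranking = rank_signatures(query, signatures)
top_names = [name for name, _ in ranking]
```

In this toy setup the signature pointing in nearly the same direction as the query comes first, and the one pointing away from it comes last.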

FIG. 6 shows a drug action mechanism ranking output process according to an embodiment of the present invention.

Once the transcriptome phenotype feature information most closely related to the target substance has been determined, a pathway analysis can be performed on that basis to recommend ranked drug action mechanisms.

The pathway DB contains various data on the mechanisms of action of drugs, and GSEA (Gene Set Enrichment Analysis) is a known method for analyzing it. GSEA is a technique that simultaneously evaluates whether the genes related to one biological characteristic are expressed with a statistically significant difference between two conditions (for example, normal cells versus cancer cells) and whether their expression patterns are similar to one another.
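To make the enrichment idea concrete, the sketch below computes a simplified, unweighted running-sum enrichment score in the style of GSEA and uses it to order pathways. Full GSEA additionally weights hits by their ranking metric and assesses significance by permutation, neither of which is shown. The gene list, the pathway names, and the pathway memberships are toy placeholders and not curated biology.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked gene list.

    Walk down the list, stepping up at members of gene_set and down at
    non-members; return the running-sum value farthest from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def rank_pathways(ranked_genes, pathway_db):
    """Order pathways by descending enrichment score."""
    scored = [(name, enrichment_score(ranked_genes, genes))
              for name, genes in pathway_db.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Genes ordered by descending similarity to the query compound (toy data).
ranked_genes = ["EGFR", "ERBB2", "KRAS", "TP53", "MYC", "BRCA1"]
pathway_db = {
    "pathway_B": {"MYC", "BRCA1"},
    "pathway_A": {"EGFR", "ERBB2"},
}

pathway_ranking = rank_pathways(ranked_genes, pathway_db)
```

A pathway whose genes cluster at the top of the ranked list receives a large positive score and is reported first; one whose genes cluster at the bottom receives a negative score.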

Through this process, the present invention can infer, for a newly input compound, the highly related transcriptome phenotype features or gene perturbation information, and it outputs, in rank order, the drug action mechanisms in which the corresponding genes are enriched.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although the method and system of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present application is illustrative, and those of ordinary skill in the art to which the present application pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: drug mechanism-of-action prediction device
110: communication module
120: memory
130: processor
140: database

Claims

An apparatus for predicting a mechanism of action of a drug based on a transcriptome phenotype, comprising:
a communication module;
a memory in which a drug action mechanism prediction program is stored; and
a processor for executing the drug action mechanism prediction program,
wherein the drug action mechanism prediction program inputs an embedding vector for a target substance into a learning model constructed so that an embedding vector for a compound and an embedding vector for transcriptome phenotype feature information induced by each compound are located in the same vector space, and outputs a ranking of drug action mechanisms matching at least one piece of transcriptome phenotype feature information output by the learning model.

The apparatus of claim 1,
wherein the learning model is configured, through a triplet loss function, to minimize the distance between a first embedding vector representing the transcriptome phenotype feature information and the embedding vector of a compound that induces the transcriptome phenotype feature, and to maximize the distance between the first embedding vector and the embedding vector of a compound that does not induce the transcriptome phenotype feature.

The apparatus of claim 1,
wherein the transcriptome phenotype feature information further includes gene perturbation information.

A method for predicting a mechanism of action of a drug based on a transcriptome phenotype, comprising:
providing a learning model constructed so that an embedding vector for a compound and an embedding vector for transcriptome phenotype feature information induced by each compound are located in the same vector space;
inputting an embedding vector for a target substance into the learning model, whereby the learning model outputs at least one piece of transcriptome phenotype feature information having a high degree of similarity to the embedding vector for the target substance; and
outputting a ranking of drug action mechanisms matching the transcriptome phenotype feature information.

The method of claim 4,
wherein the learning model is configured, through a triplet loss function, to minimize the distance between a first embedding vector representing the transcriptome phenotype feature information and the embedding vector of a compound that induces the transcriptome phenotype feature, and to maximize the distance between the first embedding vector and the embedding vector of a compound that does not induce the transcriptome phenotype feature.

The method of claim 4,
wherein the transcriptome phenotype feature information further includes gene perturbation information.

A computer-readable recording medium in which a program for performing the method according to any one of claims 4 to 6 is recorded.