KR102540558B1

KR102540558B1 - Method and apparatus for new drug candidate discovery

Info

Publication number: KR102540558B1
Application number: KR1020200177206A
Authority: KR
Inventors: 강재우; 전민지; 장부루; 박정수; 박성준; 김선규
Original assignee: 고려대학교 산학협력단
Priority date: 2019-12-31
Filing date: 2020-12-17
Publication date: 2023-06-12
Also published as: KR20210086495A

Abstract

A new drug candidate output device according to an embodiment of the present invention includes a communication module; a memory storing a new drug candidate output program; and a processor that executes the new drug candidate output program. The program provides a drug learning model in which embedding vectors for the chemical structures of chemical compounds and embedding vectors for the transcript-level change information induced by each compound lie in the same vector space. It then either outputs the transcript-level change information that matches the embedding vector of the chemical structure of a new substance input to the drug learning model, or outputs information on one or more drugs that match target transcript-level change information input to the drug learning model.

Description

New drug candidate output method and apparatus {METHOD AND APPARATUS FOR NEW DRUG CANDIDATE DISCOVERY}

The present invention relates to a method and apparatus for outputting new drug candidates, in which the candidates are derived using a transcriptome phenotype.

New drugs are developed through a process consisting of a drug discovery phase and a drug development phase. The discovery phase includes target identification, candidate design, efficacy measurement, and drug candidate selection. The development phase includes safety evaluation and clinical trials of the drug candidates. Commercializing a drug through these two phases takes considerable time and money, yet the success rate is known to be low.

In the drug development pipeline, it is very important to identify a target protein suited to the disease and to find molecules that bind to that target. Once a target for a disease is identified, compounds capable of binding to it are found through high-throughput screening, and structural analogs of drugs that bind to the target are also selected as drug candidates.

Roughly 5,000 to 10,000 or more compounds are selected as drug candidates in this way, but fewer than 0.02% of them pass testing and validation and reach the market, so new drug development is costly and time-consuming.

The new drug development process is therefore not only time-consuming and costly but also difficult, and there is no guarantee that a drug under development will actually succeed. Moreover, R&D costs in the pharmaceutical industry are rising, and productivity, measured by relating R&D spending to the number of newly approved drugs, has declined steadily every year since the 1950s. Because the success of new drug development depends on candidate selection, choosing candidates with a high probability of success is important for raising development productivity.

The traditional approach to finding new drug candidates in the discovery phase is target-centric. It proceeds through target protein discovery, which identifies the key factors involved in the onset of a disease; hit discovery, which searches for compounds that can physically bind to the target protein and inhibit its function; and lead optimization, which structurally optimizes the hits found earlier. From the drug candidates derived from the target protein, the development phase finally selects drugs that are effective at the cell, tissue, and organism levels.

However, the target-centric drug development process has the following limitations: 1) a target protein hypothesis is indispensable; 2) even when a target protein hypothesis is found, drug development can be difficult if the target protein has a structure to which compounds cannot bind (an undruggable target); and 3) even when the target protein has a structure to which compounds can bind, it is difficult to verify the enormous number of possible compounds experimentally, and deriving a candidate takes a considerable time of about 5.5 years or more. Specific examples follow.

For example, the existing target-centric approach first sets a target protein and then searches for compounds that bind to it. If the disease is not understood well enough to identify a target protein, the process of developing a drug to treat it cannot even begin. In Alzheimer's disease, for instance, no clear target protein acting as a cause of the disease has been identified, which makes the search for new drug candidates difficult.

Drug development can also be difficult when a target protein that plays an important role in treating the disease is known, but a substance that actually binds to it cannot be made. KRAS, widely known as a target protein for lung and colorectal cancer, has no binding site to which a drug can bind, so no KRAS inhibitor currently exists. This means that, however well the disease comes to be understood, a new drug for it cannot be developed with the existing target-centric approach.

Korean Patent Application Publication No. 10-2018-0058648 (Title of invention: Method and apparatus for discovering new drug candidates targeting a disorder-to-order transition site)

To solve the problems described above, an object of the present invention is to provide, according to an embodiment, a new drug candidate output device and method capable of outputting new drug candidates through a drug learning model trained on transcript-level change data and data on each chemical compound.

However, the technical problems to be solved by this embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above object, a new drug candidate output device according to an embodiment of the present invention includes a communication module; a memory storing a new drug candidate output program; and a processor that executes the new drug candidate output program. The program provides a drug learning model in which embedding vectors for the chemical structures of chemical compounds and embedding vectors for the transcript-level change information induced by each compound lie in the same vector space. It then either outputs the transcript-level change information that matches the embedding vector of the chemical structure of a new substance input to the drug learning model, or outputs information on one or more drugs that match target transcript-level change information input to the drug learning model.

In addition, a method of building a learning model for discovering new drug candidates according to another embodiment of the present invention includes: building a drug learning model in which embedding vectors for the chemical structures of chemical compounds and embedding vectors for the transcript-level change information induced by each compound are placed in the same vector space; and performing iterative training so that, when data on transcript levels before administration of a chemical compound are input to the drug learning model, the difference between the transcript-level data inferred by the model and the actual transcript-level data after administration of the compound is minimized.

In addition, a new drug candidate output method according to another embodiment of the present invention includes: providing a drug learning model in which embedding vectors for the chemical structures of chemical compounds and embedding vectors for the transcript-level change information induced by each compound lie in the same vector space; inputting, to the drug learning model, an embedding vector for the chemical structure of a new substance or target transcript-level change information; and outputting, by the drug learning model, the transcript-level change information that matches the embedding vector of the chemical structure of the new substance, or information on one or more drugs that match the target transcript-level change information.

According to the means for solving the problem described above, the time and cost consumed in the discovery phase, in which new drug candidates are identified, can be greatly reduced.

In addition, new drug development becomes possible even when no target protein hypothesis exists, or when the target protein is known but a substance that actually binds to it cannot be made.

FIG. 1 is a diagram showing the configuration of a new drug candidate output device according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining the operation of a new drug candidate output device according to an embodiment of the present invention.
FIG. 3 is a flowchart for explaining the operation of a new drug candidate output method according to an embodiment of the present invention.

Embodiments of the present application are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the present application clearly, parts of the drawings unrelated to the description are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between.

Throughout this specification, when a member is said to be located "on" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two.

An embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a new drug candidate output device according to an embodiment of the present invention.

As illustrated, the new drug candidate output device 100 may include a communication module 110, a memory 120, a processor 130, and a database 140. The new drug candidate output device 100 is built on a computing device and further includes a power supply unit and various input and output devices, which are not shown.

The communication module 110 may exchange, with an external computing device, data on the chemical structures of chemical compounds and the matching data on the transcript-level change information induced by those compounds. The communication module 110 may be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, over wired or wireless connections with other network devices.

The memory 120 stores the new drug candidate output program. The program provides a drug learning model in which embedding vectors for the chemical structures of chemical compounds and embedding vectors for the transcript-level change information induced by each compound lie in the same vector space, and it outputs new drug candidates based on a query received through an input device, the communication module 110, or the like. The query may be data on the chemical structure of a new substance or target transcript-level change information.
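The query step described above can be pictured as a nearest-neighbor lookup in the shared embedding space. The following is a minimal sketch only, assuming that matching is done by ranking candidates on cosine similarity; the function names and the toy three-dimensional embeddings are hypothetical and do not come from the specification.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so that dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def top_k_matches(query: np.ndarray, candidates: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k candidates most similar to the query."""
    q = l2_normalize(query[None, :])[0]
    c = l2_normalize(candidates)
    scores = c @ q                      # cosine similarity per candidate
    return np.argsort(-scores)[:k]      # highest similarity first

# Hypothetical embeddings of four known drugs (chemical-structure side).
drug_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [-1.0, 0.0, 0.0],
])
# Hypothetical embedding of a target transcript-level change (query side).
target_change_embedding = np.array([0.9, 0.1, 0.0])

ranked = top_k_matches(target_change_embedding, drug_embeddings, k=2)
print(ranked)
```

The same function would serve the opposite query direction (a new chemical structure as the query, stored transcript-level change embeddings as candidates), since both kinds of vector live in one space.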

In addition, the new drug candidate output program may further perform the logic that builds the drug learning model, or a model update process that runs a new round of training on an already built drug learning model.

The memory 120 also stores the operating system that runs the new drug candidate output device 100 and the various kinds of data generated while a missing data prediction program is executed.

Here, the memory 120 refers collectively to non-volatile storage, which retains stored information even when no power is supplied, and volatile storage, which requires power to retain stored information.

The memory 120 may also temporarily or permanently store data processed by the processor 130. In addition to volatile storage that requires power to retain stored information, the memory 120 may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto.

The processor 130 executes the program stored in the memory 120 and controls the overall process involved in running the new drug candidate output program. Each operation performed by the processor 130 is described in more detail below.

The processor 130 may include any kind of device capable of processing data. For example, it may be a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions in a program. Examples of such hardware-embedded data processing devices include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

Under the control of the processor 130, the database 140 stores or provides the data needed by the new drug candidate output device. The database 140 may be included as a component separate from the memory 120 or may be built in a region of the memory 120.

FIG. 2 is an exemplary diagram for explaining the operation of a new drug candidate output device according to an embodiment of the present invention, and FIG. 3 is a flowchart for explaining the operation of a new drug candidate output method according to an embodiment of the present invention.

First, the method of building the drug learning model of the new drug candidate output device 100 is described (S310).

The drug learning model of the new drug candidate output device 100 may consist of a multi-layer artificial neural network that learns transcript-level change information and a multi-layer artificial neural network that learns the chemical structure information of chemical compounds.

Each layer may be a basic perceptron layer (a fully connected layer), and a graph neural network (GNN) or the like may be used to generate embedding vectors centered on the chemical structure of a compound. For example, the multi-layer network that learns transcript-level change information and the multi-layer network that learns chemical structure information may each be implemented as a fully connected artificial neural network with four layers, both producing embedding vectors of the same dimension as output.
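The two-encoder layout can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the weights are random and untrained, the chemical structure is represented by a fixed-length binary vector instead of a GNN, and the hidden sizes, `EMBED_DIM`, and `STRUCT_DIM` are hypothetical. The 978-gene input width follows the example used elsewhere in this description, and the final L2 normalization reflects the unit-hypersphere option mentioned there.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Create (weight, bias) pairs for a fully connected network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply ReLU after every layer except the last, then L2-normalize the output."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

EMBED_DIM = 64     # assumed shared embedding size
N_GENES = 978      # gene count taken from the example in this description
STRUCT_DIM = 2048  # assumed length of a fixed-size structure vector

# Four fully connected layers per encoder, both ending in EMBED_DIM outputs.
change_encoder = init_mlp([N_GENES, 512, 256, 128, EMBED_DIM])
structure_encoder = init_mlp([STRUCT_DIM, 512, 256, 128, EMBED_DIM])

change_batch = rng.normal(size=(5, N_GENES))
struct_batch = rng.integers(0, 2, size=(5, STRUCT_DIM)).astype(float)

z_change = forward(change_encoder, change_batch)
z_struct = forward(structure_encoder, struct_batch)
print(z_change.shape, z_struct.shape)
```

Because both encoders end in the same output width and the same normalization, their outputs can be compared directly with a dot product.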

The drug learning model of the new drug candidate output device 100 places the embedding vectors for the chemical structures of chemical compounds and the embedding vectors for the transcript-level change information induced by each compound in the same vector space.

Transcript-level change information refers to the change in the expression levels (transcript levels) of multiple genes in a cell before and after drug administration. This change carries information about the effect of the drug on the cell, that is, about the complex interactions between the drug and all proteins in the cell and among the proteins themselves. The degree of change can be defined as the distance from the mean of a Gaussian distribution fitted to each gene's transcript levels (gene expression) before drug administration to that gene's transcript level after administration; the degree of change grows in proportion to this distance. The distance between the two transcript levels can be obtained from the Gaussian-based expression below.

[Equation image not reproduced in this text.]

In this expression, the first symbol denotes the mean of the transcript-level values of gene g before drug administration, the second denotes the sample mean of the transcript-level values of gene g before drug administration, and the third denotes the transcript-level value of gene g after drug administration.

For example, consider training data containing transcript-level change information measured when drugs are administered to cancer cell lines, and assume that it covers 978 representative genes. The transcript-level change information for a given drug then consists of 978 distance values, one per gene, each measuring the distance from the mean of the Gaussian distribution of that gene's transcript level to the gene's transcript level after drug administration.
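As an illustration of the per-gene distance, the sketch below computes one value per gene from pre-treatment samples and a post-treatment profile. Since the exact expression is not reproduced in this text, the scaling is an assumption: the distance from the pre-treatment mean is divided by the pre-treatment sample standard deviation (a z-score). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def transcript_change(pre: np.ndarray, post: np.ndarray) -> np.ndarray:
    """Per-gene distance of post-treatment levels from the pre-treatment mean.

    pre:  array of shape (n_pre_samples, n_genes), levels before administration
    post: array of shape (n_genes,), levels after administration
    Assumption: the distance is scaled by the pre-treatment sample std (z-score).
    """
    mu = pre.mean(axis=0)
    sigma = pre.std(axis=0, ddof=1)
    return (post - mu) / np.clip(sigma, 1e-12, None)

rng = np.random.default_rng(0)
n_genes = 978
pre = rng.normal(loc=5.0, scale=1.0, size=(20, n_genes))  # synthetic control samples

# Synthetic treated profile: identical to the control mean except gene 0,
# which is shifted upward by three control standard deviations.
post = pre.mean(axis=0).copy()
post[0] += 3.0 * pre[:, 0].std(ddof=1)

change = transcript_change(pre, post)
print(change.shape, change[0])
```

The resulting 978-element vector is the kind of input the transcript-side encoder would receive.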

This resolves cases in which new drug development was impossible with the existing target-centric approach. For example, when no disease-related target protein hypothesis exists, drug candidates can be found that drive the gene expression pattern of a patient group toward that of a normal group. Likewise, when a disease-related target protein is known but drug development has been difficult because the protein is undruggable, candidates can be found that induce the gene expression changes caused by knockdown of the target gene.

As illustrated, the drug learning model 210 is built by placing the embedding vector for the chemical structure of a chemical compound and the embedding vector for the transcript-level change information produced when that compound is administered as a drug in the same vector space (for example, the surface of a unit hypersphere).

Because the transcript-level change information before and after drug administration contains all information about the response to the drug, including off-target effects, embedding the change information and the drug that induces it at the same location in the same vector space makes it possible to capture abstract characteristics of the drug's effect. Because the embedding module is based on drug effects, even a drug with a novel chemical structure can be identified as a new drug candidate on the basis of its drug properties. And because the embedding module takes the chemical structure of a drug as input, it can produce effect-based embeddings, without any separate process or cost, not only for compounds whose structures are already known but also for any synthesizable compound that becomes known in the future.

In addition, the drug learning model of the present invention builds the embedding vector for the chemical structure of a chemical compound and the embedding vector for the transcript-level change information produced when that compound is administered as a drug in the same vector space, and uses a triplet loss function as the loss. The loss function uses three embedding vectors: f^a, representing the transcript-level change information produced by administering a particular compound; f^p, representing a compound that induces that transcript-level change; and f^n, representing a compound that does not induce it.

Triplet_loss(Anchor, Positive, Negative)

That is, the triplet loss is computed from the distance between the transcript-level change embedding f^a and the embedding f^p of the compound that induces the change, minus the distance between f^a and the embedding f^n of a compound that does not induce the change.

Here, α is a margin value set by the model designer, i is the training iteration or sample index, and N is the number of pairs of a chemical compound and the transcript-level change information it induces. For example, if the training data contain 310,114 transcript-level change records measured for a total of 21,220 drugs and 82 cell lines, N can be 310,114, the total number of transcript-level change records.
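For orientation, a triplet loss consistent with the description and with the symbols α, i, and N can be written in the standard hinge form below. This is a reconstruction under assumptions, not the formula as published: the distance symbol d, the summation (rather than an average) over samples, and the clamping at zero are taken from the usual definition of triplet loss and from the statement in this description that the loss can evaluate to 0.

```latex
\mathcal{L}_{\text{triplet}}
  = \sum_{i=1}^{N} \max\Bigl( d\bigl(f_i^{a}, f_i^{p}\bigr) - d\bigl(f_i^{a}, f_i^{n}\bigr) + \alpha,\; 0 \Bigr)
```

Under this form, a sample contributes nothing once the negative is farther from the anchor than the positive by at least the margin α.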

The distance between any two embedding vectors A and B is the negative cosine similarity, that is, $-\dfrac{A \cdot B}{\lVert A \rVert\,\lVert B \rVert}$.

The triplet loss function is constructed to minimize the distance between the embedding vector of the transcript-level change information and the embedding vector of the compound that induces the change, and to maximize the distance to the embedding vector of a compound that does not induce the change.
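A small numerical sketch of this behavior follows. It assumes the standard hinge form of triplet loss summed over samples and uses negative cosine similarity as the distance; the margin value of 0.2, the function names, and the two-dimensional toy vectors are hypothetical.

```python
import numpy as np

def neg_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Negative cosine similarity between matching rows of a and b."""
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return -np.sum(a_n * b_n, axis=-1)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum over samples of max(d(a, p) - d(a, n) + margin, 0)."""
    per_sample = neg_cosine(anchor, positive) - neg_cosine(anchor, negative) + margin
    return np.maximum(per_sample, 0.0).sum()

# Two toy triplets. Rows are samples.
anchor = np.array([[1.0, 0.0], [1.0, 0.0]])     # transcript-level change embeddings
positive = np.array([[1.0, 0.0], [0.0, 1.0]])   # compounds that induce the change
negative = np.array([[-1.0, 0.0], [1.0, 0.0]])  # compounds that do not

# Sample 1 is already well separated and contributes 0.
# Sample 2 has the negative closer than the positive and is penalized.
loss = triplet_loss(anchor, positive, negative)
print(loss)
```

Minimizing this quantity pulls each anchor toward its positive and pushes it away from its negative until the margin is satisfied.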

To further improve the performance of this loss function, a double triplet loss function can be used.

Compared with the triplet loss function described above, a further (rear) term is added: the distance between the embedding vector (f^p) representing the chemical compound that induces the transcript amount change and the embedding vector (f^a) representing the transcript amount change information, minus the distance between f^p and the embedding vector (f^n) representing a chemical compound that does not induce the change.
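The double triplet loss formula itself is not reproduced in this text. Written out from the verbal description, and assuming that both terms share the same hinge form and the same margin α, with d(·,·) as a placeholder symbol for the distance between two embedding vectors, it would look roughly like this:

```latex
L_{\text{double}} = \sum_{i=1}^{N} \Big(
  \big[\, d(f_i^{a}, f_i^{p}) - d(f_i^{a}, f_i^{n}) + \alpha \,\big]_{+}
  + \big[\, d(f_i^{p}, f_i^{a}) - d(f_i^{p}, f_i^{n}) + \alpha \,\big]_{+}
\Big)
```

Here [x]_+ denotes max(0, x). The first bracket is the front term, with f^a as anchor, and the second is the added rear term, with f^p as anchor.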

With the triplet loss function described above, the loss value is calculated as 0 when the distance between the embedding vector (f^a) representing the transcript amount change information and the embedding vector (f^n) representing a chemical compound that does not induce that change is very large. In that case, the distance between f^a and the embedding vector (f^p) representing the chemical compound that induces the change may fail to narrow.

With the double triplet loss function, even if the front term becomes 0, the added rear term makes it possible to narrow the distance between the embedding vector (f^p) representing the chemical compound that induces the transcript amount change and the embedding vector (f^a) representing the transcript amount change information.
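A small numerical sketch can make this concrete. It assumes hinge-form terms with a shared margin and the negative cosine similarity as distance; the function names, the margin of 0.5, and the two-dimensional vectors are all illustrative assumptions.

```python
import numpy as np

def neg_cos(a, b):
    """Negative cosine similarity between two 1-D vectors."""
    return -float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def double_triplet_terms(f_a, f_p, f_n, margin=0.5):
    """Return (front, rear) hinge terms of an assumed double triplet loss."""
    front = max(0.0, neg_cos(f_a, f_p) - neg_cos(f_a, f_n) + margin)  # anchor: f_a
    rear = max(0.0, neg_cos(f_p, f_a) - neg_cos(f_p, f_n) + margin)   # anchor: f_p
    return front, rear

f_a = np.array([1.0, 0.0])    # transcript amount change embedding
f_p = np.array([0.0, 1.0])    # inducing compound, still far from f_a
f_n = np.array([-1.0, 1.0])   # non-inducing compound, far from f_a but close to f_p
front, rear = double_triplet_terms(f_a, f_p, f_n)
```

In this configuration f_n is already far from f_a, so the front term is clipped to zero and contributes no gradient, while the rear term stays positive and therefore still provides a training signal that pulls f_p and f_a together.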

Meanwhile, to further refine the learning model built in this way, an iterative learning step may additionally be performed (220).

That is, iterative learning is performed so that, when data on the transcript amount before administration of a chemical compound is input to the drug learning model, the difference between the transcript amount data inferred by the drug learning model and the transcript amount data after actual administration of the chemical compound is minimized. To this end, a predetermined loss function may be used to update the weights of the learning model so that the difference between the transcript amount data output by the learning model and the transcript amount data after drug administration becomes minimal.
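The weight update described here can be sketched as plain gradient descent. The specification leaves the loss function and the model architecture open, so the mean squared error, the single linear layer `W`, the learning rate, and the random data below are all stand-in assumptions chosen only to keep the example small.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between predicted and measured transcript amounts."""
    return float(np.mean((pred - target) ** 2))

def train_step(W, x_pre, y_post, lr=0.1):
    """One gradient-descent update of a linear stand-in model pred = x_pre @ W."""
    pred = x_pre @ W
    grad = 2.0 * x_pre.T @ (pred - y_post) / pred.size  # gradient of the MSE w.r.t. W
    return W - lr * grad

rng = np.random.default_rng(0)
x_pre = rng.normal(size=(8, 4))    # hypothetical pre-administration transcript amounts
y_post = rng.normal(size=(8, 4))   # hypothetical post-administration transcript amounts
W = np.zeros((4, 4))

loss_before = mse(x_pre @ W, y_post)
for _ in range(50):                # iterative learning
    W = train_step(W, x_pre, y_post)
loss_after = mse(x_pre @ W, y_post)
```

Each iteration moves the weights in the direction that reduces the gap between the model output and the measured post-administration data, which is the behaviour the iterative learning step calls for.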

Next, an embedding vector for the chemical structure of a new substance, or target transcript amount change information, is input as a query to the learning model built in this way (S320).

In the present invention, embedding vectors are obtained for all previously known compounds, and an index is prepared in advance and stored in a DB or the like so that they can be searched quickly. Counting those registered in the Zinc15 DB, there are approximately 1.3 billion known compounds. The system is implemented so that, in addition to these compounds, compounds added in the future can also be indexed. With a similar-vector search system built in this way, the compound most similar to a query vector can be found among many compounds represented as dense vectors. The query vector may be an embedding vector for the chemical structure of a new substance, or an embedding vector for transcript amount change information after drug administration.
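The idea of a pre-built, extensible index over dense compound vectors can be illustrated with a brute-force stand-in. The specification does not name the search system, so the `CompoundIndex` class, its methods, and the compound identifiers below are hypothetical; a collection on the order of a billion vectors would in practice need an approximate nearest-neighbour index rather than this linear scan.

```python
import numpy as np

class CompoundIndex:
    """Brute-force stand-in for a similar-vector search index (illustrative only)."""

    def __init__(self, dim):
        self.ids = []
        self.vectors = np.empty((0, dim))

    def add(self, compound_id, vector):
        """Index one compound; vectors are stored L2-normalized."""
        v = np.asarray(vector, dtype=float)
        self.ids.append(compound_id)
        self.vectors = np.vstack([self.vectors, v / np.linalg.norm(v)])

    def search(self, query, k=1):
        """Return the ids of the k compounds with the highest cosine similarity."""
        q = np.asarray(query, dtype=float)
        scores = self.vectors @ (q / np.linalg.norm(q))
        top = np.argsort(-scores)[:k]
        return [self.ids[i] for i in top]

index = CompoundIndex(dim=3)
index.add("compound_A", [1.0, 0.0, 0.0])
index.add("compound_B", [0.0, 1.0, 0.0])
index.add("compound_C", [0.9, 0.1, 0.0])   # a compound added to the index later
hits = index.search([0.8, 0.2, 0.0], k=2)
```

Because `add` can be called at any time, newly registered compounds become searchable without rebuilding the stored vectors, which mirrors the requirement that future compounds remain indexable.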

For example, to develop a new medical use for a drug that is under development or already on sale, as in drug repositioning, the embedding vector of that drug may be used as the query vector; to select new drug candidates that perform a required function, as in drug screening, transcript amount change information may be used as the query. A similarity search over approximately 1.3 billion compounds takes on the order of a few seconds, so a group of drug candidates can be selected quickly.

Next, as the output for the query vector, the drug learning model outputs a transcript amount change information result matching the embedding vector for the chemical structure of the new substance, or outputs information on one or more drugs matching the target transcript amount change information (S330).

As shown in FIG. 2, when transcript amount change information is input as a query, the drug learning model outputs drug candidates that match it. Likewise, when an embedding vector for the chemical structure of a new substance is input, the drug learning model outputs the matching transcript amount change information result.

Meanwhile, as described above, the present invention can output matching drug candidates based on transcript amount change information. In particular, when a target protein has been determined, the transcript amount change information can be specified from the difference between the transcript amount before the gene of the target protein is knocked out (KO) or knocked down (KD) (the reference transcript amount) and the transcript amount after the gene is knocked out or knocked down (the induced transcript amount). The transcript amount change information obtained in this way can then be input to the drug learning model to output drug candidates.

Alternatively, when transcript amount information is available for a normal group and a target disease group, the transcript amount change information can be specified from the difference between the transcript amount of the disease group (the reference transcript amount) and the transcript amount of the normal group (the induced transcript amount). The transcript amount change information obtained in this way can then be input to the drug learning model to output drug candidates.
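Both ways of specifying the query reduce to a per-gene difference between a reference and an induced transcript amount. The sketch below assumes the sign convention "induced minus reference" and averages over the samples of each group; the function name and the small arrays are hypothetical.

```python
import numpy as np

def transcript_change(reference, induced):
    """Per-gene change, computed as induced minus reference (assumed sign convention)."""
    return np.asarray(induced, dtype=float) - np.asarray(reference, dtype=float)

# Rows are samples, columns are genes (hypothetical values).
disease = np.array([[5.0, 2.0, 1.0],
                    [7.0, 2.0, 3.0]])   # reference transcript amounts (disease group)
normal = np.array([[1.0, 2.0, 4.0],
                   [3.0, 2.0, 6.0]])    # induced transcript amounts (normal group)

query = transcript_change(disease.mean(axis=0), normal.mean(axis=0))
```

With the disease group as reference and the normal group as the induced state, the resulting vector describes the change a drug would need to cause to move the diseased profile toward the normal one. The knock-out or knock-down case works the same way, with the pre-KO/KD amounts as reference and the post-KO/KD amounts as induced.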

Meanwhile, depending on how the values are measured, the variance of the absolute values of the transcript amount changes may differ greatly from the variance of the transcript change values used when training the model. If such values are used as is as input to the drug learning model, the expected results may be hard to obtain because the data differ substantially from the training conditions. To address this, the variance of the transcript amount change information input as a query needs to be brought roughly into line with the variance of the transcript amount change training data used in training the drug learning model.

To this end, a T-test is applied to the reference transcript amount and the induced transcript amount that make up the newly input transcript amount change information, and the genes are ordered by the magnitude of their T-statistic. This is equivalent to ordering the genes of the newly input transcript information by the magnitude of their transcript change. Next, the transcript change values of the training data are mapped, in order of magnitude, onto the genes ordered in this way. For example, the gene whose transcript amount decreased most from the reference to the induced transcript amount is assigned the negative value with the largest absolute value among the transcript change values of the training data. In this way, input data can be generated from the newly received transcript information that preserve the ordering of genes by transcript change while having the same level of variance as the training data values.
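The rank-based adjustment can be sketched in a few lines of NumPy. The sketch assumes a Welch-type T-statistic computed per gene and a training value vector `train_values` with one value per gene; both choices, along with the function names and sample data, are assumptions, since the specification does not fix the T-test variant or how the training values are selected.

```python
import numpy as np

def welch_t(reference, induced):
    """Per-gene Welch t-statistic (induced vs. reference); inputs are samples x genes."""
    m_r, m_i = reference.mean(axis=0), induced.mean(axis=0)
    v_r, v_i = reference.var(axis=0, ddof=1), induced.var(axis=0, ddof=1)
    return (m_i - m_r) / np.sqrt(v_r / len(reference) + v_i / len(induced))

def rank_map(t_stats, train_values):
    """Give the gene with the k-th smallest t-statistic the k-th smallest training value."""
    order = np.argsort(t_stats)
    adjusted = np.empty(len(t_stats))
    adjusted[order] = np.sort(train_values)
    return adjusted

reference = np.array([[10.0, 5.0, 8.0], [12.0, 6.0, 9.0]])
induced = np.array([[2.0, 9.0, 8.0], [4.0, 10.0, 10.0]])
train_values = np.array([0.3, -1.2, 0.1])   # hypothetical training-scale change values

t_stats = welch_t(reference, induced)
adjusted = rank_map(t_stats, train_values)
```

In this example the first gene decreases most strongly and therefore receives the most negative training value, while the gene ordering by change is preserved and the value distribution matches that of `train_values` exactly.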

When the embedding vector of the adjusted transcript amount change information is input, the drug learning model searches the previously built compound embedding database for the compounds with the closest vector values. The distance between vectors may be computed with a general distance function such as the Euclidean distance or the cosine similarity. On this basis, the compounds with the closest vector values are derived as drug candidates that induce the expected transcript amount change.
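The choice of distance function can affect which compounds are returned. The following minimal sketch ranks compound embeddings by either metric; the function name, the `metric` parameter, and the toy vectors are illustrative assumptions rather than part of the specification.

```python
import numpy as np

def nearest_compounds(query, embeddings, k=2, metric="cosine"):
    """Return row indices of the k compound embeddings closest to the query."""
    query = np.asarray(query, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    if metric == "euclidean":
        dist = np.linalg.norm(embeddings - query, axis=1)
    elif metric == "cosine":
        sims = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
        dist = -sims  # higher similarity means smaller distance
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(dist)[:k].tolist()

compounds = np.array([[3.0, 3.0],    # same direction as the query, larger norm
                      [1.0, 0.2]])   # different direction, but nearby in space
query = [1.0, 1.0]

by_cosine = nearest_compounds(query, compounds, k=1, metric="cosine")
by_euclid = nearest_compounds(query, compounds, k=1, metric="euclidean")
```

Cosine similarity ignores vector length and picks the first compound, whereas the Euclidean distance picks the second, so the metric used at search time should match the geometry the embeddings were trained for.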

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. A computer-readable medium may also include computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although the method and system of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description is given by way of example, and a person of ordinary skill in the art to which the present application pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the present application. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present application is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: new drug candidate output device
110: communication module
120: memory
130: processor
140: database

Claims

A new drug candidate output device, comprising:
a communication module;
a memory in which a new drug candidate output program is stored; and
a processor that executes the new drug candidate output program,
wherein the new drug candidate output program provides a drug learning model in which an embedding vector for the chemical structure of a chemical compound and an embedding vector for the transcript amount change information induced by each chemical compound are located in the same vector space, and outputs a transcript amount change information result matching an embedding vector for the chemical structure of a new substance input to the drug learning model, or outputs information on one or more drugs matching target transcript amount change information input to the drug learning model,
wherein the drug learning model has been iteratively trained so that, when data on the transcript amount before administration of the chemical compound is input, the difference between the changed transcript amount data output by the drug learning model and the transcript amount data after actual administration of the chemical compound is minimized, and
wherein the transcript amount change information indicates the difference between the transcript amount before administration of the chemical compound and the transcript amount after administration of the chemical compound.

The new drug candidate output device according to claim 1,
wherein the drug learning model is configured, through a triplet loss function, to minimize the distance between a first embedding vector representing the transcript amount change information and the embedding vector of the chemical compound that induces the corresponding transcript amount change, and to maximize the distance between the first embedding vector and the embedding vector of a chemical compound that does not induce the corresponding transcript amount change.

(Deleted)

A method of constructing a drug learning model in a new drug candidate output device for discovering new drug candidates, the method comprising:
constructing a drug learning model in which an embedding vector for the chemical structure of a chemical compound and an embedding vector for the transcript amount change information induced by each chemical compound are placed in the same vector space; and
performing iterative learning so that, when data on the transcript amount before administration of the chemical compound is input to the drug learning model, the difference between the transcript amount data inferred by the drug learning model and the transcript amount data after actual administration of the chemical compound is minimized,
wherein the drug learning model outputs a matching transcript amount change information result when an embedding vector for the chemical structure of a new substance is input, or outputs information on one or more matching drugs when target transcript amount change information is input, and
wherein the transcript amount change information indicates the difference between the transcript amount before administration of the chemical compound and the transcript amount after administration of the chemical compound.

The method according to claim 4,
wherein constructing the drug learning model comprises,
through a triplet loss function, minimizing the distance between a first embedding vector representing the transcript amount change information and the embedding vector of the chemical compound that induces the corresponding transcript amount change, and maximizing the distance between the first embedding vector and the embedding vector of a chemical compound that does not induce the corresponding transcript amount change.

A new drug candidate output method using a new drug candidate output device, the method comprising:
providing a drug learning model in which an embedding vector for the chemical structure of a chemical compound and an embedding vector for the transcript amount change information induced by each chemical compound are located in the same vector space;
inputting an embedding vector for the chemical structure of a new substance, or target transcript amount change information, into the drug learning model; and
outputting, by the drug learning model, a transcript amount change information result matching the embedding vector for the chemical structure of the new substance, or information on one or more drugs matching the target transcript amount change information,
wherein the drug learning model has been iteratively trained so that, when data on the transcript amount before administration of the chemical compound is input, the difference between the changed transcript amount data output by the drug learning model and the transcript amount data after actual administration of the chemical compound is minimized, and
wherein the transcript amount change information indicates the difference between the transcript amount before administration of the chemical compound and the transcript amount after administration of the chemical compound.

The new drug candidate output method according to claim 6,
wherein the drug learning model is configured, through a triplet loss function, to minimize the distance between a first embedding vector representing the transcript amount change information and the embedding vector of the chemical compound that induces the corresponding transcript amount change, and to maximize the distance between the first embedding vector and the embedding vector of a chemical compound that does not induce the corresponding transcript amount change.

(Deleted)

A computer-readable recording medium on which a program for performing the method according to any one of claims 4 to 7 is recorded.