KR102158066B1

KR102158066B1 - Method, Apparatus, and Computer-Readable Medium for Deriving Novel Properties of Drugs

Info

Publication number: KR102158066B1
Application number: KR1020180127291A
Authority: KR
Inventors: 윤영미; 장기업
Original assignee: 가천대학교 산학협력단
Priority date: 2018-10-24
Filing date: 2018-10-24
Publication date: 2020-09-21
Also published as: KR20200046314A

Abstract

The present invention relates to a method, an apparatus, and a computer-readable medium for deriving novel properties of drugs, in which drug-gene relevance data is derived from biomedical literature data and, based on the derived drug-gene relevance data, new drugs predicted to have a target efficacy or side effect are identified.

Description

[Method, Apparatus, and Computer-Readable Medium for Deriving Novel Properties of Drugs]

The present invention relates to a method, an apparatus, and a computer-readable medium for deriving novel properties of drugs. More particularly, it relates to a method, an apparatus, and a computer-readable medium that derive drug-gene relevance data from biomedical literature data and, based on the derived drug-gene relevance data, identify new drugs predicted to have a target efficacy or side effect.

Developing a new drug requires astronomical research costs and a long research period of 10 to 20 years. To reduce this cost and time, many researchers and pharmaceutical companies are, alongside new drug development, searching for new efficacies of existing drugs. A representative example is Viagra, which was studied as a treatment for angina in its early development but is now used as a treatment for erectile dysfunction.

One way to find new efficacies of already-developed drugs is to conduct research using clinical trial results. However, such research involves many difficulties: consent must be obtained from the trial subjects, and the drug must be administered to a large number of subjects before enough data for the study is available. A new method is therefore needed to overcome these difficulties.

When a drug is administered to the human body, it either suppresses (down-regulates) or activates (up-regulates) genes in the body. Likewise, when a disease develops in the body, it suppresses or activates genes in the body. This motivated the search for a method that explores the relationship between drugs and diseases using genes as the intermediary.

Given these problems with clinical-trial-based methods, there is a growing need for a method that explores such relationships using not only clinical trials but also the literature from the large body of existing biological research.

In addition, drugs are accompanied by certain side effects. Each side effect is identified through clinical trials, which is enormously expensive, and clinical trials for deriving new uses or efficacies of a drug are likewise enormously expensive.

An object of the present invention is to provide a method, an apparatus, and a computer-readable medium for deriving novel properties of drugs, which derive drug-gene relevance data from biomedical literature data and, based on the derived drug-gene relevance data, identify new drugs predicted to have a target efficacy or side effect.

To solve the above problem, the present invention provides a method of deriving novel properties of drugs, implemented by a computing device, the method comprising: a DG relevance data extraction step of loading, on the computing device, DG relevance data that includes one or more pieces of related gene information for each of one or more drugs, or of deriving the DG relevance data from biomedical literature data; and a drug characteristic derivation step of determining whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on the matching rate of the gene information extracted from each drug's DG relevance data against one or more DG topics, each DG topic being extracted from the DG relevance data and composed of one or more pieces of gene information.

In one embodiment of the present invention, the DG relevance data extraction step may include: a text loading step of loading text from the biomedical literature data; a sentence extraction step of extracting analysis target text from the loaded text according to a preset rule; a sentence discrimination step of determining whether the analysis target text contains a drug and a gene; a valid word extraction step of extracting, from the analysis target text, valid words related to the relationship between the drug and the gene; and a first GRS calculation step of deriving the relationship between the drug and the gene from the valid words to extract the DG relevance data.

In one embodiment of the present invention, the sentence extraction step extracts two or more words from the loaded text in units of phrases or clauses and uses them as the analysis target text, and the sentence discrimination step may determine whether the analysis target text mentions both a drug and a gene.
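The sentence extraction and sentence discrimination steps could be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the vocabularies `DRUG_NAMES` and `GENE_SYMBOLS`, the punctuation-based splitting rule, and all function names are hypothetical. In the described system the vocabularies would come from a drug synonym database and a gene symbol database.

```python
import re

# Hypothetical vocabularies, used here only for illustration.
DRUG_NAMES = {"aspirin", "sildenafil"}
GENE_SYMBOLS = {"PTGS2", "PDE5A"}


def split_into_units(text):
    """Split text into clause-like units at periods, semicolons, and commas.

    Units with fewer than two words are discarded, since the analysis
    target text consists of two or more words.
    """
    units = re.split(r"[.;,]\s*", text)
    return [u.strip() for u in units if len(u.split()) >= 2]


def mentions_drug_and_gene(unit):
    """Return True if the unit mentions at least one drug and one gene."""
    tokens = re.findall(r"[A-Za-z0-9-]+", unit)
    has_drug = any(t.lower() in DRUG_NAMES for t in tokens)
    has_gene = any(t in GENE_SYMBOLS for t in tokens)
    return has_drug and has_gene


def select_analysis_units(text):
    """Keep only the units that mention both a drug and a gene."""
    return [u for u in split_into_units(text) if mentions_drug_and_gene(u)]
```

Splitting on punctuation is only a stand-in for the preset rule; a real pipeline would more likely rely on a syntactic parser to find phrase and clause boundaries.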

In one embodiment of the present invention, the valid word extraction step extracts, from the analysis target text, a valid word group composed of valid words that include verbs and nouns, and the first GRS calculation step may include: assigning a first parameter value to a valid word of the valid word group when that element corresponds to a preset biological term with an inhibitory meaning; assigning a second parameter value to a valid word of the valid word group when that element does not correspond to a preset biological term with an inhibitory meaning; and a first GRS derivation step of calculating a gene regulation score (GRS) for the drug and the gene based on the one or more parameter values assigned to the valid words of the valid word group.

In one embodiment of the present invention, the first parameter value and the second parameter value are numeric values with different signs, and the first GRS derivation step may calculate the gene regulation score by multiplying the two or more parameter values assigned to the valid words of the valid word group.
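The sign-based scoring can be illustrated with a short sketch. The concrete values -1 and +1, the term list `INHIBITORY_TERMS`, and the function name are assumptions chosen for illustration; the text only requires that the two parameter values differ in sign and that the assigned values are multiplied. The sketch also assumes the valid words have already been reduced to their base forms.

```python
from math import prod

# Hypothetical list of biological terms with an inhibitory meaning.
INHIBITORY_TERMS = {"inhibit", "suppress", "decrease", "reduce", "block"}

FIRST_PARAM = -1   # assumed value for inhibitory terms
SECOND_PARAM = 1   # assumed value for all other valid words


def gene_regulation_score(valid_words):
    """Multiply the parameter values assigned to each valid word.

    With -1 and +1 as parameter values, an odd number of inhibitory terms
    gives a negative score and an even number gives a positive score.
    """
    params = [
        FIRST_PARAM if word.lower() in INHIBITORY_TERMS else SECOND_PARAM
        for word in valid_words
    ]
    return prod(params)
```

Multiplication makes a double negation such as "inhibit" followed by "suppress" come out positive, which is the behavior one would expect if a negative score is read as down-regulation and a positive score as up-regulation.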

In one embodiment of the present invention, the drug characteristic derivation step may include: a first step of calculating, for the one or more drugs and based on their gene information, the matching rate for each DG topic extracted from the DG relevance data; a second step of deriving one or more representative DG topics based on the matching rates, or scores derived from the matching rates, of one or more drugs known to have the target efficacy or side effect; and a third step of deriving new drugs predicted to have the target efficacy or side effect, based on the matching rates, or scores derived from the matching rates, for the representative DG topics of one or more drugs not known to have the target efficacy or side effect.

In one embodiment of the present invention, the DG topics are a plurality of topics derived by a topic modeling method that treats each drug in the DG relevance data as a document and the gene information as words, and the first step may calculate, for each of the one or more drugs, the matching rate between the gene information in that drug's DG relevance data and the gene information contained in each of the DG topics.
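A minimal sketch of the first step is given below. The passage does not define the matching rate formula, so the sketch assumes it is the fraction of a topic's genes that also appear in the drug's gene set. The topics are taken as already computed gene sets (the topic modeling itself is not shown), and the function names are hypothetical.

```python
def matching_rate(drug_genes, topic_genes):
    """Fraction of the topic's genes that appear among the drug's genes.

    This definition is an assumption made for illustration.
    """
    topic = set(topic_genes)
    if not topic:
        return 0.0
    return len(set(drug_genes) & topic) / len(topic)


def matching_table(dg_data, topics):
    """Compute, for every drug, its matching rate against every topic.

    dg_data maps a drug name to the genes related to it; topics is a list
    of gene collections. The result maps each drug to a list of rates,
    indexed in the same order as topics.
    """
    return {
        drug: [matching_rate(genes, topic) for topic in topics]
        for drug, genes in dg_data.items()
    }
```

For example, a drug related to genes G1, G2, and G3 would score 1.0 against a topic made of G1 and G2, and 0.5 against a topic made of G3 and G4.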

In one embodiment of the present invention, the second step may, for each of the one or more drugs known to have the target efficacy or side effect, derive as the representative DG topics the one or more DG topics whose matching rate, or score derived from the matching rate, is at or above a preset threshold.
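The threshold-based selection of representative DG topics might look like the sketch below. How the per-drug selections are combined is not specified in the passage, so the sketch assumes a union: a topic is representative if at least one known drug reaches the threshold on it. The function name, the index-based topic identifiers, and the `rates` layout (drug name mapped to a list of per-topic matching rates) are hypothetical.

```python
def representative_topics(rates, known_drugs, threshold):
    """Return indices of topics on which any known drug meets the threshold.

    rates maps each drug to its list of per-topic matching rates;
    known_drugs lists the drugs known to have the target efficacy or
    side effect. Taking the union over known drugs is an assumption.
    """
    selected = set()
    for drug in known_drugs:
        for index, rate in enumerate(rates[drug]):
            if rate >= threshold:
                selected.add(index)
    return sorted(selected)
```

A stricter variant would require every known drug to reach the threshold, which amounts to intersecting the per-drug selections instead of taking their union.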

In one embodiment of the present invention, the representative DG topics include a plurality of topics, and the third step may derive, from among the one or more drugs not known to have the target efficacy or side effect, new drugs predicted to have it, based on a representative score for each such drug, the representative score being derived from that drug's matching rates, or scores derived from the matching rates, for the plurality of topics belonging to the representative DG topics.

In one embodiment of the present invention, the representative score may be the individual matching rates, or the sum of the matching rates, of each of the one or more drugs not known to have the target efficacy or side effect for the plurality of topics belonging to the representative DG topics.
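The sum-based representative score can be sketched as follows. The function names and the `rates` layout (drug name mapped to a list of per-topic matching rates, with representative topics given as list indices) are assumptions for illustration.

```python
def representative_score(drug_rates, rep_topic_indices):
    """Sum a drug's matching rates over the representative DG topics."""
    return sum(drug_rates[i] for i in rep_topic_indices)


def rank_by_representative_score(rates, candidate_drugs, rep_topic_indices):
    """Order candidate drugs from highest to lowest representative score."""
    return sorted(
        candidate_drugs,
        key=lambda drug: representative_score(rates[drug], rep_topic_indices),
        reverse=True,
    )
```

Ranking the candidates is one possible way to use the score; a cut-off on the score would serve the same purpose.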

In one embodiment of the present invention, the third step may use as a reference the matching rates, or scores derived from the matching rates, of each of the one or more drugs known to have the target efficacy or side effect for the plurality of topics belonging to the representative DG topics, and evaluate against that reference the matching rates, or scores derived from the matching rates, of each of the one or more drugs not known to have the target efficacy or side effect for the same topics, thereby deriving, from among the drugs not known to have the target efficacy or side effect, new drugs predicted to have it.
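One way to read this reference-based comparison is sketched below. The passage does not state how the reference is formed or how drugs are compared against it. The sketch therefore assumes that the reference for each representative topic is the lowest matching rate among the known drugs, and that a candidate is predicted when it meets the reference on every representative topic. All names are hypothetical.

```python
def per_topic_reference(rates, known_drugs, rep_topics):
    """For each representative topic, take the lowest rate among known drugs.

    Using the minimum as the reference is an assumption made for
    illustration; known_drugs must not be empty.
    """
    return {t: min(rates[d][t] for d in known_drugs) for t in rep_topics}


def predict_candidates(rates, known_drugs, rep_topics):
    """Return drugs outside known_drugs that meet the reference on all topics."""
    reference = per_topic_reference(rates, known_drugs, rep_topics)
    known = set(known_drugs)
    return [
        drug
        for drug, drug_rates in rates.items()
        if drug not in known
        and all(drug_rates[t] >= ref for t, ref in reference.items())
    ]
```

Other choices of reference, such as the mean rate of the known drugs per topic, would fit the same description and would make the prediction more or less permissive.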

To solve the above problem, the present invention also provides an apparatus for deriving novel properties of drugs, comprising one or more memories and one or more processors, the apparatus comprising: a DG relevance data extraction unit that loads from memory DG relevance data including one or more pieces of related gene information for each of one or more drugs, or derives the DG relevance data from biomedical literature data; and a drug characteristic derivation unit that determines whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on the matching rate of the gene information extracted from each drug's DG relevance data against one or more DG topics, each DG topic being extracted from the DG relevance data and composed of one or more pieces of gene information.

To solve the above problem, the present invention also provides a computer-readable recording medium storing instructions that cause a computing device to perform the following steps: a DG relevance data extraction step of loading, on the computing device, DG relevance data that includes one or more pieces of related gene information for each of one or more drugs, or of deriving the DG relevance data from biomedical literature data; and a drug characteristic derivation step of determining whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on the matching rate of the gene information extracted from each drug's DG relevance data against one or more DG topics, each DG topic being extracted from the DG relevance data and composed of one or more pieces of gene information.

According to an embodiment of the present invention, sentence information is extracted from biomedical literature data and a drug's down-regulation and up-regulation of genes are taken into account, making it possible to determine whether a drug whose target efficacy or side effect is unknown has that efficacy or side effect.

According to an embodiment of the present invention, new target efficacies or side effects of known drugs can be derived by indirectly using the drug-gene information described in a vast body of biomedical literature data.

According to an embodiment of the present invention, as new biomedical literature data is added, drugs having a new target efficacy or side effect can be derived without separate additional work or experiments.

According to an embodiment of the present invention, drugs having a new target efficacy or side effect can be derived at low cost and without clinical trials.

According to an embodiment of the present invention, drugs having a new target efficacy or side effect can be derived by making effective use of biomedical literature data.

FIG. 1 schematically shows a process of deriving novel properties of drugs based on biomedical literature data according to an embodiment of the present invention.
FIG. 2 schematically shows a system including a computing device that derives novel properties of drugs based on biomedical literature data according to an embodiment of the present invention.
FIG. 3 schematically shows the internal configuration of a computing device that derives novel properties of drugs based on biomedical literature data according to an embodiment of the present invention.
FIG. 4 schematically shows the internal configuration of a DG relevance data extraction unit according to an embodiment of the present invention.
FIG. 5 shows an example of the result of the operation of a valid word extraction unit according to an embodiment of the present invention.
FIG. 6 shows examples of preset biological terms with an inhibitory meaning according to an embodiment of the present invention.
FIG. 7 shows an example of calculating a gene regulation score (GRS) using the first parameter value or the second parameter value assigned to valid words according to an embodiment of the present invention.
FIG. 8 schematically shows the process of a drug characteristic derivation step according to an embodiment of the present invention.
FIG. 9 schematically shows a process of extracting topics from DG relevance data and deriving the matching rate of the topics for each drug according to an embodiment of the present invention.
FIG. 10 schematically shows a process of deriving scores based on each drug's matching rates for the topics according to an embodiment of the present invention.
FIG. 11 schematically shows a process of class labeling the drugs that have the target efficacy or side effect according to an embodiment of the present invention.
FIG. 12 schematically shows each drug's matching rates, or scores derived from the matching rates, for the representative DG topics according to an embodiment of the present invention.
FIG. 13 schematically shows a process of deriving new drugs predicted to have the target efficacy or side effect according to an embodiment of the present invention.
FIG. 14 shows an example of the internal configuration of a computing device according to an embodiment of the present invention.

Hereinafter, various embodiments and/or aspects are disclosed with reference to the drawings. In the following description, numerous specific details are set forth for purposes of explanation, to aid an overall understanding of one or more aspects. However, those of ordinary skill in the art will recognize that such aspects may be practiced without these specific details. The following description and the accompanying drawings set forth certain illustrative aspects in detail. These aspects are illustrative, however; only some of the various ways in which the principles of the various aspects may be employed are shown, and the description is intended to include all such aspects and their equivalents.

In addition, various aspects and features will be presented in terms of systems that may include a number of devices, components, and/or modules. It should be understood and appreciated that the various systems may include additional devices, components, and/or modules, and/or may not include all of the devices, components, and modules discussed in connection with the drawings.

As used herein, terms such as "embodiment", "example", "aspect", and "illustration" should not be construed to mean that any aspect or design described is better than, or advantageous over, other aspects or designs. The terms "unit", "component", "module", "system", "interface", and the like used below generally refer to a computer-related entity, which may be, for example, hardware, a combination of hardware and software, or software.

In addition, the terms "comprises" and/or "comprising" mean that the stated feature and/or element is present, but should be understood not to exclude the presence or addition of one or more other features, elements, and/or groups thereof.

In addition, terms including ordinal numbers, such as first and second, may be used to describe various elements, but the elements are not limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly a second element may be referred to as a first element. The term "and/or" includes any combination of a plurality of related listed items, or any one of the plurality of related listed items.

In addition, unless otherwise defined, all terms used in the embodiments of the present invention, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the embodiments of the present invention.

FIG. 1 schematically shows a method of deriving novel properties of drugs based on biomedical literature data according to an embodiment of the present invention.

According to this embodiment, a DG relevance derivation step is performed, which derives DG relevance data from the biomedical literature data.

The DG relevance data expresses numerically whether the use of a specific drug suppresses (down-regulates) or activates (up-regulates) a specific gene. In the present invention, this DG relevance data serves as the basis for deriving new efficacies of existing drugs.
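As a rough illustration, DG relevance data of this kind could be held in a nested mapping from drug to gene to score. The layout, the sign convention (negative for down-regulation, positive for up-regulation), the placeholder names, and the helper functions below are all assumptions made for this sketch; the text only states that the data is numeric.

```python
# Hypothetical DG relevance data: drug -> gene -> regulation score.
# Negative means down-regulation and positive means up-regulation
# (an assumed convention).
dg_relevance = {
    "drugA": {"GENE1": -1, "GENE2": 1},
    "drugB": {"GENE2": -1},
}


def regulation(dg_data, drug, gene):
    """Return "down", "up", or None when no relation is recorded."""
    score = dg_data.get(drug, {}).get(gene)
    if score is None or score == 0:
        return None
    return "down" if score < 0 else "up"


def related_genes(dg_data, drug):
    """Return the set of genes with a non-zero score for the drug."""
    return {gene for gene, score in dg_data.get(drug, {}).items() if score != 0}
```

The per-drug gene sets returned by `related_genes` are the kind of input a later topic-matching stage would consume.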

Here, "literature" in the biomedical literature data includes its commonly understood meaning but is not limited to it; it covers anything that expresses an intention, idea, or information in writing or symbols, such as papers, journals, and books. Preferably, the biomedical literature data is the summary or abstract of such papers, journals, or books.

In addition, the method of deriving novel properties of drugs based on biomedical literature data according to the present invention further includes a step of deriving, from the DG relevance data, new drugs that have the target efficacy or side effect.

Hereinafter, the apparatus of the present invention for deriving novel properties of drugs based on biomedical literature data will be described.

FIG. 2 schematically shows a system including a computing device that derives novel properties of drugs based on biomedical literature data according to an embodiment of the present invention.

The system consists of the computing device 1000 and the following externally connected databases: a biomedical literature data DB (A), a drug synonym DB (B), a DB (C) storing DG relevance data already derived for drug-gene relationships, and a gene symbol DB (D).

The computing device 1000 performs the computation that derives novel properties of drugs based on biomedical literature data, and may include a processor, a bus, a network interface, and a memory. In other embodiments, the computing device may include more components than these.

The biomedical literature data DB (A) is a database that stores a plurality of biomedical literature data and is accessible over an externally connected network, so that the computing device 1000 can obtain the information it needs for its computation. It includes, for example, the PubMed database, which provides abstracts of biomedical literature. Alternatively, the biomedical literature data may be accessed in a form stored in the memory of the computing device.

The drug synonym DB (B) is a database that stores drugs and their synonyms and is accessible over an externally connected network, so that the computing device 1000 can obtain the information it needs for its computation. It may include databases such as DrugBank or KEGG DRUG. Alternatively, the data of the drug synonym DB may be accessed in a form stored in the memory of the computing device.

The DG relevance data DB (C) is a database that stores the drug-gene relevance data needed for the computation of the computing device 1000 and is accessible over an externally connected network. In one embodiment of the present invention, the computing device may extract DG relevance data from the biomedical literature data. Alternatively, it may collect from an external source DG relevance data, that is, drug-gene relevance data already accumulated in another DB by the method of an embodiment of the present invention or by another method. Likewise, the method of deriving novel properties of drugs may be performed with the DG relevance data stored in, and loaded from, the memory of the computing device.

The gene symbol DB (D) is a database that stores gene symbol data and is accessible over an externally connected network, so that the computing device 1000 can obtain the information it needs for its computation. It may include databases such as PharmGKB.

In the embodiment shown in FIG. 2, the biomedical literature data DB (A), the drug synonym DB (B), the DG relevance data DB (C), and the gene symbol DB (D) are shown as operating independently outside the computing device 1000. The present invention is not limited to this arrangement, however: these databases may reside inside the computing device 1000, or information received from them may be stored in the computing device 1000 and used in its computation.

FIG. 3 schematically shows the internal configuration of a computing device that derives novel properties of a drug based on biomedical literature data according to an embodiment of the present invention.

The computing device according to this embodiment, which derives novel properties of a drug based on biomedical literature data, may include a processor (100), a bus (corresponding to the bidirectional arrows between the processor (100), the memory (300), and the network interface (200)), a network interface (200), and a memory (300). The memory (300) may contain an operating system; executable code, such as the executable code of the DG relevance data extraction unit (110) and of the drug characteristic derivation unit (120); DG relevance data; biomedical literature data; drug synonym data; and gene symbol data. The processor (100) may include the DG relevance data extraction unit (110) and the drug characteristic derivation unit (120). In other embodiments, the computing device may include more components than those shown in FIG. 3.

The memory is a computer-readable recording medium and may include random access memory (RAM), read only memory (ROM), and a permanent mass storage device such as a disk drive. The software components stored in the memory may be loaded, using a drive mechanism (not shown), from a computer-readable recording medium that is separate from the memory. Such a separate computer-readable recording medium (not shown) may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In other embodiments, the software components may be loaded into the memory through the network interface (200) rather than from a computer-readable recording medium.

The bus enables communication and data transfer between the components of the computing device that derives novel properties of a drug based on biomedical literature data. The bus may be implemented using a high-speed serial bus, a parallel bus, a storage area network (SAN), and/or another suitable communication technology.

The network interface (200) may be a computer hardware component that connects the computing device to a computer network, over either a wireless or a wired connection. Through the network interface (200), the computing device that derives novel properties of a drug based on biomedical literature data may also be connected, wirelessly or by wire, to a tactile interface device.

The processor may be configured to process the instructions of a computer program by performing the basic arithmetic, logic, and input/output operations of the computing device that derives novel properties of a drug based on biomedical literature data. Instructions may be supplied to the processor by the memory (300) or the network interface (200), via the bus. The processor may be configured to execute the program code for the DG relevance data extraction unit (110) and the drug characteristic derivation unit (120). This program code may be stored in a recording device such as the memory.

The DG relevance data extraction unit (110) and the drug characteristic derivation unit (120) may be configured to perform the method, described below, of deriving novel properties of a drug based on biomedical literature data. Depending on how that method is carried out, some components of the processor may be omitted, additional components not shown may be included, or two or more components may be combined.

The computing device preferably corresponds to a personal computer or a server. In some cases it may instead correspond to a smart phone, a tablet, a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook PC, a personal digital assistant (PDA), a portable multimedia player (PMP), an mp3 player, a mobile medical device, a camera, or a wearable device (for example, a head-mounted device (HMD), electronic clothing, an electronic bracelet, an electronic necklace, an electronic appcessory, an electronic tattoo, or a smart watch).

The computing device either receives biomedical literature data from the external biomedical literature data DB (A) through the network interface (200) or uses biomedical literature data already stored in the memory (300). It runs the process of the DG relevance data extraction unit (110) on that data to obtain the DG relevance data it then uses.

The DG relevance data extraction unit (110) either loads, from the computing device, DG relevance data containing information on one or more related genes for each of one or more drugs, or derives that data from biomedical literature data.

That is, in a preferred embodiment of the present invention the DG relevance data is continuously updated from biomedical literature data, whereas in other embodiments the device may operate by extracting or loading DG relevance data that has already been built.

The drug characteristic derivation unit (120) determines whether a drug for which the target efficacy or side effect is not yet known has that efficacy or side effect. The determination is based on the matching rate between the gene information extracted from each drug's DG relevance data and one or more DG topics, where each DG topic is extracted from the DG relevance data and is composed of one or more items of gene information.

FIG. 4 schematically shows the internal configuration of the DG relevance data extraction unit (110) according to an embodiment of the present invention.

The DG relevance data extraction unit (110) includes: a text loading unit (111) that loads text from the biomedical literature data; a sentence extraction unit (112) that extracts analysis target text from the loaded text according to a preset rule; a sentence determination unit (113) that determines whether the analysis target text contains a drug and a gene; an effective word extraction unit (114) that extracts, from the analysis target text, effective words related to the relationship between the drug and the gene; a first GRS calculation unit (115) that derives the relationship between the drug and the gene from the effective words; a second GRS calculation unit (116) that uses the results of the first GRS calculation unit (115) to combine information from multiple documents or multiple analysis target texts and derive the relationship between the drug and the gene; and a DG relevance data loading unit (117) that, instead of having the DG relevance data extraction unit (110) derive drug-gene information, loads DG relevance data that has already been derived or built. By operating the DG relevance data extraction unit (110), the DG relevance data is obtained either from the biomedical literature data or from data already stored.

The text loading unit (111) loads text from the biomedical literature data. It loads into the DG relevance data extraction unit (110) either biomedical literature data received from the external biomedical literature data DB (A) through the network interface (200) or biomedical literature data already stored in the memory (300). The biomedical literature data DB (A) includes, for example, the PubMed database, which provides abstracts of biomedical literature. For example, using the PubMed database through the network interface (200), more than 14.5 million abstracts from about 7,000 journals can be loaded (the source writes this figure as '1,450,000만'), and these abstracts become the loaded text of the DG relevance data extraction unit (110).

The sentence extraction unit (112) extracts the analysis target text by taking two or more words at a time from the loaded text, in units of phrases or clauses. The phrase or clause is used because it is the smallest unit that carries the information needed to derive DG relevance data from the loaded text.

The sentence determination unit (113) determines whether the analysis target text mentions both a drug and a gene, and extracts the texts that do as co-occurrence sentences. These co-occurrence sentences become the new analysis target text. By operating the sentence determination unit (113), the mentions of the drug and the gene are thus turned into DG relevance data that shows the relationship between that drug and that gene.

To determine whether a drug and a gene are mentioned, the sentence determination unit (113) searches the analysis target text using seeds that consist of drug synonyms and gene symbols, and extracts the co-occurrence sentences. The drug synonym and gene symbol data is either received from the external drug synonym DB (B) and gene symbol DB (D) through the network interface (200) or taken from the drug synonym data and gene symbol data already stored in the memory (300).

The drug synonym DB (B) and the drug synonym data (360) may include a database such as DrugBank or KEGG DRUG, and the gene symbol DB (D) and the gene symbol data (392) may include a database such as PharmGKB.

For example, the gene symbols and the drug synonyms serve as seeds for searching the analysis target text. From that text, a co-occurrence sentence such as “APOE mRNA expression was reduced after atorvastatin treatment” can be extracted, because it contains “APOE”, one of the gene symbols, and “atorvastatin”, one of the drug synonyms.
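The seed search described above can be sketched as follows. This is only an illustrative sketch: the seed sets, the tokenizer, and the function names are assumptions made for the example and do not come from the patent, which draws its seeds from the drug synonym DB (B) and the gene symbol DB (D).

```python
import re

# Hypothetical seed lists (assumption); the patent obtains them from
# the drug synonym DB (B) and the gene symbol DB (D).
DRUG_SYNONYMS = {"atorvastatin", "lipitor"}
GENE_SYMBOLS = {"APOE", "LDLR"}


def tokenize(text):
    """Split text into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def find_seeds(text):
    """Return the set of drug seeds and the set of gene seeds found in the text."""
    tokens = tokenize(text)
    drugs = {t.lower() for t in tokens if t.lower() in DRUG_SYNONYMS}
    genes = {t for t in tokens if t in GENE_SYMBOLS}
    return drugs, genes


def is_cooccurrence_sentence(text):
    """A text qualifies only if it mentions at least one drug and one gene."""
    drugs, genes = find_seeds(text)
    return bool(drugs) and bool(genes)
```

With the example sentence from the description, `find_seeds` finds one drug and one gene, so the sentence is kept; a sentence that mentions only the drug is discarded.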

Preferably, after extracting the co-occurrence sentences, the sentence determination unit (113) also determines whether the analysis target text contains a negation (for example, "not"). Texts found to contain a negation are excluded from the analysis target text. The reason lies in the first GRS calculation unit (115), described below: when a word in the analysis target text is a biological term, that unit classifies the term as having either a down-regulation meaning or an up-regulation meaning. Negated texts could be kept if negating a term of one kind always produced the meaning of the other kind, but this is not the case. For example, the negated up-regulation term “not activated” does not necessarily mean “inhibited”, which is a down-regulation term.
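A minimal sketch of such a negation filter is shown below. The cue list and the function names are assumptions for illustration; the patent only states that texts containing a negation such as "not" are excluded.

```python
import re

# Hypothetical negation cues (assumption); the patent names only "not".
NEGATION_CUES = {"not", "no", "never", "without"}


def has_negation(sentence):
    """Return True if the sentence contains a negation cue or an n't contraction."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(t in NEGATION_CUES or t.endswith("n't") for t in tokens)


def drop_negated(sentences):
    """Keep only the sentences that contain no negation cue."""
    return [s for s in sentences if not has_negation(s)]
```

Dropping the whole sentence, rather than flipping the sign of the negated term, mirrors the reasoning in the description: "not activated" is not guaranteed to mean "inhibited".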

The effective word extraction unit (114) extracts an effective word group made up of two or more effective words, which are verbs and nouns. The effective word group is used later, when the first GRS calculation unit (115) assigns parameter values to the effective words. An effective word is defined as follows. First, words that correspond to the seeds (the drug synonyms and gene symbols) are excluded from the words of the co-occurrence sentence. Among the remaining words, an effective word is one that carries information about the relationship between two or more words, including the drug or the gene, as needed to derive the DG relevance data. Such words are preferably verbs or nouns rather than other parts of speech such as prepositions or adverbs, because among the parts of speech, which are classified by the function a word performs in a sentence, verbs and nouns carry the main information of the sentence.

FIG. 5 shows an example of the result of operating the effective word extraction unit (114) according to an embodiment of the present invention.

In this embodiment, the co-occurrence sentence is “APOE mRNA expression was reduced after atorvastatin treatment”. The seed words “APOE” and “atorvastatin” (a gene symbol and a drug synonym) are excluded first, and the verbs and nouns among the remaining words, namely expression, reduced, and treatment, are the effective words. That is, operating the effective word extraction unit (114) on the co-occurrence sentence produces an effective word group containing the three effective words expression, reduced, and treatment.
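The extraction in this example can be sketched as below. The part-of-speech tags are hand-written for this one sentence, which is an assumption; a real system would obtain them from a part-of-speech tagger. The example in the description omits "mRNA" and "was", so the sketch further assumes that "mRNA" is treated as part of the gene mention and that "was" is tagged as an auxiliary rather than a main verb.

```python
# Hypothetical hand-written POS tags for the example sentence (assumption).
POS_TAGS = {
    "APOE": "NOUN", "mRNA": "NOUN", "expression": "NOUN", "was": "AUX",
    "reduced": "VERB", "after": "ADP", "atorvastatin": "NOUN",
    "treatment": "NOUN",
}

# Seeds found in the sentence, and tokens assumed to belong to the gene mention.
SEEDS = {"APOE", "atorvastatin"}
GENE_MENTION_PARTS = {"mRNA"}  # assumption made to reproduce the example


def extract_effective_words(tokens):
    """Keep nouns and verbs that are neither seeds nor part of a seed mention."""
    return [
        t for t in tokens
        if t not in SEEDS
        and t not in GENE_MENTION_PARTS
        and POS_TAGS.get(t) in {"NOUN", "VERB"}
    ]
```

Applied to the tokens of the example sentence, the function returns the effective word group expression, reduced, treatment.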

The first GRS calculation unit (115) determines whether each effective word of the effective word group is one of the preset biological terms with a down-regulation meaning. If an element of the effective word group is such a term, it is assigned a first parameter value; if not, it is assigned a second parameter value. Based on the parameter values assigned to the effective words of the group, the unit then calculates a gene regulation score (GRS) for the drug and the gene. As a result, the DG relevance data can be expressed, for each drug, as the gene regulation scores (GRS) of its related genes.

The first GRS calculation unit (115) first determines whether each effective word of the effective word group is a preset biological term with a down-regulation meaning. According to one embodiment, these preset terms (DRW, Down-regulation Relationship Words) may include down-regulation terms set directly by the user and/or down-regulation terms consisting of the words that appear both in the drug-gene relationship ontology (PHARE ontology) and in a negative vocabulary group. These terms are stored in the memory (300), and the first GRS calculation unit (115) can load from the DRW data whatever it needs to make the determination.

FIG. 6 shows examples of the preset biological terms with a down-regulation meaning according to an embodiment of the present invention.

In FIG. 6, words such as Abolish, Antagonize, and Decrease are among the preset biological terms with a down-regulation meaning.

If an element of the effective word group is one of the preset down-regulation terms, it is assigned the first parameter value. If it is not, it is assigned the second parameter value. In this way, each element of the effective word group receives one of two distinct parameter values according to whether it is a preset down-regulation term.

Finally, a gene regulation score (GRS) for the drug and the gene, or for the disease and the gene, is calculated from the parameter values assigned to the effective words of the effective word group. Depending on its value, the gene regulation score indicates whether the gene is down-regulated or up-regulated by use of the drug, or whether the gene is down-regulated or up-regulated by onset of the disease. The gene regulation score is then stored in the memory (300) as the DG relevance data (380).

Preferably, the first parameter value and the second parameter value are numbers of opposite sign, and the GRS derivation step calculates the gene regulation score by multiplying together the two or more parameter values assigned to the effective words of the effective word group.

If the gene regulation score (GRS) obtained in this way has the same sign as the first parameter value, the gene is down-regulated by use of the drug or onset of the disease; if it has the same sign as the second parameter value, the gene is up-regulated. A score with the sign of the first parameter value means that preset down-regulation terms occur an odd number of times in the effective word group of the co-occurrence sentence, so the sentence has a down-regulation meaning. Conversely, a score with the sign of the second parameter value means that the sentence has an up-regulation meaning.

The operation of the first GRS calculation unit (115) is described below using the embodiment of FIG. 5 and FIG. 6.

When the elements of the effective word group of FIG. 5 (expression, reduced, and treatment) are checked against the preset down-regulation terms illustrated in FIG. 6, reduced is found to be a preset down-regulation term, whereas expression and treatment are not. Therefore reduced is assigned the first parameter value, and expression and treatment are assigned the second parameter value, which is distinct from the first.

FIG. 7 shows an example of calculating the gene regulation score (GRS) using the first or second parameter values assigned to the effective words according to an embodiment of the present invention.

According to FIG. 7, the negative real number -1 may be used as the first parameter value and the positive real number +1 as the second parameter value. Among the elements of the effective word group of FIG. 5, reduced, which is a preset down-regulation term, is assigned the first parameter value -1, and the other words, expression and treatment, are assigned the second parameter value +1. Multiplying all the parameter values assigned to the effective words gives -1, which is the gene regulation score (GRS). This score has the same sign as the first parameter value. As explained above, it follows that use of the specific drug atorvastatin shown in FIG. 7 down-regulates the specific gene APOE.
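The calculation in FIG. 7 can be sketched as follows. The word set is a small hypothetical subset of the DRW list (the description names Abolish, Antagonize, and Decrease, and states that reduced is a down-regulation term); exact lowercase matching and the function name are assumptions made for the example.

```python
from math import prod

# Hypothetical subset of the preset down-regulation words (DRW).
DOWN_REGULATION_WORDS = {"abolish", "antagonize", "decrease", "reduced"}

FIRST_PARAM = -1   # assigned to down-regulation terms
SECOND_PARAM = +1  # assigned to all other effective words


def gene_regulation_score(effective_words):
    """Multiply the parameter values assigned to the effective words."""
    params = [
        FIRST_PARAM if w.lower() in DOWN_REGULATION_WORDS else SECOND_PARAM
        for w in effective_words
    ]
    return prod(params)
```

For the effective word group expression, reduced, treatment the product is (+1) × (-1) × (+1) = -1, which indicates down-regulation. Because the values are ±1, the sign of the product is negative exactly when an odd number of down-regulation terms occurs.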

Preferably, when the first GRS calculation unit derives two or more gene regulation scores for the same drug and gene, the gene regulation score is updated to the gene regulation score sum (GRS_sum) computed from those scores. When many trials are repeated and each trial yields a different result because of error, a larger sample of trials gives a more accurate result. The gene regulation score sum (GRS_sum), which adds up the two or more gene regulation scores, therefore has the advantage of being more accurate than a single gene regulation score.

For example, if documents 1 to 3 of the biomedical literature data give gene regulation scores of +1, +1, and -1 for a specific drug A and the specific gene APOE, the gene regulation score sum is +1 (= +1 + 1 - 1). The gene regulation score can then be updated to this sum, +1.

Preferably, the second GRS calculation unit keeps only the sign of the sum of the gene regulation scores across the documents. For example, if five documents give gene regulation scores of +1, +1, +1, +1, and +1 for a specific drug A and a specific gene B, the gene regulation score (or gene regulation score sum) updated by the second GRS calculation unit is preferably +1 rather than +5.
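The aggregation across documents can be sketched as follows. The function names are illustrative, and the handling of a sum of exactly zero (returned here as 0) is an assumption, since the description does not address that case.

```python
def grs_sum(scores):
    """Sum the per-document gene regulation scores for one drug-gene pair."""
    return sum(scores)


def sign(value):
    """Return +1, -1, or 0 according to the sign of value."""
    return (value > 0) - (value < 0)


def aggregated_grs(scores):
    """Keep only the sign of the summed scores, as the second GRS unit does."""
    return sign(grs_sum(scores))
```

With scores +1, +1, -1 from three documents the sum is +1; with five documents that each give +1 the sum is +5, but the aggregated score is stored as +1.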

FIG. 8 schematically shows the process of the drug characteristic derivation step according to an embodiment of the present invention.

The drug characteristic derivation step includes: a first step (S10, S20) of calculating, for the one or more drugs and based on their gene information, a matching rate for each DG topic extracted from the DG relevance data; a second step (S30, S40) of deriving one or more representative DG topics based on the matching rates, or scores derived from the matching rates, that the drugs known to have the target efficacy or side effect show for the DG topics; and a third step (S50) of deriving new drugs predicted to have the target efficacy or side effect, based on the matching rates, or scores derived from the matching rates, that the drugs not known to have the target efficacy or side effect show for the representative DG topics.

In step S10, DG topics are extracted. The DG topics are a plurality of topics derived by a topic modeling method in which each drug in the DG relevance data is treated as a document and each item of gene information is treated as a word.
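The drug-as-document, gene-as-word view can be sketched as a small count matrix that a topic model would take as input. The data below is invented for illustration, and counting each related gene once per drug is an assumption; the description does not specify here how gene information is encoded as words.

```python
# Hypothetical DG relevance data: drug -> {gene: gene regulation score}.
dg_relevance = {
    "atorvastatin": {"APOE": -1, "LDLR": +1},
    "drugA": {"APOE": +1},
}


def build_corpus(dg):
    """Treat each drug as a document and each related gene as a word.

    Returns the drug list, the gene vocabulary, and a drug-by-gene count
    matrix in which every related gene counts once (assumption).
    """
    drugs = sorted(dg)
    vocab = sorted({gene for genes in dg.values() for gene in genes})
    column = {gene: i for i, gene in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in drugs]
    for row, drug in enumerate(drugs):
        for gene in dg[drug]:
            matrix[row][column[gene]] += 1
    return drugs, vocab, matrix
```

Each row of the matrix plays the role of one document's word counts, so any standard topic modeling implementation can be run on it to obtain DG topics.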

As the number of documents has grown rapidly in recent years, techniques have been studied that let researchers deal only with the documents they need. One such technique, topic modeling, has been researched and developed. Topic modeling is based on algorithms that extract the main keywords of a document from its content and group together documents that have corresponding keywords.

토픽 모델링은 복수의 문서로부터 메인 토픽 혹은 메인 테마를 추출하는 통계적 기반 모델링 알고리즘이고, 문서는 복수의 토픽의 복합체로 가정한다. 토픽 모델링에서는 복수의 몇몇 문헌에서 동시에 자주 언급되는 워드 집합을 하나의 토픽으로 분류한다. 토픽은 새로운 카테고리를 형성하고, 토픽을 구성하는 단어들의 의미는 해당 카테고리에 의미를 부여한다. Topic modeling is a statistical-based modeling algorithm that extracts a main topic or main theme from a plurality of documents, and a document is assumed to be a complex of a plurality of topics. In topic modeling, a set of words frequently mentioned at the same time in several documents are classified into one topic. A topic forms a new category, and the meaning of words constituting the topic gives meaning to the category.

일반적인 데이터 클러스터링 방식에서는 하나의 문헌은 하나의 클러스터에 해당되지만, 토픽 모델링에서는 하나의 문서가 복수의 토픽에 해당될 수 있고, 또한 각각의 토픽에 대한 매칭은 O/X 가 아닌 확률값으로 지정될 수도 있다.In a general data clustering method, one document corresponds to one cluster, but in topic modeling, one document may correspond to a plurality of topics, and the matching for each topic may be designated as a probability value other than O/X. have.

The overall process of LDA (Latent Dirichlet Allocation), an exemplary topic modeling algorithm, can be expressed by the following equation, the joint distribution of the standard LDA model.

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$

Here, the topics are $\beta_{1:K}$, where each $\beta_k$ is a distribution over the vocabulary. The topic proportions for the d-th document are $\theta_d$, where $\theta_{d,k}$ is the proportion of topic k in the d-th document. The topic assignments for the d-th document are $z_d$, where $z_{d,n}$ is the topic assignment of the n-th word in document d. The observed words of document d are $w_d$, where $w_{d,n}$ is the n-th word in document d, which is an element of the fixed vocabulary.
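
The quantities defined above (topics, per-document topic proportions, per-word topic assignments, and observed words) can be made concrete by sampling one document from the standard LDA generative process. The sketch below is illustrative only: the two topics in `beta`, the Dirichlet parameter `alpha`, and the gene vocabulary are invented values, and a Dirichlet prior on the topic proportions is the usual LDA assumption rather than something stated in the text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions (theta) from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(beta, alpha, n_words, rng):
    """Sample theta, topic assignments z, and word indices w for one document."""
    theta = sample_dirichlet(alpha, rng)
    z, w = [], []
    for _ in range(n_words):
        topic = sample_categorical(theta, rng)       # topic assignment for this word
        word = sample_categorical(beta[topic], rng)  # word drawn from that topic
        z.append(topic)
        w.append(word)
    return theta, z, w

rng = random.Random(0)
vocabulary = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical vocabulary
beta = [
    [0.70, 0.20, 0.05, 0.05],  # hypothetical topic 0: distribution over the vocabulary
    [0.05, 0.05, 0.20, 0.70],  # hypothetical topic 1
]
theta, z, w = generate_document(beta, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Fitting LDA runs this process in reverse: given only the observed words, it infers plausible topics and topic proportions.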

In step S10, topic modeling may be performed using LDA as described above, or using another method.

In another embodiment of the present invention, rather than extracting DG topics by performing topic modeling on the DG relevance data as in step S10, previously extracted DG topics may be loaded.

A plurality of such DG topics is obtained, and each topic includes a plurality of pieces of gene information.

For example, topic A may be structured as follows.

Gene A: +

Gene B: +

Gene C: -

Gene D: +

Gene E: -

Gene F: +

Gene H: +

… Gene Z: +

In one embodiment of the present invention, of the approximately 2,200 drugs currently approved by the FDA, topics were extracted by topic modeling for the 684 drugs that have gene-related literature information. The appropriate number of topics was determined using a log-likelihood technique, but it may also be determined in another way or set arbitrarily.

In S20, the matching rate between the gene information in the DG relevance data of each of the one or more drugs and the gene information included in each of the DG topics is calculated. The matching rate is preferably calculated as a probability value, and more preferably the matching rates of each drug over all DG topics are normalized to sum to a specific number (for example, 1).
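
One simple way to picture a normalized matching rate is shown below. The text does not specify how the rate is computed (with LDA it would normally be the inferred per-document topic proportion), so this sketch assumes a plain overlap count between a drug's gene tokens and each topic's gene tokens, scaled so that the rates sum to 1. The function `matching_rates` and all data are hypothetical.

```python
def matching_rates(drug_genes, topics):
    """Return a per-topic matching rate for one drug, normalized to sum to 1.

    Assumption: the raw match is the number of gene tokens shared by the
    drug and the topic. If nothing overlaps, every rate is 0.0.
    """
    raw = {name: len(drug_genes & genes) for name, genes in topics.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in topics}
    return {name: count / total for name, count in raw.items()}

# Hypothetical topics and one hypothetical drug.
topics = {
    "Topic A": {"GENE_A+", "GENE_B+", "GENE_C-", "GENE_D+"},
    "Topic B": {"GENE_A+", "GENE_E-"},
    "Topic C": {"GENE_F+"},
}
drug_genes = {"GENE_A+", "GENE_B+", "GENE_C-"}

rates = matching_rates(drug_genes, topics)
```

With these inputs the drug shares three tokens with Topic A, one with Topic B, and none with Topic C, so the normalized rates are 0.75, 0.25, and 0.0.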

In S30, a group of drugs known to have the target efficacy or side effect is derived. For example, when the target efficacy or side effect relates to breast cancer, the group of drugs with an efficacy or side effect for breast cancer is derived from among the drugs included in the DG relevance data.

For example, if the DG relevance data contains 600 drugs, that is, 600 drug-gene groups (a single drug typically has multiple pieces of gene information), about six of them may already be known to be effective against breast cancer. In this case, those six drugs constitute the group of drugs known to have the target efficacy or side effect.

In S40, for each of the one or more drugs known to have the target efficacy or side effect, one or more DG topics whose matching rate, or score derived from the matching rate, is at or above a preset criterion are derived as the representative DG topics.

Specifically, to derive a new drug for breast cancer, the value of each DG topic is determined for the six drugs, and the DG topics with a high matching rate or a high score derived from the matching rate (a score obtained from the matching rate by normalization or by a specific function) are collected. The DG topics collected in this way can be regarded as gene topic information associated with breast cancer, and they are taken as the representative DG topics.

For example, the DG topics whose matching rate, or score derived from the matching rate, is at or above a preset value for each of the six drugs may be extracted as the representative DG topics.

For example, assume that there are drugs Drug 1 to Drug 600 and topics Topic A to Topic Z, that breast cancer is the target efficacy, and that Drug 1 to Drug 6 are known to have the target efficacy for breast cancer.

For example, if Topic A has the highest matching rate, or score derived from the matching rate, for Drug 1, Topic B for Drug 2, Topic C for Drug 3, Topic D for Drug 4, Topic E for Drug 5, and Topic F for Drug 6, then the representative DG topics may be Topics A, B, C, D, E, and F. This suggests that Topics A, B, C, D, E, and F are topics related to breast cancer.

A drug need not yield only one representative DG topic: two or more DG topics exceeding the reference value may be extracted from one drug, as may two or more DG topics whose matching rates, or scores derived from the matching rates, are equal or fall within a similar range.
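
The selection of representative DG topics in S40 can be sketched as a threshold filter over the drugs already known to have the target efficacy. This is a minimal sketch: the function name `representative_topics`, the threshold value, and the rates are hypothetical, and a single drug may contribute more than one topic, as noted above.

```python
def representative_topics(rates_by_drug, known_drugs, threshold):
    """Collect every topic whose rate meets the threshold for any known drug."""
    selected = set()
    for drug in known_drugs:
        for topic, rate in rates_by_drug[drug].items():
            if rate >= threshold:
                selected.add(topic)
    return sorted(selected)

# Hypothetical matching rates; Drug 1 and Drug 2 are the known drugs.
rates_by_drug = {
    "Drug 1": {"A": 0.6, "B": 0.3, "C": 0.1},
    "Drug 2": {"A": 0.2, "B": 0.5, "C": 0.3},
    "Drug 3": {"A": 0.1, "B": 0.1, "C": 0.8},
}

representative = representative_topics(rates_by_drug, ["Drug 1", "Drug 2"], threshold=0.5)
```

Here Topic C is not selected even though Drug 3 matches it strongly, because Drug 3 is not in the known group.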

In step S50, a new drug predicted to have the target efficacy or side effect is derived based on the matching rates, or scores derived from the matching rates, for the representative DG topics of one or more drugs not known to have the target efficacy or side effect.

Specifically, step S50 considers, for each drug whose target efficacy or side effect is unknown, the matching rates, or scores derived from the matching rates, for the DG topics belonging to the representative DG topic group.

For example, when Topics A, B, C, D, E, and F are the representative DG topics as in the example above, a new drug is derived based on the matching rates, or scores based on the matching rates, of each of Drugs 7 to 600 for Topics A, B, C, D, E, and F.

Preferably, a representative score is derived for each of the one or more drugs not known to have the target efficacy or side effect from that drug's matching rates, or scores derived from the matching rates, for the plurality of topics belonging to the representative DG topics, and a new drug predicted to have the target efficacy or side effect is derived from among those drugs based on the representative score.

Preferably, the representative score may be the sum of the matching rates, or scores derived from the matching rates, of each such drug for the plurality of topics belonging to the representative DG topics.

For example, when the sums of the scores derived from the matching rates of Drugs 1 to 6 for Topics A, B, C, D, E, and F are 0.6, 0.7, 0.6, 0.6, 0.7, and 0.9, respectively, a representative value obtained by statistical analysis of these six sums (for example, the mean or the median) may serve as the criterion for a drug having gene topics similar to those associated with breast cancer.
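
The representative value for the six example sums can be computed directly. The sketch below uses Python's `statistics` module for the mean and median. The helper `is_candidate` and its `tolerance` parameter are hypothetical, added only to show how such a reference value might be used as a criterion.

```python
import statistics

# Sums of scores for Drugs 1 to 6 from the example in the text.
known_sums = [0.6, 0.7, 0.6, 0.6, 0.7, 0.9]

mean_reference = statistics.mean(known_sums)      # about 0.683
median_reference = statistics.median(known_sums)  # about 0.65

def is_candidate(drug_sum, reference, tolerance=0.1):
    """Hypothetical criterion: the drug's sum lies within `tolerance` of the reference."""
    return abs(drug_sum - reference) <= tolerance
```

Under this assumed tolerance, a drug with a sum of 0.7 would count as similar to the known group, while a drug with a sum of 0.2 would not.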

That is, in deriving a new drug, step S50 may set a reference value based on the matching rates, or scores derived from the matching rates, of each of the one or more drugs known to have the target efficacy or side effect for the plurality of topics belonging to the representative DG topics. In embodiments of the present invention, the reference value may be set by various statistical techniques. For example, a Z-score may be used.

Then, in step S50, the matching rates, or scores derived from the matching rates, of each of the one or more drugs not known to have the target efficacy or side effect for the plurality of topics belonging to the representative DG topics are evaluated against the reference value.

For example, for Drugs 7 to 600, a new drug may be derived by applying the sum of the matching rates, or scores derived from the matching rates, for Topics A, B, C, D, E, and F, or a value obtained by transforming that sum with a statistical technique (for example, a Z-score).

For example, if the sums of the matching rates, or scores derived from the matching rates, of Drugs 10, 11, 12, and 13 for Topics A, B, C, D, E, and F are close to the corresponding sums for Drugs 1 to 6, then Drugs 10, 11, 12, and 13 have gene information similar to that of Drugs 1 to 6, from which it can be predicted that Drugs 10, 11, 12, and 13 are also effective against breast cancer.

In a preferred embodiment of the present invention, a statistical transformation (for example, based on a normal distribution) is applied to a first transformed value (for example, the sum) of each drug's matching rates, or values derived from the matching rates, for the representative DG topics, and a drug whose target efficacy or side effect is unknown is derived as a new drug when its first transformed value corresponds to, or lies within a preset range of, the first transformed value of the drugs known to have the target efficacy or side effect.

FIG. 9 schematically illustrates a process of extracting topics from DG relevance data and deriving each drug's matching rate for the topics, according to an embodiment of the present invention.

Drug 1, 2, 3, and so on at the left of FIG. 9 are DG relevance data and show the related gene information for each drug. In the present invention, each drug in the DG relevance data is treated as a document and each drug's gene information as words, and topic modeling is performed on this basis.

The squares shown in red, yellow, green, and blue under Topic Assignment in FIG. 9 are the topics derived from the drugs' DG relevance data, and as described above each such topic may take the form Gene A: +, Gene B: +, Gene C: -, and so on.

Rather than determining whether each drug's gene information matches a topic 100%, the present invention extracts each drug's matching rate for each of the topics, which were extracted as patterns judged to be common across the drugs' gene information. For example, assuming that four topics have been derived in FIG. 9, the bar graph shows the matching rates or probability values of a specific drug (for example, Drug 2) for the four topics.

The table at the right of FIG. 9 summarizes the probability values or matching rates of each drug for each derived DG topic. Thus, when the DG relevance data contains N drugs, the present invention derives, for each of the N drugs, the matching rates for the derived DG topics, the number of which lies within a valid range.

In FIG. 9, P_{I,X} denotes the matching rate of the I-th drug for the X-th topic, or a probability value for that matching rate.

FIG. 10 schematically illustrates a process of deriving, for each drug, scores based on the matching rates for the topics, according to an embodiment of the present invention.

In one embodiment of the present invention, the matching rates may be used as they are, or they may be converted into scores as in FIG. 10. Preferably, for each drug, the topics are ranked by matching rate and each rank is converted into a score.

For example, assuming six topics, if Drug 2 has matching rates or probability values of 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6 for Topics A, B, C, D, E, and F, the topics are assigned ranks 6, 5, 4, 3, 2, and 1, respectively, and rank 1 may be given 1 point, rank 2 0.8 points, rank 3 0.6 points, rank 4 0.4 points, rank 5 0.2 points, and rank 6 0 points.
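
The rank-to-score conversion in this example follows a linear rule: with n topics, rank r receives (n - r) / (n - 1) points, which gives 1, 0.8, 0.6, 0.4, 0.2, and 0 for n = 6. The sketch below applies that rule to the Drug 2 values from the text. The function name `rank_scores` is hypothetical, and the handling of ties (broken by sort order) is an assumption, since the text does not address ties.

```python
def rank_scores(rates):
    """Convert per-topic matching rates into rank-based scores in [0, 1].

    The highest rate gets rank 1 and score 1.0; the lowest gets score 0.0.
    Assumption: ties are broken by the stable sort order.
    """
    n = len(rates)
    ordered = sorted(rates, key=rates.get, reverse=True)
    scores = {}
    for rank, topic in enumerate(ordered, start=1):
        scores[topic] = (n - rank) / (n - 1) if n > 1 else 1.0
    return scores

# Matching rates of Drug 2 for Topics A to F, as in the example.
drug2_rates = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4, "E": 0.5, "F": 0.6}
drug2_scores = rank_scores(drug2_rates)
```

For Drug 2 this yields a score of 1.0 for Topic F down to 0.0 for Topic A, matching the points listed in the example.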

Here, R_{I,X} is the score derived from the matching rate of the I-th drug for the X-th topic, or from the probability value for that matching rate.

FIG. 11 schematically illustrates a process of labeling the class of drugs that have the target efficacy or side effect, according to an embodiment of the present invention. For example, when the target efficacy is against breast cancer and Drugs 1, 3, and 4 are known to be effective against breast cancer, the Class field of Drugs 1, 3, and 4 is marked T, as shown in FIG. 11. This is because a drug whose matching rates for the gene topics resemble those of Drugs 1, 3, and 4 can be expected to be effective against breast cancer.

FIG. 12 schematically shows each drug's matching rates, or scores derived from the matching rates, for the representative DG topics according to an embodiment of the present invention, and FIG. 13 schematically shows a process of deriving a new drug predicted to have the target efficacy or side effect according to an embodiment of the present invention.

FIG. 12 shows a case in which, for example, Drugs 1, 3, and 4 are known to have the target efficacy and the topics that satisfy the preset matching-rate or score criterion for Drugs 1, 3, and 4 (the representative DG topics) are Topics a, b, c, d, and e.

As shown in the Sum column of FIG. 12, this step derives, for each drug, the sum of the matching rates, or scores derived from the matching rates (FIG. 12 shows the latter), for Topics a, b, c, d, and e, which are the representative DG topics. Sum may be a simple total, or any other value derived from the matching rates, or scores derived from the matching rates, for Topics a, b, c, d, and e.

In the embodiment shown in FIGS. 12 and 13, the scores derived from the matching rates for the representative DG topics are combined into Sum. In this case, Drugs 1, 3, and 4 are expected to have higher Sum values than the other drugs, and drugs with similar Sum values can be expected to have an efficacy similar to that of Drugs 1, 3, and 4.
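
The Sum column can be sketched as a sum restricted to the representative DG topics. The scores below are invented rank-based values for six topics, of which a to e are treated as representative. The function name `representative_sum` is hypothetical.

```python
def representative_sum(scores, representative):
    """Add up one drug's scores over the representative DG topics only."""
    return sum(scores[topic] for topic in representative)

representative = ["a", "b", "c", "d", "e"]

# Hypothetical rank-based scores for two drugs over six topics.
scores_by_drug = {
    "Drug 1": {"a": 1.0, "b": 0.8, "c": 0.6, "d": 0.4, "e": 0.2, "f": 0.0},
    "Drug 2": {"a": 0.0, "b": 0.2, "c": 0.4, "d": 0.6, "e": 0.8, "f": 1.0},
}

sums = {drug: representative_sum(scores, representative)
        for drug, scores in scores_by_drug.items()}
```

Drug 2's best topic (f) lies outside the representative set, so its Sum comes out lower than Drug 1's.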

In one embodiment of the present invention, the Sum of each drug's scores for the representative DG topics is derived by addition, a normal distribution is fitted to the derived Sum values, a Z-score is computed for each drug's Sum, and drugs whose Z-scores fall within a range corresponding to the Z-scores of Drugs 1, 3, and 4 are predicted to have the target efficacy or side effect.
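
The Z-score step can be sketched as follows. The text does not define the "corresponding range", so this example assumes it is the interval spanned by the known drugs' Z-scores, optionally widened by a `margin`. The Sum values, the function names, and that interval rule are all hypothetical.

```python
import statistics

def z_scores(sums):
    """Standardize each drug's Sum using the mean and population standard deviation."""
    values = list(sums.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return {drug: 0.0 for drug in sums}
    return {drug: (value - mean) / std for drug, value in sums.items()}

def predict_new_drugs(sums, known_drugs, margin=0.0):
    """Return unknown drugs whose Z-score lies within the known drugs' Z-score range."""
    z = z_scores(sums)
    low = min(z[d] for d in known_drugs) - margin
    high = max(z[d] for d in known_drugs) + margin
    return sorted(d for d in sums if d not in known_drugs and low <= z[d] <= high)

# Hypothetical Sum values; Drugs 1, 3, and 4 are known to have the target efficacy.
sums = {
    "Drug 1": 4.2, "Drug 2": 1.0, "Drug 3": 3.8, "Drug 4": 4.0,
    "Drug 5": 3.9, "Drug 6": 0.8, "Drug 7": 1.5,
}
predicted = predict_new_drugs(sums, known_drugs={"Drug 1", "Drug 3", "Drug 4"})
```

With these values only Drug 5 falls inside the known drugs' range, so it would be the predicted new drug.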

The present invention is not limited to Sum. By other methods as well, drugs whose matching rates or probability values, or scores derived from them, for the representative DG topics show a pattern similar to that of the drugs known to have the target efficacy or side effect may be predicted to have the target efficacy or side effect.

FIG. 14 illustrates an example of the internal configuration of a computing device according to an embodiment of the present invention.

As shown in FIG. 14, the computing device 11000 may include at least one processor 11100, a memory 11200, a peripheral interface 11300, an input/output (I/O) subsystem 11400, a power circuit 11500, and a communication circuit 11600. The computing device 11000 may correspond to the user terminal A connected to a tactile interface device or to the computing device B described above.

The memory 11200 may include, for example, high-speed random access memory, a magnetic disk, SRAM, DRAM, ROM, flash memory, or non-volatile memory. The memory 11200 may contain software modules, instruction sets, or various other data required for the operation of the computing device 11000.

Access to the memory 11200 by other components, such as the processor 11100 or the peripheral interface 11300, may be controlled by the processor 11100.

The peripheral interface 11300 may couple the input and/or output peripherals of the computing device 11000 to the processor 11100 and the memory 11200. The processor 11100 may execute software modules or instruction sets stored in the memory 11200 to perform various functions for the computing device 11000 and to process data.

The I/O subsystem 11400 may couple various input/output peripherals to the peripheral interface 11300. For example, the I/O subsystem 11400 may include a controller for coupling a peripheral such as a monitor, keyboard, mouse, or printer, or, as needed, a touch screen or sensor, to the peripheral interface 11300. According to another aspect, input/output peripherals may be coupled to the peripheral interface 11300 without passing through the I/O subsystem 11400.

The power circuit 11500 may supply power to all or some of the components of the terminal. For example, the power circuit 11500 may include a power management system, one or more power sources such as a battery or alternating current (AC), a charging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any other components for generating, managing, and distributing power.

The communication circuit 11600 may enable communication with other computing devices using at least one external port.

Alternatively, as described above, the communication circuit 11600 may, if necessary, include an RF circuit and communicate with other computing devices by transmitting and receiving RF signals, also known as electromagnetic signals.

The embodiment of FIG. 14 is only one example of the computing device 11000. The computing device 11000 may omit some of the components shown in FIG. 14, may include additional components not shown in FIG. 14, or may have a configuration or arrangement that combines two or more components. For example, a computing device for a communication terminal in a mobile environment may further include a touch screen, sensors, and the like in addition to the components shown in FIG. 14, and the communication circuit 11600 may include circuitry for RF communication using various schemes (WiFi, 3G, LTE, Bluetooth, NFC, Zigbee, etc.). The components that may be included in the computing device 11000 may be implemented in hardware including one or more signal-processing or application-specific integrated circuits, in software, or in a combination of hardware and software.

Methods according to embodiments of the present invention may be implemented as program instructions executable by various computing devices and recorded on a computer-readable medium. In particular, a program according to the present embodiment may be a PC-based program or an application dedicated to mobile terminals. An application to which the present invention is applied may be installed on a user terminal through a file provided by a file distribution system. For example, the file distribution system may include a file transmission unit (not shown) that transmits the file at the request of the user terminal.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computing devices and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of deriving a new property of a drug, implemented by a computing device, the method comprising:
a DG relevance data extraction step of loading, into the computing device, DG relevance data including information on one or more related genes for each of one or more drugs, or deriving the DG relevance data from biomedical literature data; and
a drug characteristic derivation step of determining whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on a matching rate between gene information extracted from the DG relevance data of each drug and one or more DG topics that are extracted from the DG relevance data and composed of one or more pieces of gene information,
wherein the DG relevance data extraction step comprises:
a text loading step of loading text from the biomedical literature data;
a sentence extraction step of extracting analysis target text from the loaded text according to a preset rule;
a sentence discrimination step of determining whether the analysis target text contains a drug or a gene;
a valid word extraction step of extracting, from the analysis target text, valid words related to the relationship between the drug and the gene; and
a first GRS calculation step of extracting the DG relevance data by deriving the relationship between the drug and the gene from the valid words,
wherein, in the valid word extraction step, a valid word group comprising valid words including verbs and nouns is extracted from the analysis target text, and
wherein the first GRS calculation step comprises:
assigning a first parameter value to a valid word of the valid word group when that element of the valid word group corresponds to a biological term having a predetermined suppressive meaning;
assigning a second parameter value to a valid word of the valid word group when that element of the valid word group does not correspond to a biological term having a predetermined suppressive meaning; and
a GRS derivation step of calculating a gene regulation score (GRS) for the drug and the gene based on the one or more parameter values assigned to the valid words of the valid word group.
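
The parameter assignment and GRS derivation recited above can be illustrated with a minimal sketch. The claim does not state the parameter values, the contents of the suppressive-term list, or how the values are combined, so everything below is an assumption made for illustration only: the lexicon `SUPPRESSIVE_TERMS`, the values -1 and +1, and the use of a product to combine them are hypothetical choices, not the patented method.

```python
from math import prod

# Hypothetical lexicon of biological terms with a suppressive meaning.
# The claim only says the terms are "predetermined".
SUPPRESSIVE_TERMS = {"inhibit", "suppress", "downregulate", "block", "reduce"}

FIRST_PARAM = -1   # assumed value for suppressive terms
SECOND_PARAM = 1   # assumed value for every other valid word


def assign_parameters(valid_words):
    """Assign the first or second parameter value to each valid word."""
    return [
        FIRST_PARAM if word.lower() in SUPPRESSIVE_TERMS else SECOND_PARAM
        for word in valid_words
    ]


def compute_grs(valid_words):
    """Combine the assigned values into one score.

    Using a product is an assumption: an odd number of suppressive
    terms gives a negative score, an even number a positive one.
    """
    return prod(assign_parameters(valid_words))
```

Under these assumptions, a valid word group such as `["inhibit", "expression"]` would score as a suppressive drug-gene relation, while a group with two suppressive terms would score as a positive one.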

(Deleted)

A method of deriving a new property of a drug, implemented by a computing device, the method comprising:
a DG relevance data extraction step of loading, into the computing device, DG relevance data including information on one or more related genes for each of one or more drugs, or deriving the DG relevance data from biomedical literature data; and
a drug characteristic derivation step of determining whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on a matching rate between gene information extracted from the DG relevance data of each drug and one or more DG topics that are extracted from the DG relevance data and composed of one or more pieces of gene information,
wherein the drug characteristic derivation step comprises:
a first step of calculating a matching rate between the gene information in the DG relevance data of each of the one or more drugs and the gene information included in each of the DG topics;
a second step of deriving one or more representative DG topics based on the matching rates, or scores derived from the matching rates, of one or more drugs known to have the target efficacy or side effect with respect to the DG topics; and
a third step of deriving a new drug predicted to have the target efficacy or side effect, based on the matching rates, or scores derived from the matching rates, of one or more drugs not known to have the target efficacy or side effect with respect to the representative DG topics,
wherein the DG topics are a plurality of topics derived by a topic modeling method that treats each drug in the DG relevance data as a document and each piece of gene information as a word.
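
As a rough illustration of the first step, the sketch below computes a matching rate between each drug's gene set and each topic's gene set. The claim does not define the matching rate formula, so the definition used here (the fraction of a topic's genes that also appear in the drug's DG relevance data) is an assumption, and the function names are hypothetical. The topics themselves are taken as given; in practice they would come from a topic model run with drugs as documents and genes as words.

```python
def matching_rate(drug_genes, topic_genes):
    """Fraction of the topic's genes found in the drug's gene set (assumed definition)."""
    topic = set(topic_genes)
    if not topic:
        return 0.0
    return len(set(drug_genes) & topic) / len(topic)


def matching_rates(dg_data, dg_topics):
    """Build a drug -> topic -> matching rate table.

    dg_data:   mapping of drug name to an iterable of related genes
    dg_topics: mapping of topic id to an iterable of genes in that topic
    """
    return {
        drug: {
            topic: matching_rate(genes, topic_genes)
            for topic, topic_genes in dg_topics.items()
        }
        for drug, genes in dg_data.items()
    }
```

Other definitions, such as a Jaccard index or a rate weighted by each gene's probability within the topic, would fit the claim wording equally well.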

(Deleted)

The method of claim 3,
wherein, in the second step,
for each of the one or more drugs known to have the target efficacy or side effect, one or more DG topics whose matching rate, or score derived from the matching rate, is equal to or higher than a preset criterion are derived as the representative DG topics.

The method of claim 3,
wherein the representative DG topics include a plurality of topics, and
wherein, in the third step,
a new drug predicted to have the target efficacy or side effect is derived from among the one or more drugs not known to have the target efficacy or side effect, based on a representative score derived from the matching rates, or scores derived from the matching rates, of each such drug for the plurality of topics belonging to the representative DG topics.

The method of claim 6,
wherein the representative score is the sum of the matching rates, or of the scores derived from the matching rates, of each of the one or more drugs not known to have the target efficacy or side effect for the plurality of topics belonging to the representative DG topics.
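
The second and third steps, as narrowed by the dependent claims above, can be sketched as a threshold selection followed by a summed score. This is a minimal sketch under stated assumptions: the `rates` table layout (drug to topic to matching rate), the function names, and the choice to rank candidates by descending score are all hypothetical, and the threshold stands in for the "preset criterion", whose value the claims do not give.

```python
def representative_topics(rates, known_drugs, threshold):
    """Collect topics whose matching rate meets the threshold for any known drug."""
    selected = set()
    for drug in known_drugs:
        for topic, rate in rates[drug].items():
            if rate >= threshold:
                selected.add(topic)
    return selected


def rank_candidates(rates, candidate_drugs, rep_topics):
    """Score each candidate by summing its matching rates over the representative topics.

    Returns (drug, score) pairs sorted from highest to lowest score.
    """
    scores = {
        drug: sum(rates[drug][topic] for topic in rep_topics)
        for drug in candidate_drugs
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Candidates near the top of the ranking would be the drugs predicted to share the target efficacy or side effect. How many to keep, or what cutoff score to apply, is left open here because the claims do not specify it.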

An apparatus for deriving a new property of a drug, comprising at least one memory and at least one processor, the apparatus comprising:
a DG relevance data extraction unit that loads, from the memory, DG relevance data including information on one or more related genes for each of one or more drugs, or derives the DG relevance data from biomedical literature data; and
a drug characteristic derivation unit that determines whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on a matching rate between gene information extracted from the DG relevance data of each drug and one or more DG topics that are extracted from the DG relevance data and composed of one or more pieces of gene information,
wherein the DG relevance data extraction unit comprises:
a text loading unit that loads text from the biomedical literature data;
a sentence extraction unit that extracts analysis target text from the loaded text according to a preset rule;
a sentence discrimination unit that determines whether the analysis target text contains a drug or a gene;
a valid word extraction unit that extracts, from the analysis target text, valid words related to the relationship between the drug and the gene; and
a first GRS calculation unit that extracts the DG relevance data by deriving the relationship between the drug and the gene from the valid words,
wherein the valid word extraction unit extracts, from the analysis target text, a valid word group comprising valid words including verbs and nouns, and
wherein the first GRS calculation unit performs:
assigning a first parameter value to a valid word of the valid word group when that element of the valid word group corresponds to a biological term having a predetermined suppressive meaning;
assigning a second parameter value to a valid word of the valid word group when that element of the valid word group does not correspond to a biological term having a predetermined suppressive meaning; and
a GRS derivation step of calculating a gene regulation score (GRS) for the drug and the gene based on the one or more parameter values assigned to the valid words of the valid word group.

An apparatus for deriving a new property of a drug, comprising at least one memory and at least one processor, the apparatus comprising:
a DG relevance data extraction unit that loads, from the memory, DG relevance data including information on one or more related genes for each of one or more drugs, or derives the DG relevance data from biomedical literature data; and
a drug characteristic derivation unit that determines whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on a matching rate between gene information extracted from the DG relevance data of each drug and one or more DG topics that are extracted from the DG relevance data and composed of one or more pieces of gene information,
wherein the drug characteristic derivation unit performs:
a first step of calculating a matching rate between the gene information in the DG relevance data of each of the one or more drugs and the gene information included in each of the DG topics;
a second step of deriving one or more representative DG topics based on the matching rates, or scores derived from the matching rates, of one or more drugs known to have the target efficacy or side effect with respect to the DG topics; and
a third step of deriving a new drug predicted to have the target efficacy or side effect, based on the matching rates, or scores derived from the matching rates, of one or more drugs not known to have the target efficacy or side effect with respect to the representative DG topics,
wherein the DG topics are a plurality of topics derived by a topic modeling method that treats each drug in the DG relevance data as a document and each piece of gene information as a word.

A computer-readable recording medium
storing instructions that cause a computing device to perform steps comprising:
a DG relevance data extraction step of loading, into the computing device, DG relevance data including information on one or more related genes for each of one or more drugs, or deriving the DG relevance data from biomedical literature data; and
a drug characteristic derivation step of determining whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on a matching rate between gene information extracted from the DG relevance data of each drug and one or more DG topics that are extracted from the DG relevance data and composed of one or more pieces of gene information,
wherein the DG relevance data extraction step comprises:
a text loading step of loading text from the biomedical literature data;
a sentence extraction step of extracting analysis target text from the loaded text according to a preset rule;
a sentence discrimination step of determining whether the analysis target text contains a drug or a gene;
a valid word extraction step of extracting, from the analysis target text, valid words related to the relationship between the drug and the gene; and
a first GRS calculation step of extracting the DG relevance data by deriving the relationship between the drug and the gene from the valid words,
wherein, in the valid word extraction step, a valid word group comprising valid words including verbs and nouns is extracted from the analysis target text, and
wherein the first GRS calculation step comprises:
assigning a first parameter value to a valid word of the valid word group when that element of the valid word group corresponds to a biological term having a predetermined suppressive meaning;
assigning a second parameter value to a valid word of the valid word group when that element of the valid word group does not correspond to a biological term having a predetermined suppressive meaning; and
a GRS derivation step of calculating a gene regulation score (GRS) for the drug and the gene based on the one or more parameter values assigned to the valid words of the valid word group.

A computer-readable recording medium
storing instructions that cause a computing device to perform steps comprising:
a DG relevance data extraction step of loading, into the computing device, DG relevance data including information on one or more related genes for each of one or more drugs, or deriving the DG relevance data from biomedical literature data; and
a drug characteristic derivation step of determining whether a drug whose target efficacy or side effect is unknown has the target efficacy or side effect, based on a matching rate between gene information extracted from the DG relevance data of each drug and one or more DG topics that are extracted from the DG relevance data and composed of one or more pieces of gene information,
wherein the drug characteristic derivation step comprises:
a first step of calculating a matching rate between the gene information in the DG relevance data of each of the one or more drugs and the gene information included in each of the DG topics;
a second step of deriving one or more representative DG topics based on the matching rates, or scores derived from the matching rates, of one or more drugs known to have the target efficacy or side effect with respect to the DG topics; and
a third step of deriving a new drug predicted to have the target efficacy or side effect, based on the matching rates, or scores derived from the matching rates, of one or more drugs not known to have the target efficacy or side effect with respect to the representative DG topics,
wherein the DG topics are a plurality of topics derived by a topic modeling method that treats each drug in the DG relevance data as a document and each piece of gene information as a word.

(Deleted)