KR102447359B1

KR102447359B1 - Apparatus and method for predicting novel disease genes based on the integration of diverse gene-gene relations

Info

Publication number: KR102447359B1
Application number: KR1020200027783A
Authority: KR
Inventors: 이관수; 김율
Original assignee: 한국과학기술원
Priority date: 2019-03-07
Filing date: 2020-03-05
Publication date: 2022-09-26
Also published as: KR20200107840A

Abstract

A method of operating a disease gene discovery apparatus includes constructing a network in which genes and functions are connected by combining gene-gene relation information with functional group information for the genes, and repeating an analysis cycle that selects new disease gene candidates from the network. The analysis cycle includes: calculating a disease significance score for each functional group based on the statistical significance of the disease genes it contains; calculating a functional similarity disease score for each gene by summing, over the network, the disease significance scores of the functional groups associated with that gene; propagating over the network an initial disease score for each gene that reflects its functional similarity disease score; selecting new disease gene candidates based on the gene disease scores computed by the network propagation; and adding the new disease gene candidates to the disease gene information used in the next analysis cycle.

Description

Apparatus and method for predicting novel disease genes based on the integration of diverse gene-gene relations

The present invention relates to a technology for predicting novel disease genes.

A disease gene is a gene that can modulate a disease and is therefore a drug target candidate. Technology for predicting disease genes is thus among the most needed in the pharmaceutical industry for developing effective treatments and drugs. With recent advances in high-throughput screening, many studies are under way to discover drug target candidates or disease genes experimentally. However, experimental discovery consumes considerable time and money, and its yield falls far short of the expected number of drug target candidates. More recently, methods have been proposed that predict disease genes computationally and then verify them experimentally, so that drug target candidates can be discovered quickly and at low cost.

The principle behind disease gene prediction techniques can be summarized as scoring similarity to previously known disease genes through various gene association relationships. These techniques fall broadly into functional-similarity-based techniques and network-propagation-based techniques. ToppGene and Endeavour assume that genes associated with the same disease have similar functions in the cell, and predict new disease genes by scoring functional similarity to known disease genes. PINTA and DADA, by contrast, assume that genes associated with the same disease are closely co-regulated through protein-protein interactions and the like; they propagate the disease scores of known disease genes over a network connected by protein interactions and predict new disease genes from the resulting scores.

PRINCE and HybridRanker use both approaches together to combine the strengths of functional-similarity-based and network-propagation-based techniques. However, they run the functional similarity analysis and the network propagation analysis separately and simply sum the resulting scores to predict disease genes. They therefore cannot reach the performance that could be obtained by feeding the results of each method into the computation of the other. In addition, they are built on far less gene association information than the resources now available, and they confine inference over the associations to a single pass, which fundamentally limits improvements in sensitivity and specificity.

One problem to be solved is to provide an apparatus and method for predicting novel disease genes that computes gene functional group information and network characteristics in conjunction with each other and reuses the information discovered through an iterative process.

Another problem to be solved is to provide an apparatus and method that improves the likelihood of discovering disease genes by repeatedly applying the new gene-disease associations discovered in each analysis cycle to the functional group analysis and the network propagation process.

According to one embodiment, a method of operating a disease gene discovery apparatus operated by at least one processor includes constructing a network in which genes and functions are connected by combining gene-gene relation information with functional group information for the genes, and repeating an analysis cycle that selects new disease gene candidates from the network. The analysis cycle includes: calculating a disease significance score for each functional group based on the statistical significance of the disease genes it contains; calculating a functional similarity disease score for each gene by summing, over the network, the disease significance scores of the functional groups associated with that gene; propagating over the network an initial disease score for each gene that reflects its functional similarity disease score; selecting new disease gene candidates based on the gene disease scores computed by the network propagation; and adding the new disease gene candidates to the disease gene information used in the next analysis cycle.

A new disease gene candidate selected in the current cycle may change, in the next cycle, the disease significance scores of the functional groups associated with that candidate.

The analysis cycle may further include scoring, as a prior disease score, the disease significance of each gene in the network that is a known disease gene or a disease gene candidate selected in a previous analysis cycle. The initial disease score may be assigned as a weighted sum of each gene's functional similarity disease score and prior disease score.

The scoring of the prior disease score may assign different scores to known disease genes and to disease gene candidates selected in previous analysis cycles.

The calculating of the disease significance score of each functional group may compute the disease significance based on the ratio of disease genes to ordinary genes in that functional group.

The method may further include terminating without repeating the analysis cycle when the network propagation yields no gene with a significant disease score.

The network may include a gene network that combines gene-gene relation information from a plurality of public databases, and a gene-function network in which genes are linked to the functional groups they are associated with; at least one known disease gene may be associated with at least one functional group.

According to another embodiment, a method of operating a disease gene discovery apparatus operated by at least one processor includes: calculating, for each functional group with which genes are associated, a disease significance score of that functional group based on the statistical significance of the disease genes associated with it; calculating, in a gene network constructed from gene-gene relation information, a functional similarity disease score for each gene by summing the disease significance scores of the functional groups associated with that gene; assigning, to each gene in the gene network that is a known disease gene or a new disease gene candidate, a prior disease score that scores the disease significance of that gene; assigning, to each gene in the gene network, the weighted sum of its functional similarity disease score and its prior disease score as its initial disease score; and selecting new disease gene candidates based on the network propagation disease score of each gene, computed by propagating the initial disease scores over the gene network.

The calculating of the functional similarity disease score of each gene may assign the disease significance score of each functional group to its associated genes, and compute the functional similarity disease score by normalizing the log sum of the disease significance p-values assigned to each gene.

The assigning of the prior disease score may give different prior disease scores to the known disease genes and to the new disease gene candidates.

When a gene associated with a specific functional group was selected as a new disease gene candidate in a previous analysis cycle, the calculating of the disease significance score may compute the disease significance score of that functional group so that it reflects the statistical significance contributed by the new candidate.

When the disease significance score of the specific functional group changes, the functional similarity disease scores of the genes associated with that functional group may change.

A gene selected as a new disease gene candidate in the previous analysis cycle may be given, in the current analysis cycle, a functional similarity disease score and a prior disease score that differ from those computed in the previous cycle, and may therefore propagate an initial disease score different from that of the previous cycle.

The method may further include adding the new disease gene candidates to the disease gene information used in the next analysis cycle.

The selecting of the new disease gene candidates may further include ending the discovery of new disease genes when the network propagation yields no gene in the gene network with a significant network propagation disease score.

According to an embodiment, the technology of the present invention couples gene functional group analysis and network propagation analysis in a single computation, and can therefore discover new disease genes more accurately than either conventional method alone or integrated methods that statistically sum the results of the two.

According to an embodiment, the disease association information obtained in each analysis cycle is applied repeatedly to the coupled functional group analysis and network propagation analysis, which contributes to improved performance.

According to embodiments that use actual disease genes usable as disease markers and actual drug targets usable in drug development, the disease gene prediction technology of the present invention can be used to discover drug target candidates, as well as companion diagnostic marker candidates through disease state determination and drug response prediction.

According to an embodiment, the invention can accelerate the combination of IT-based hardware, such as healthcare-related medical devices and web-based health management services, with BT-based biomarker and drug target content derived from molecular biological information such as genomics, proteomics, and epigenomics, and can thereby contribute to the development of the healthcare and pharmaceutical industries.

FIG. 1 is a hardware configuration diagram of a disease gene discovery apparatus according to an embodiment.
FIG. 2 is a flowchart of a method of operating a disease gene discovery apparatus according to an embodiment.
FIG. 3 is a diagram explaining the operation of a disease gene discovery apparatus according to an embodiment.
FIG. 4 is a diagram showing the results of discovering rheumatoid arthritis disease genes according to an embodiment.
FIG. 5 and FIG. 6 are graphs of the results of evaluating the performance of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless stated otherwise. Terms such as "…unit", "…er", and "module" used in the specification denote a unit that processes at least one function or operation, and may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 is a hardware configuration diagram of a disease gene discovery apparatus according to an embodiment.

Referring to FIG. 1, the disease gene discovery apparatus 100 is a computing device in which at least one processor executes a program describing the operations of the present invention.

The hardware of the disease gene discovery apparatus 100 may include at least one processor 110, a memory 130, a storage 150, and a communication interface 170, which may be connected through a bus. Other hardware, such as input and output devices, may also be included. The disease gene discovery apparatus 100 may carry various software, including an operating system capable of running the program.

The processor 110 controls the operation of the disease gene discovery apparatus 100 and may be any of various types of processor that process the instructions in a program, for example a CPU (Central Processing Unit), MPU (Micro Processor Unit), MCU (Micro Controller Unit), or GPU (Graphic Processing Unit). The memory 130 loads the program so that the instructions written to carry out the operations of the present invention are processed by the processor 110. The memory 130 may be, for example, ROM (read only memory) or RAM (random access memory). The storage 150 stores the data and programs required to carry out the operations of the present invention. The communication interface 170 is a wired/wireless communication module and can interwork with external databases over a wired or wireless network.

The disease gene discovery apparatus 100 predicts new disease genes by integrating gene association relationships. In each analysis cycle, the apparatus 100 analyzes the association between disease genes and the functional groups that contain genes to compute the disease significance of the functional groups, and from this computes the "functional similarity disease score" of each gene. The apparatus 100 also computes the "prior disease score" of each gene. It then propagates over the network the initial disease score of each gene, which reflects both the "functional similarity disease score" and the "prior disease score", and analyzes the disease scores finally obtained through the propagation to discover new disease gene candidates. The apparatus 100 feeds the new gene disease scores computed in one analysis cycle into the calculation of the "functional similarity disease score" and "prior disease score" in the next cycle and propagates again; through this iterative process it captures indirect associations between genes that existing methods cannot reflect.

The method of operating the disease gene discovery apparatus is described in detail below.

FIG. 2 is a flowchart of a method of operating a disease gene discovery apparatus according to an embodiment.

Referring to FIG. 2, the disease gene discovery apparatus 100 combines gene-gene relation information and functional group information collected from a plurality of public databases to construct a network in which genes and functions are connected (S110). In the network, the genes, which correspond to nodes, are connected according to the relation information, and at least some of the genes are associated with functions according to the functional group information. It is assumed that the genes making up the network include previously known disease genes and that each disease gene belongs to at least one functional group. The apparatus 100 can build the network from various databases; for example, it can collect the protein-protein interaction network of HIPPIE and the signaling pathways of Graphite, and extend the gene-gene relations with regulatory relation resources such as TRANSFAC, E3Net, PhosphoSitePlus, and DEPOD. The apparatus 100 can collect functional groups from, for example, the Molecular Signatures Database, Enrichr, and Gene Ontology, and can merge similar functional groups among those collected.
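A minimal sketch of step S110, assuming the relation information has already been reduced to gene pairs and the functional groups to gene sets. The function name `build_network` and both input formats are hypothetical; the patent does not prescribe a data structure.

```python
from collections import defaultdict


def build_network(gene_relations, functional_groups):
    """Combine gene-gene relations and functional-group membership (S110).

    gene_relations: iterable of (gene_a, gene_b) pairs (assumed input format).
    functional_groups: dict mapping a function name to a set of member genes.
    Returns (gene_edges, gene_to_functions).
    """
    gene_edges = defaultdict(set)
    for a, b in gene_relations:
        if a == b:
            continue  # ignore self-relations
        gene_edges[a].add(b)
        gene_edges[b].add(a)  # assumption: relations are treated as undirected

    gene_to_functions = defaultdict(set)
    for function, members in functional_groups.items():
        for gene in members:
            gene_to_functions[gene].add(function)
    return dict(gene_edges), dict(gene_to_functions)


# Toy data, not taken from any of the databases named above.
relations = [("G1", "G2"), ("G2", "G3")]
groups = {"F1": {"G1", "G2"}, "F2": {"G3"}}
edges, memberships = build_network(relations, groups)
```

Keeping the gene-gene edges and the gene-function memberships in separate mappings mirrors the distinction the description later draws between the gene network and the gene-function association network.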

The disease gene discovery apparatus 100 calculates the disease significance score of each functional group based on the statistical significance of the disease genes that the functional group contains (S120). Because the set of disease genes can be updated as the apparatus 100 iterates, the disease significance of each function can be updated in every analysis cycle by the newly included disease genes.

The disease gene discovery apparatus 100 calculates the "functional similarity disease score" of each gene by summing the disease significance scores of all functional groups associated with that gene (S130). Specifically, the apparatus 100 can sum the disease significance scores of all functional groups associated with a gene to obtain that gene's disease score, and normalize it to obtain the gene's "functional similarity disease score".
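Step S130 could be sketched as follows, reading the "log sum of p-values" mentioned in the summary as a sum of -log10(p) over a gene's functional groups. The base of the logarithm, the division by the maximum as the normalization, and the function name are all assumptions; the text does not fix them.

```python
import math


def functional_similarity_scores(gene_to_functions, function_pvalues, floor=1e-300):
    """Sum -log10(p) over each gene's functional groups, then scale to [0, 1].

    gene_to_functions: dict gene -> set of function names.
    function_pvalues: dict function name -> disease significance p-value.
    floor: lower bound on p-values to avoid log(0) (illustrative choice).
    """
    raw = {}
    for gene, functions in gene_to_functions.items():
        raw[gene] = sum(
            -math.log10(max(function_pvalues[f], floor)) for f in functions
        )
    top = max(raw.values(), default=0.0)
    if top <= 0.0:
        return {gene: 0.0 for gene in raw}
    # Assumed normalization: divide by the largest raw score.
    return {gene: score / top for gene, score in raw.items()}


scores = functional_similarity_scores(
    {"G1": {"F1", "F2"}, "G2": {"F2"}},
    {"F1": 0.01, "F2": 0.1},
)
```

In this toy input G1 belongs to both a strongly and a weakly significant group, so it receives the top score, while G2 receives one third of it.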

Separately from the "functional similarity disease score", the disease gene discovery apparatus 100 calculates a "prior disease score" that scores, for each gene, the disease significance that is already known or has been newly obtained by the apparatus (S140). In the very first initial disease score, a gene's "prior disease score" is determined solely by whether it has previously known disease significance; in the repeated cycles that follow a network propagation pass, the initial disease score is determined with the disease significance obtained by network propagation added.

The disease gene discovery apparatus 100 assigns the initial disease score of each gene node in the gene network as the weighted sum of that gene's "functional similarity disease score" and "prior disease score" (S150).
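Steps S140 and S150 can be illustrated with the sketch below. The concrete prior values (1.0 for known disease genes, 0.5 for candidates from earlier cycles) and the mixing weight `alpha` are illustrative assumptions; the text says only that the two kinds of gene may be scored differently and that the initial score is a weighted sum.

```python
def prior_scores(genes, known_disease_genes, candidate_genes,
                 known_score=1.0, candidate_score=0.5):
    """Assign differential prior disease scores (S140); values are assumed."""
    scores = {}
    for gene in genes:
        if gene in known_disease_genes:
            scores[gene] = known_score
        elif gene in candidate_genes:
            scores[gene] = candidate_score
        else:
            scores[gene] = 0.0
    return scores


def initial_scores(functional, prior, alpha=0.5):
    """Weighted sum of functional similarity and prior disease scores (S150).

    alpha is a hypothetical weight on the functional similarity score.
    """
    genes = set(functional) | set(prior)
    return {
        gene: alpha * functional.get(gene, 0.0) + (1 - alpha) * prior.get(gene, 0.0)
        for gene in genes
    }


prior = prior_scores(["G1", "G2", "G3"], {"G1"}, {"G3"})
initial = initial_scores({"G1": 1.0, "G2": 0.4}, prior, alpha=0.5)
```

With these toy inputs a gene that is neither known nor a candidate (G2) still starts with a non-zero score through its functional similarity, which is what lets functional evidence seed the propagation.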

The disease gene discovery apparatus 100 calculates a new "network propagation disease score" for the genes by network propagation, which spreads the initial disease scores assigned to the gene nodes through the gene network (S160). The "network propagation disease score" may be referred to simply as the disease score.
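The text does not name a specific propagation algorithm for S160. As one common choice, the sketch below uses a random walk with restart; the restart probability, the convergence tolerance, and the function name `propagate` are assumptions made for illustration.

```python
def propagate(edges, initial, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart over a weighted gene network.

    edges: dict gene -> dict neighbour -> weight; every neighbour must also
           appear as a key of `edges` (assumed symmetric network).
    initial: dict gene -> initial disease score (missing genes count as 0).
    """
    genes = list(edges)
    total = sum(initial.get(g, 0.0) for g in genes)
    if total == 0:
        return {g: 0.0 for g in genes}
    p0 = {g: initial.get(g, 0.0) / total for g in genes}  # restart vector
    out_weight = {g: sum(edges[g].values()) for g in genes}

    scores = dict(p0)
    for _ in range(max_iter):
        new = {g: restart * p0[g] for g in genes}
        for g in genes:
            if out_weight[g] == 0:
                new[g] += (1 - restart) * scores[g]  # isolated node keeps its mass
                continue
            for nb, w in edges[g].items():
                new[nb] += (1 - restart) * scores[g] * w / out_weight[g]
        delta = sum(abs(new[g] - scores[g]) for g in genes)
        scores = new
        if delta < tol:
            break
    return scores


# Toy chain A - B - C with unit weights; only A starts with a score.
chain = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
result = propagate(chain, {"A": 1.0})
```

On the toy chain the score decays with distance from the seed gene while the total mass stays at 1, which is the qualitative behaviour the step relies on.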

The disease gene discovery apparatus 100 determines whether any gene has a significant "network propagation disease score" (for example, a p-value of 0.05 or less) (S170).

The disease gene discovery apparatus 100 updates the genes with significant network propagation disease scores as new disease gene candidates and feeds them back into the next analysis cycle (S180). In the next analysis cycle, the apparatus 100 uses the updated disease gene candidate information to recalculate the disease significance of the functional groups and the functional similarity disease scores and prior disease scores of the genes.

When no gene has a significant network propagation disease score, so that no further new disease gene candidates are discovered, the disease gene discovery apparatus 100 ends the disease gene discovery procedure (S190).
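The control flow of S170 to S190 can be sketched as a loop around a per-cycle scoring routine. Here `run_cycle` is a hypothetical callable standing in for steps S120 to S160 that returns a p-value per gene; `toy_cycle` is a made-up stand-in used only to exercise the loop, and `max_cycles` is a safety bound not mentioned in the text.

```python
def discover_disease_genes(known_genes, run_cycle, p_cutoff=0.05, max_cycles=50):
    """Repeat analysis cycles until no new significant gene appears.

    run_cycle(known, candidates) -> dict gene -> p-value of its network
    propagation disease score (hypothetical interface).
    """
    known = set(known_genes)
    candidates = set()
    history = []
    for _ in range(max_cycles):
        pvalues = run_cycle(known, candidates)
        new = {
            gene for gene, p in pvalues.items()
            if p <= p_cutoff and gene not in known and gene not in candidates
        }
        if not new:
            break  # S190: nothing significant was found, stop
        candidates |= new  # S180: feed candidates into the next cycle
        history.append(sorted(new))
    return candidates, history


def toy_cycle(known, candidates):
    """Toy scoring: a gene is significant if its predecessor is flagged."""
    order = ["G1", "G2", "G3", "G4"]
    flagged = known | candidates
    return {
        gene: 0.01 if i > 0 and order[i - 1] in flagged else 0.9
        for i, gene in enumerate(order)
    }


found, per_cycle = discover_disease_genes({"G1"}, toy_cycle)
```

The toy run shows the point of the feedback: G3 and G4 only become reachable after G2 has been added as a candidate in an earlier cycle.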

In this way, the disease gene discovery apparatus 100 computes gene disease scores while repeatedly updating the functional similarity disease score and the network propagation disease score so that each reflects the other, and can therefore discover disease genes more accurately and efficiently than the prior art.

The operation of the disease gene discovery apparatus is described in more detail below.

FIG. 3 is a diagram explaining the operation of a disease gene discovery apparatus according to an embodiment. For the purposes of explanation, the connections (solid lines) between genes (circles) are called the gene network 10, the associations (dotted lines) between genes and functions (squares) are called the gene-function association network 20, and the gene network 10 and the gene-function association network 20 together are referred to as the network 30.

Referring to FIG. 3(a), the disease gene discovery apparatus 100 constructs the gene network 10 from the gene-gene relation information contained in a plurality of public databases. Various databases can be used to build the gene network 10. As described above, the apparatus 100 can collect the protein-protein interaction network of HIPPIE and the signaling pathways of Graphite, and extend the gene-gene relations with regulatory relation resources such as TRANSFAC, E3Net, PhosphoSitePlus, and DEPOD. The apparatus 100 can define the number of relation resources in which a gene pair appears as the relation reliability of that pair, and thereby form gene pairs with a normalized degree of association. In this way, a gene network 10 consisting of 379,730 associations among a total of 15,165 genes can be constructed.
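A small sketch of the relation reliability described above: count how many resources contain each gene pair and scale the count. Dividing by the number of resources is an assumed normalization, and treating pairs as undirected is a simplification (regulatory relations in the source databases may be directed).

```python
from collections import Counter


def edge_reliability(resources):
    """Score each gene pair by the fraction of resources that report it.

    resources: dict resource name -> iterable of (gene_a, gene_b) pairs
               (hypothetical input format).
    Returns dict frozenset({gene_a, gene_b}) -> reliability in (0, 1].
    """
    counts = Counter()
    for pairs in resources.values():
        # De-duplicate within one resource and ignore pair orientation.
        unique = {frozenset(pair) for pair in pairs if pair[0] != pair[1]}
        counts.update(unique)
    n_resources = len(resources)
    return {pair: count / n_resources for pair, count in counts.items()}


# Toy resources; the names are placeholders, not real database contents.
toy_resources = {
    "ppi": [("A", "B"), ("B", "C")],
    "pathway": [("B", "A")],
}
weights = edge_reliability(toy_resources)
```

These weights could serve as the edge weights of the gene network during propagation, so that relations supported by several resources carry more of the score.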

In addition, the disease gene discovery apparatus 100 collects functional groups, each of which is a set of genes grouped under the same function, from a plurality of public databases such as the Molecular Signatures Database, Enrichr, and Gene Ontology. Because functional groups collected from different databases may overlap, the disease gene discovery apparatus 100 needs to merge similar functional groups based on their similarity. To this end, the disease gene discovery apparatus 100 may extract similar functional groups from among the functional groups classified by type of function and merge them into a single functional group. For example, the collected functional groups may be categorized into three classes: cellular function, molecular relation, and co-localization. The collected functional groups may be classified automatically by the disease gene discovery apparatus 100 or manually. The disease gene discovery apparatus 100 may merge a pair of similar functional groups into one functional group based on a similarity measure such as the Jaccard coefficient. In this way, the disease gene discovery apparatus 100 can obtain 28,465 functional groups and the gene information they contain from the Molecular Signatures Database, Enrichr, and Gene Ontology, and can construct the gene-function association network 20 by connecting the genes (circles) in each functional group to the function (square) of that functional group.
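The merging of similar functional groups by Jaccard coefficient can be sketched as follows. The greedy single-pass strategy, the threshold of 0.7, and the names `jaccard` and `merge_similar_groups` are assumptions for illustration; the text does not specify a threshold or a merge order, and the sketch assumes that all input groups already belong to the same category.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar_groups(groups, threshold=0.7):
    """Greedily merge groups whose Jaccard coefficient reaches the threshold.

    groups: {group name: set of genes}. Returns {merged name: set of genes}.
    """
    merged = []  # each entry is [list of source names, union of genes]
    for name, genes in groups.items():
        genes = set(genes)
        for entry in merged:
            if jaccard(entry[1], genes) >= threshold:
                entry[0].append(name)
                entry[1] |= genes
                break
        else:
            merged.append([[name], genes])
    return {"|".join(names): genes for names, genes in merged}


# Toy functional groups; names and members are illustrative only.
groups = {
    "GO:a": {"A", "B", "C", "D"},
    "MSigDB:a": {"A", "B", "C", "D", "E"},
    "GO:b": {"X", "Y"},
}
merged_groups = merge_similar_groups(groups)
```

Here the first two groups share 4 of 5 genes (Jaccard 0.8) and are merged, while the third stays separate.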

Referring to (b), the disease gene discovery apparatus 100 calculates, from the disease gene information associated with the functional groups, the disease significance of every function included in the gene-function association network 20.

For example, the disease gene discovery apparatus 100 may calculate the disease association of a function 21 based on the ratio of disease genes (11, black circles) to ordinary genes (12, white circles) associated with the function 21. The disease gene discovery apparatus 100 may then calculate the statistical significance of the function 21 as a p-value from the disease association values of all functions included in the gene-function association network 20, and assign it as the disease significance of the function 21.

The disease significance p-value of each function may be calculated as in Equation 1. Here, the p-value is adjusted by the false discovery rate (FDR).

In Equation 1, G is the total number of genes, S is the total number of disease genes, M is the number of genes in the functional group, and k is the number of disease genes in the functional group.
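The four quantities G, S, M, and k are the usual parameters of a hypergeometric enrichment test, so one plausible reading of Equation 1 is the upper tail probability of observing at least k disease genes in a group of M genes. The sketch below rests on that assumption, and it further assumes the Benjamini-Hochberg procedure for the FDR adjustment; neither choice is confirmed by the text. The function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(G, S, M, k):
    """P(X >= k) when M genes are drawn without replacement from G genes,
    of which S are disease genes (assumed form of Equation 1)."""
    upper = min(M, S)
    tail = sum(comb(S, i) * comb(G - S, M - i) for i in range(k, upper + 1))
    return tail / comb(G, M)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers: 10 genes, 3 disease genes, a group of 4 containing all 3.
p_group = hypergeom_pvalue(G=10, S=3, M=4, k=3)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

In the toy case only 7 of the 210 possible groups of 4 genes contain all 3 disease genes, so the unadjusted p-value is 7/210.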

Referring to (c), the disease gene discovery apparatus 100 assigns the initial disease score P₀(g) that is to be propagated over the gene network 10. The disease gene discovery apparatus 100 assigns the disease significance p-value of a function (21, gray square) to the genes connected to that function. Because each gene may be associated with multiple functional groups, the "functional similarity disease score" of each gene is calculated by aggregating the disease significance p-values of the functions associated with that gene. The functional similarity disease score of a gene may be defined as the log sum of the disease significance p-values of the functions associated with the gene. The disease gene discovery apparatus 100 may rescale the functional similarity disease scores of the genes to the range [0, 1] by min-max normalization. The functional similarity disease score S_f(g) of a gene may be expressed as Equation 2.
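The log sum followed by min-max normalization can be sketched as follows. The sign convention (negative logarithm, so that smaller p-values give larger scores), the base-10 logarithm, and the floor that guards against log(0) are assumptions; the text only states "log sum" and the [0, 1] range. The name `functional_similarity_scores` is hypothetical.

```python
import math


def functional_similarity_scores(gene_to_functions, function_pvalue, floor=1e-300):
    """Return S_f(g) in [0, 1] for every gene.

    raw(g) = sum over functions f of g of -log10(p_f)   (assumed sign and base)
    S_f(g) = (raw(g) - min raw) / (max raw - min raw)   (min-max normalization)
    """
    raw = {
        gene: sum(-math.log10(max(function_pvalue[f], floor)) for f in functions)
        for gene, functions in gene_to_functions.items()
    }
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All genes tie, so no gene is distinguished by function.
        return {gene: 0.0 for gene in raw}
    return {gene: (value - lo) / (hi - lo) for gene, value in raw.items()}


# Toy p-values and memberships; names are illustrative only.
function_pvalue = {"f1": 0.001, "f2": 0.1, "f3": 1.0}
gene_to_functions = {"A": ["f1", "f2"], "B": ["f2"], "C": ["f3"]}
s_f = functional_similarity_scores(gene_to_functions, function_pvalue)
```

Gene A belongs to two significant functions (raw score 3 + 1 = 4) and ends at 1.0, gene B at 0.25, and gene C at 0.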

Conventional network propagation-based methods assign the initial disease score of each gene node by reflecting only the genes already known to be significant for the disease, and then obtain the final disease score by network propagation. In contrast, the disease gene discovery apparatus 100 of the present invention adjusts the initial disease score by including, together with the previously known disease gene scores, both the new disease gene scores obtained repeatedly from the apparatus's network propagation process and the new disease gene scores obtained repeatedly in the functional similarity analysis process described above. In particular, the present invention assigns a disease score (the functional similarity disease score) to genes that are not known as disease genes but are functionally highly associated with the disease, so that these genes also take part in network propagation. This overcomes a limitation of existing network propagation methods, whose disease gene discovery depends heavily on the initial values as well as on the network structure.

The method by which the disease gene discovery apparatus 100 calculates the initial disease score may be expressed as Equation 3. The initial disease score P₀(g) of a gene may be defined as the weighted sum of the gene's "functional similarity disease score" S_f(g) and its "prior disease score" S_p(g). In Equation 3, the coefficient that balances the two scores is the weight.

The prior disease score S_p(g) of Equation 3 is defined as in Equation 4.

In Equation 4, the prior disease score S_p(g) may be assigned 1 for a previously known disease gene (seed gene), a score in the range [0, 1] for a new disease gene candidate discovered by the disease gene discovery apparatus 100, and 0 for any other gene.
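The prior disease score and the weighted sum that forms the initial disease score can be sketched as follows. The weight name `alpha`, its default of 0.5, and the convex form alpha * S_f(g) + (1 - alpha) * S_p(g) are assumptions, since the text states only that the initial score is a weighted sum of the two scores. The function names are hypothetical.

```python
def prior_score(gene, seed_genes, candidate_scores):
    """S_p(g): 1 for a known (seed) disease gene, a value in [0, 1] for a
    candidate found in an earlier cycle, and 0 otherwise."""
    if gene in seed_genes:
        return 1.0
    return candidate_scores.get(gene, 0.0)


def initial_scores(genes, s_f, seed_genes, candidate_scores, alpha=0.5):
    """P0(g) = alpha * S_f(g) + (1 - alpha) * S_p(g)  (assumed weighting)."""
    return {
        g: alpha * s_f.get(g, 0.0)
        + (1 - alpha) * prior_score(g, seed_genes, candidate_scores)
        for g in genes
    }


# Toy inputs; gene names and scores are illustrative only.
genes = ["A", "B", "C"]
s_f = {"A": 1.0, "B": 0.4}
p0 = initial_scores(genes, s_f, seed_genes={"A"}, candidate_scores={"B": 0.6})
```

Gene A is a seed with a full functional score and starts at 1.0, gene B is an earlier candidate and starts at 0.5, and gene C has neither score and starts at 0.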

For example, if the gene 11 is a known disease gene, a prior disease score of 1 is reflected in its initial disease score. If the gene 12 is not a known disease gene and receives no functional similarity disease score, its initial disease score is 0. After the first discovery cycle of the apparatus, however, the gene 12 may be selected as a new disease gene candidate, so that a prior disease score in the range [0, 1] is reflected, and a functional similarity disease score derived from the updated disease significance p-values of the functions may also be reflected.

Referring to (d), the disease gene discovery apparatus 100 propagates the initial disease scores P₀(g) of the genes (for example, 11 and 12) in the gene network 10 to the genes connected to them. The initial disease score P₀(g) of a gene is passed to neighboring genes along the network, repeatedly and sequentially, at a rate r. The disease score after the t-th propagation step, P^t(g), may be defined as in Equation 5.

In Equation 5, W' is the normalized form of the weighted adjacency matrix W and is defined as in Equation 6.

D denotes a diagonal matrix whose diagonal element D(i, i) is the sum of the i-th row of W. The network propagation may also use other known techniques, such as the algorithm (Vanunu et al.) that uses the normalized weighted adjacency matrix of Equation 5. Because both the incoming and outgoing flow of each gene are taken into account when the weighted adjacency matrix W is normalized, this approach differs from the random walk with restart (RWR) method used by previously developed disease gene discovery methods based on network propagation.
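The propagation step can be sketched as follows. Since Equations 5 and 6 are not reproduced in this text, the sketch assumes the forms commonly associated with Vanunu et al.: the symmetric normalization W' = D^(-1/2) W D^(-1/2), which scales each edge by the degrees of both endpoints, and the update P^t = r W' P^(t-1) + (1 - r) P₀. The value r = 0.5, the convergence tolerance, and the function names are illustrative assumptions.

```python
import math


def normalize_adjacency(W):
    """W'[i][j] = W[i][j] / sqrt(D(i,i) * D(j,j)), with D(i,i) the row sum
    of W (assumed form of Equation 6)."""
    n = len(W)
    degree = [sum(row) for row in W]
    return [
        [W[i][j] / math.sqrt(degree[i] * degree[j]) if W[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def propagate(W, p0, r=0.5, tol=1e-10, max_iter=10000):
    """Iterate P^t = r * W' P^(t-1) + (1 - r) * P0 until the scores settle
    (assumed form of Equation 5)."""
    Wn = normalize_adjacency(W)
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [r * sum(Wn[i][j] * p[j] for j in range(n)) + (1 - r) * p0[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy path network 0 - 1 - 2 in which only gene 0 has an initial score.
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = propagate(W, [1.0, 0.0, 0.0])
```

On this path the score decays with distance from the seed: gene 0 keeps the highest score, its neighbor receives a smaller one, and the gene two steps away receives the smallest.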

At the last step of each analysis cycle, the disease gene discovery apparatus 100 converts the disease scores of the genes obtained through network propagation into Z-scores, and then selects the genes with significantly high scores (p-value of 0.05 or less) as new disease genes.
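The Z-score selection step can be sketched as follows. The sketch assumes that the p-value is the one-sided upper tail of a standard normal distribution and that the Z-score uses the mean and population standard deviation over all genes; the text specifies only Z-scoring and the 0.05 cutoff. The name `select_candidates` is hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def select_candidates(scores, known_genes, alpha=0.05):
    """Return {gene: p-value} for genes that are not already known and whose
    Z-scored propagation score has an upper-tail p-value <= alpha."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}
    selected = {}
    for gene, score in scores.items():
        z = (score - mu) / sigma
        p = 1.0 - NormalDist().cdf(z)  # assumed one-sided normal tail
        if p <= alpha and gene not in known_genes:
            selected[gene] = p
    return selected


# Toy scores: 19 background genes, one known gene K, one high-scoring gene X.
scores = {f"g{i}": 0.1 for i in range(19)}
scores.update({"K": 0.9, "X": 0.9})
candidates = select_candidates(scores, known_genes={"K"})
```

Only gene X is returned: gene K scores just as high but is already known, and the background genes fall below the mean.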

The disease gene discovery apparatus 100 adds the new disease genes selected in this way to its disease gene information. The disease gene discovery apparatus 100 then uses the updated disease genes to recalculate the disease significance scores of the functional groups and the "functional similarity disease score" of each gene derived from them, and repeats the procedure of network-propagating the initial disease score of each gene, which reflects the recalculated functional similarity disease score.

The disease gene discovery apparatus 100 repeats this procedure until no additional new disease gene is selected. In this way it can discover disease gene candidates by linking together the known disease significance of genes, the association between genes and cellular functions, and the associations among genes.
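The repeat-until-no-new-candidate control flow can be sketched as follows. The callable `run_cycle` is a hypothetical stand-in for one full analysis cycle (functional group scoring, initial score assignment, network propagation, and selection), and the `max_cycles` safety cap is an assumption that does not appear in the text.

```python
def discover(seed_genes, run_cycle, max_cycles=50):
    """Repeat analysis cycles, feeding each cycle's candidates into the next,
    until a cycle selects no new gene.

    run_cycle(disease_genes) -> iterable of genes selected in that cycle.
    Returns the final disease gene set and the per-cycle additions.
    """
    disease_genes = set(seed_genes)
    history = []
    for _ in range(max_cycles):
        new_genes = set(run_cycle(disease_genes)) - disease_genes
        if not new_genes:
            break  # stopping condition: no additional candidate selected
        history.append(new_genes)
        disease_genes |= new_genes
    return disease_genes, history


# Toy stand-in for a cycle: select every neighbor of a current disease gene.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}


def toy_cycle(disease_genes):
    return {n for g in disease_genes for n in adjacency[g]}


final_genes, history = discover({"A"}, toy_cycle)
```

With the toy cycle, one gene is added per cycle along the chain A, B, C, D, and the loop stops on the fourth cycle when nothing new is selected.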

FIG. 4 is a diagram showing the result of discovering rheumatoid arthritis disease genes according to one embodiment.

Referring to FIG. 4, the disease gene discovery apparatus 100 repeats an analysis cycle in which it calculates the "functional similarity disease score" of each gene and network-propagates the initial disease score, which reflects the gene's functional similarity disease score and prior disease score, to discover new disease gene candidates.

In FIG. 4, red denotes the test set selected from the known disease genes, green denotes the remaining known disease genes excluding the test set, blue denotes the disease gene candidates selected in each analysis cycle, and black denotes all other genes.

The test set consists of 11 genes that are actual drug targets in rheumatoid arthritis. The present invention is run with the remaining known disease genes as input, and the change in rank of these 11 genes is observed.

The analysis shows that the scores of the 11 test genes known as drug targets rise after network propagation in each analysis cycle. In particular, by the fifth analysis cycle the scores of the test genes become as high as those of the previously known disease genes. In addition, 10 of the 11 test genes rank within the top 350 of roughly 15,000 genes, and 6 of them rank within the top 100. These results confirm that the apparatus of the present invention, through inference over repeated cycles, accumulates and reflects the indirect associations between the test genes and the existing disease genes, which makes it possible to discover actual drug targets whose direct disease significance information was withheld.

FIG. 5 and FIG. 6 are each graphs showing the results of evaluating the performance of the present invention.

To evaluate the performance of the present invention and to explain how it differs from existing methods, a leave-one-out test on known drug targets is performed for methods modularized according to their working principle.

First, known drug targets and disease gene information related to rheumatoid arthritis, cancer, and type 2 diabetes are collected from DrugCentral, a drug target database, and DisGeNet, a database of known disease genes. After a sample is drawn from the collected drug targets and disease genes, a leave-one-out test is performed in which one test gene is held out and the remaining genes are used to predict the disease significance of the test gene. Performance is analyzed by expressing the leave-one-out test results for all known disease genes or drug targets as ROC curves.
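The leave-one-out evaluation can be sketched as follows. The text does not say how the held-out results are combined into an ROC curve, so this sketch makes an assumption: for each held-out gene it computes the fraction of background (non-disease) genes that the held-out gene outranks, and it reports the mean of those fractions as an AUC-like summary. The callable `score_fn` is a hypothetical stand-in for any prediction method under test.

```python
def leave_one_out_auc(known_genes, all_genes, score_fn):
    """Mean fraction of background genes outranked by each held-out gene.

    score_fn(seed_genes) -> {gene: score} for every gene in all_genes.
    Ties with a background gene count as half a win.
    """
    known = set(known_genes)
    background = [g for g in all_genes if g not in known]
    fractions = []
    for held_out in known:
        scores = score_fn(known - {held_out})  # held-out gene is not a seed
        target = scores[held_out]
        wins = sum(
            1.0 if target > scores[b] else 0.5 if target == scores[b] else 0.0
            for b in background
        )
        fractions.append(wins / len(background))
    return sum(fractions) / len(fractions)


# Toy predictor: score a gene by how many of its neighbors are seed genes.
adjacency = {
    "A": {"B", "C", "X"}, "B": {"A", "C"}, "C": {"A", "B"},
    "X": {"A"}, "Y": set(),
}


def toy_score(seed_genes):
    return {g: len(neighbors & seed_genes) for g, neighbors in adjacency.items()}


auc = leave_one_out_auc({"A", "B", "C"}, list(adjacency), toy_score)
```

In the toy network the three known genes form a triangle, so each held-out gene still has two seed neighbors and outranks both background genes.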

Referring to FIG. 5, the functional group analysis-based prediction method and the network propagation-based prediction method were each run alone and compared with the method that links the two. For all three diseases, the linked method developed in the present invention shows superior performance.

In particular, for cancer, which has the most existing disease gene information, a comparison of the AUCs of the ROC curves shows a highly significant performance improvement, with a p-value of 0.05 or less.

FIG. 6 shows the results of a performance comparison with the existing methods ToppGene, Random Walk with Restart, and PRINCE. These methods represent, in order, a functional similarity-based method, a network propagation-based method, and an existing method that integrates the two. The comparison shows that the present invention has the best disease gene prediction ability, with a highly significant performance improvement (p-value of 0.05 or less) particularly in cancer.

According to the embodiments, the method of the present invention performs the functional group analysis and the network propagation analysis in a mutually linked computation, and this is confirmed to improve performance over existing methods that statistically integrate the results of the two analyses. It is also confirmed that repeatedly feeding the disease association information obtained from the linked computation back into the computation contributes to the performance improvement.

In addition, according to the embodiments, the invention can accelerate the combination of IT-based hardware, such as healthcare-related medical devices and web-based health management services, with BT-based biomarker and drug target content related to molecular biological information such as genomics, proteomics, and epigenomics, thereby contributing to the development of the healthcare and pharmaceutical industries.

The embodiments of the present invention described above are not implemented only through the apparatus and method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

A method of operating a disease gene discovery apparatus operated by at least one processor, the method comprising:
constructing a network in which genes and functions are connected by combining gene-gene relation information and information on functional groups containing genes; and
repeating an analysis cycle that selects new disease gene candidates in the network,
wherein the analysis cycle comprises:
calculating a disease significance score of each functional group based on the statistical significance of the disease genes included in that functional group;
calculating, in the network, a functional similarity disease score of each gene by summing the disease significance scores of the functional groups with which the gene is associated;
network-propagating an initial disease score of each gene that reflects the functional similarity disease score of the gene;
selecting a new disease gene candidate based on the disease scores of the genes calculated through the network propagation; and
adding the new disease gene candidate to the disease gene information used in the next analysis cycle.

The method of claim 1,
wherein the disease significance score of the functional groups associated with a new disease gene candidate varies in the next cycle according to the new disease gene candidate selected in the current cycle.

The method of claim 1,
wherein the analysis cycle further comprises
scoring, with a prior disease score, the disease significance of each gene in the network that is a known disease gene or a disease gene candidate selected in a previous analysis cycle, and
wherein the initial disease score is
assigned as the weighted sum of the functional similarity disease score and the prior disease score of each gene.

The method of claim 3,
wherein the scoring with the prior disease score
assigns different prior disease scores to the known disease gene and to the disease gene candidate selected in a previous analysis cycle.

The method of claim 1,
wherein the calculating of the disease significance score of each functional group
calculates the disease significance based on the ratio of disease genes to ordinary genes included in each functional group.

The method of claim 1,
further comprising
terminating the analysis cycle without repetition if no gene has a significant disease score after the network propagation.

The method of claim 1,
wherein the network comprises
a gene network that combines gene-gene relation information contained in a plurality of public databases, and a gene-function network in which the genes related to each functional group are connected, and
wherein at least one known disease gene is associated with at least one functional group.

A method of operating a disease gene discovery apparatus operated by at least one processor, the method comprising:
calculating, for each functional group to which genes are related, a disease significance score of the functional group based on the statistical significance of the disease genes associated with the functional group;
calculating a functional similarity disease score of each gene by summing the disease significance scores of the functional groups to which the gene is related, in a gene network constructed from gene-gene relation information;
assigning, to each gene in the gene network that is a known disease gene or a new disease gene candidate, a prior disease score that scores the disease significance of the gene;
assigning, to each gene in the gene network, the weighted sum of its functional similarity disease score and its prior disease score as the initial disease score of the gene; and
selecting a new disease gene candidate based on the network propagation disease score of each gene, calculated by network-propagating the initial disease score of each gene in the gene network.

The method of claim 8,
wherein the calculating of the functional similarity disease score of each gene comprises
assigning the disease significance score of each functional group to its associated genes, and calculating the functional similarity disease score by normalizing the log sum of the disease significance p-values assigned to each gene.

The method of claim 8,
wherein the assigning of the prior disease score
assigns different prior disease scores to the known disease gene and to the new disease gene candidate.

The method of claim 8,
wherein the calculating of the disease significance score comprises,
when a gene related to a specific functional group was selected as a new disease gene candidate in a previous analysis cycle, calculating the disease significance score of the specific functional group by reflecting the statistical significance of the new disease gene candidate.

The method of claim 11,
wherein, when the disease significance score of the specific functional group varies, the functional similarity disease scores of the genes related to the specific functional group vary.

The method of claim 11,
wherein, for the gene selected as a new disease gene candidate in the previous analysis cycle, the current analysis cycle assigns a functional similarity disease score and a prior disease score different from those calculated in the previous analysis cycle, and propagates an initial disease score different from the initial disease score of the previous analysis cycle.

The method of claim 8,
further comprising adding the new disease gene candidate to the disease gene information used in the next analysis cycle.

The method of claim 8,
wherein the selecting of the new disease gene candidate
further comprises
terminating the discovery of new disease genes when no gene in the gene network has a significant network propagation disease score after the network propagation.