KR101839572B1

KR101839572B1 - Apparatus Analyzing Disease-related Genes and Method thereof

Info

Publication number: KR101839572B1
Application number: KR1020170155272A
Authority: KR
Inventors: 박상현; 김정우
Original assignee: 연세대학교 산학협력단
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2018-03-16
Anticipated expiration: 2037-11-21

Abstract

The present invention discloses a gene relationship analysis apparatus that builds an integrated gene network and can thereby search for candidate genes related to a specific disease. The apparatus includes: a combination symbol detection unit that uses at least one gene symbol to search for text data containing information on genes and detects, in the retrieved text data, gene combination symbols in which gene symbols are combined; an integrated network construction unit that, based on the detected gene combination symbols, builds local gene networks in which gene symbols are nodes and relationships between genes are edges, and merges the local gene networks into an integrated gene network; and an analysis unit that uses the integrated gene network to determine a priority of gene symbols related to the disease to be analyzed and analyzes the relationships among the disease-related genes according to the determined priority.

Description

Apparatus Analyzing Disease-related Genes and Method thereof

The present invention relates to an apparatus for analyzing gene relationships. More specifically, it relates to an apparatus that searches for genes related to a disease and analyzes the relationships among the genes found.

As extensive research on genes and diseases has been carried out, the results have accumulated in a large body of documents. However, because research data in genetics are vast and the relevant biological literature is scattered, it is difficult to analyze the right biological information at the right time and apply it to gene research.

Biological literature is produced as the result of numerous biological experiments, and those results are stored in databases such as PubMed and OMIM. Such databases are useful to researchers conducting new experiments, but the data are so voluminous and scattered that researchers cannot easily reach the material they actually need.

Accordingly, research has been conducted on analyzing documents with text mining techniques to identify the interrelationships of disease-related genes and proteins. Existing text mining approaches include methods that make use of synonyms and methods that make use of the verbs in a sentence.

Gene prioritization is an important element of text mining given the importance of genes in identifying diseases, and techniques that infer potential disease-related genes from gene phenotypes or from similarities between proteins are also under development.

However, conventional techniques search for disease-related genes using only a limited set of databases tied to a specific disease, so information on useful genes may be missed, and the literature collected to build the network may have low relevance to the disease.

There is therefore a need for a technique that analyzes the vast biological literature to determine the relationships among genes that are highly relevant to a disease.

Korean Patent Laid-Open Publication No. 10-2009-0007972

The present invention has been made to solve the above problems and discloses an apparatus and a method for analyzing gene relationships. In particular, it discloses an apparatus that builds a gene network based on genes already known to be related to a disease and analyzes the disease-related gene relationships.

To achieve the above object, the gene relationship analysis apparatus of the present invention includes: a combination symbol detection unit that uses at least one gene symbol to search for text data containing information on genes and detects, in the retrieved text data, gene combination symbols in which gene symbols are combined; an integrated network construction unit that, based on the detected gene combination symbols, builds an integrated gene network in which gene symbols are nodes and relationships between genes are edges; and an analysis unit that uses the constructed gene network to determine a priority of gene symbols related to the disease to be analyzed and analyzes the relationships among the disease-related genes according to the determined priority.

In the present invention, the gene relationship analysis apparatus may further include a seed gene symbol acquisition unit that acquires seed gene symbols, that is, gene symbols already known to be related to the disease to be analyzed. The combination symbol detection unit may search the text data using the acquired seed gene symbols and detect the gene combination symbols in the retrieved text data.

In the present invention, the combination symbol detection unit may further include a symbol extraction unit that divides the retrieved text data into sentences and analyzes each sentence to extract gene symbols from it. When two or more gene symbols are extracted from one sentence, the extracted gene symbols may be detected as a combination symbol.

In the present invention, the integrated network construction unit may further include a local network construction unit that builds a local gene network whose nodes are the gene symbols contained in the gene combination symbols detected using a seed gene symbol. The integrated gene network may be built from the local gene networks so constructed.

In the present invention, the integrated network construction unit may include a network integration unit that merges the constructed local gene networks according to a predetermined method, taking into account the kinds of gene symbols contained in the gene combination symbols or the frequency with which each gene combination symbol is detected. The integrated gene network may be built from the merged local gene networks.

In the present invention, merging according to the predetermined method may be arranged so that, when the gene symbols contained in two gene combination symbols match, each gene symbol of the combination symbol is taken as a node and the weight of the edge connecting those nodes is increased.

In the present invention, merging according to the predetermined method may be arranged so that, when one gene symbol contained in the gene combination symbols matches, the matching gene symbol is taken as a common node through which the gene combination symbols are connected.

In the present invention, merging according to the predetermined method may be arranged so that, when the gene symbols contained in the gene combination symbols do not match, the non-matching gene symbols are taken as separate nodes and the gene combination symbols are not connected to each other.
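The three merging cases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `merge_local_networks`, the representation of a network as a list of gene-symbol pairs, and the choice to increase an edge weight by exactly 1 per repeated pair are all assumptions made here.

```python
from collections import Counter

def merge_local_networks(local_networks):
    """Merge local networks, each given as an iterable of (gene_a, gene_b) pairs.

    Assumption: an edge weight grows by 1 each time the same pair is seen.
    """
    integrated = Counter()
    for network in local_networks:
        for a, b in network:
            if a == b:
                continue  # ignore self-pairs
            edge = tuple(sorted((a, b)))  # undirected edge, order-independent key
            integrated[edge] += 1
    return integrated

# Hypothetical toy networks using placeholder symbols a to e.
network_1 = [("a", "b")]
network_2 = [("a", "b"), ("b", "c")]
network_3 = [("d", "e")]

integrated = merge_local_networks([network_1, network_2, network_3])
nodes = {gene for edge in integrated for gene in edge}
```

Here the pair a-b occurs in two networks, so its edge weight becomes 2; b is shared between a-b and b-c and therefore acts as the common node; d-e shares no symbol with the others and stays a separate component.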

In the present invention, the analysis unit may further include: a priority determination unit that determines, based on the constructed integrated gene network, a priority of gene symbols related to the disease to be analyzed; and a verification unit that arranges the disease-related gene symbols according to the determined priority and verifies the arranged gene symbols against verification data related to the disease. The relationships among the disease-related genes may be analyzed using the verified gene symbols.

In the present invention, the priority determination unit may further include: a first reference value calculation unit that calculates, based on the constructed integrated gene network, a first reference value relating to the number of other nodes connected to a given node; and a second reference value calculation unit that calculates a second reference value relating to the weights of the edges connecting the given node to the other nodes. The priority of the disease-related gene symbols may be determined using the calculated first and second reference values.

In the present invention, the priority determination unit may further include a third reference value calculation unit that calculates a third reference value relating to how often the given node appears on the shortest paths connecting any two nodes of the integrated gene network. The priority of the disease-related gene symbols may be determined using the calculated third reference value.
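The three reference values described above resemble the graph measures usually called degree, strength (sum of incident edge weights) and betweenness centrality. The sketch below computes them for a small weighted, undirected network. It is an illustration under assumptions: the text does not give exact formulas, the function names are invented here, and betweenness is computed with Brandes' algorithm on hop-count shortest paths, which may differ from the method actually used.

```python
from collections import deque

def build_adjacency(weighted_edges):
    """Turn {(a, b): weight} into a symmetric adjacency mapping."""
    adj = {}
    for (a, b), w in weighted_edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj

def degree(adj):
    """First reference value (assumed): number of neighbouring nodes."""
    return {n: len(nb) for n, nb in adj.items()}

def strength(adj):
    """Second reference value (assumed): sum of incident edge weights."""
    return {n: sum(nb.values()) for n, nb in adj.items()}

def betweenness(adj):
    """Third reference value (assumed): Brandes' betweenness, unweighted paths."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        stack = []
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}   # number of shortest paths from s
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        while stack:                  # accumulate dependencies in reverse order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair is visited from both endpoints, so halve the totals.
    return {n: c / 2 for n, c in score.items()}

# Hypothetical toy network: a-b with weight 2, b-c with weight 1.
adj = build_adjacency({("a", "b"): 2, ("b", "c"): 1})
deg, stg, btw = degree(adj), strength(adj), betweenness(adj)
```

In this toy network, node b has the highest value on all three measures, because it has two neighbours, carries the heaviest total edge weight, and lies on the only shortest path between a and c. How the three values are combined into a single priority is not specified in the passage above, so no combination rule is shown.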

Also to achieve the above object, the gene relationship analysis method of the present invention includes the steps of: using at least one gene symbol to search for text data containing information on genes and detecting, in the retrieved text data, gene combination symbols in which gene symbols are combined; building, based on the detected gene combination symbols, an integrated gene network in which gene symbols are nodes and relationships between genes are edges; and using the constructed gene network to determine a priority of gene symbols related to the disease to be analyzed and analyzing the relationships among the disease-related genes according to the determined priority.

In the present invention, the gene relationship analysis method may further include a step of acquiring seed gene symbols, that is, gene symbols already known to be related to the disease to be analyzed. The text data may be searched using the acquired seed gene symbols, and the gene combination symbols may be detected in the retrieved text data.

In the present invention, the detecting step may further include a step of dividing the retrieved text data into sentences and analyzing each sentence to extract gene symbols from it. When two or more gene symbols are extracted from one sentence, the extracted gene symbols may be detected as a combination symbol.

In the present invention, the step of building the integrated gene network may further include a step of building a local gene network whose nodes are the gene symbols contained in the gene combination symbols detected using a seed gene symbol. The integrated gene network may be built from the local gene networks so constructed.

In the present invention, the step of building the integrated gene network may further include a step of merging the constructed local gene networks according to a predetermined method, taking into account the kinds of gene symbols contained in the gene combination symbols or the frequency with which each gene combination symbol is detected. The integrated gene network may be built from the merged local gene networks.

In the present invention, merging according to the predetermined method may be arranged so that, when all gene symbols contained in two gene combination symbols match, each gene symbol of the combination symbol is taken as a node and the weight of the edge connecting those nodes is increased, and, when one gene symbol contained in the gene combination symbols matches, the matching gene symbol is taken as a common node through which the gene combination symbols are connected.

In the present invention, merging according to the predetermined method may be arranged so that, when the gene symbols contained in the gene combination symbols do not match, the non-matching gene symbols are taken as separate nodes and the gene combination symbols are not connected to each other.

In the present invention, the analyzing step may further include the steps of: determining a priority of gene symbols related to the disease to be analyzed, based on the constructed integrated gene network and taking into account the number of other nodes connected to a given node, the weights of the edges connecting the given node to the other nodes, and how often the given node appears on the shortest paths connecting any two nodes of the integrated gene network; and arranging the disease-related gene symbols according to the determined priority and verifying the arranged gene symbols against verification data related to the disease. The relationships among the disease-related genes may be analyzed using the verified gene symbols.

The present invention also discloses a computer program stored on a computer-readable recording medium for causing a computer to execute the gene relationship analysis method described above.

According to the present invention, the relationships among genes related to a disease can be analyzed.

In particular, gene networks can be integrated and verified candidate genes can be derived.

FIG. 1 is a block diagram of a gene relationship analysis apparatus according to an embodiment of the present invention.
FIG. 2 is an enlarged block diagram of the combination symbol detection unit in the embodiment of FIG. 1.
FIG. 3 is an enlarged block diagram of the integrated network construction unit in the embodiment of FIG. 1.
FIG. 4 is an exemplary diagram showing the process of building the integrated network in the embodiment of FIG. 1.
FIG. 5 is an exemplary diagram showing the integrated gene network built by the integrated network construction unit in the embodiment of FIG. 1.
FIG. 6 is an exemplary diagram showing a scaled integrated gene network.
FIG. 7 is an enlarged block diagram of the analysis unit in the embodiment of FIG. 1.
FIG. 8 is an enlarged block diagram of the priority determination unit in the embodiment of FIG. 7.
FIG. 9 is an exemplary diagram showing disease-related genes listed in order of gene symbol priority.
FIG. 10 is an exemplary diagram showing text detected by the verification unit in the embodiment of FIG. 7 in order to verify candidate gene symbols.
FIG. 11 is a flowchart of a gene relationship analysis method according to an embodiment of the present invention.
FIG. 12 is an enlarged flowchart of the detecting step in the embodiment of FIG. 11.
FIG. 13 is an enlarged flowchart of the step of building the integrated gene network in the embodiment of FIG. 11.
FIG. 14 is an enlarged flowchart of the analyzing step in the embodiment of FIG. 11.
FIG. 15 is a reference diagram showing the flow of a gene relationship analysis method according to another embodiment of the present invention.
FIG. 16 is a reference diagram showing the gene analysis performance of the present invention compared with other gene analysis methods.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

In the description with reference to the accompanying drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

In describing the present invention, detailed descriptions of related known configurations or functions may be omitted where they would obscure the gist of the invention.

The terms used in this application are used only to describe particular embodiments and are not intended to be limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. Each step described below may be provided as one or more software modules, may be implemented as hardware responsible for each function, or may take a form that combines software and hardware.

The specific meaning of each term, with examples, is explained below in the order of the drawings.

The configuration of a gene relationship analysis apparatus according to an embodiment of the present invention is described in detail below with reference to the related drawings.

FIG. 1 is a block diagram of a gene relationship analysis apparatus (10) according to an embodiment of the present invention.

The gene relationship analysis apparatus (10) includes a seed gene symbol acquisition unit (100), a combination symbol detection unit (200), an integrated network construction unit (300), and an analysis unit (400). For example, the apparatus (10) can analyze the gene relationships associated with a disease of interest by using seed genes, which are genes already known to be related to a specific disease, together with a gene network. Taking the seed genes as input, the apparatus (10) can analyze the disease-related gene relationships, derive candidate genes expected to be related to the disease, and verify the derived candidate genes.

The gene relationship analysis apparatus (10) analyzes biological literature produced as the result of biological experiments, and for this purpose it can use databases containing biological information such as PubMed, OMIM, GHR, KEGG, LuGend, and IGDB.NSCLC. For example, to obtain seed gene symbols for a specific disease, the apparatus (10) can acquire from the OMIM database the gene symbols known to be related to lung cancer and use them as seed gene symbols. It can then use each acquired seed gene symbol to extract from PubMed text data containing information on genes, detect gene combination symbols in the extracted text data to build a gene network, and verify the candidate genes derived from that network using the GHR, KEGG, LuGend, and IGDB.NSCLC databases.

The gene relationship analysis apparatus (10) can use separate databases for acquiring the input values, for acquiring the desired gene symbols from those input values, and for verifying the acquired gene symbols. In the present invention, a gene symbol encompasses any means of identifying a gene, such as a gene code or a gene name.

The seed gene symbol acquisition unit (100) acquires seed gene symbols, that is, gene symbols already known to be related to the disease to be analyzed. For example, the seed gene symbol acquisition unit (100) can acquire seed gene symbols from a database in which biological information, including information on genes already known to be related to the disease, is stored in advance. The database from which the seed gene symbol acquisition unit (100) acquires seed gene symbols and the database that the combination symbol detection unit (200) searches with those seed gene symbols for text data containing information on genes may be different databases.

Table 1 above shows the 32 seed gene symbols acquired by the seed gene symbol acquisition unit (100), that is, 32 gene symbols already known to be related to lung cancer. In Table 1, Literature denotes the number of text data items retrieved using each seed gene symbol. For example, the seed gene symbol acquisition unit (100) acquires seed gene symbols already known to be related to a specific disease from a first database (the OMIM database), and the combination symbol detection unit (200) uses those seed gene symbols to search a second database (the PubMed database) for text data containing information on genes.

The combination symbol detection unit (200) includes a preprocessing unit (220) and a symbol extraction unit (240). The combination symbol detection unit (200) uses at least one gene symbol to search for text data containing information on genes and detects, in the retrieved text data, gene combination symbols in which gene symbols are combined. For example, it can use at least one seed gene symbol as a search keyword to retrieve text data containing information on genes and detect gene combination symbols in the retrieved text data. In the present invention, a gene combination symbol means two or more gene symbols that occur within a single sentence of the retrieved text data.

For example, if the two gene symbols ALK and BCL2 are present in the retrieved text data, the combination symbol detection unit (200) detects ALK-BCL2 as a gene combination symbol. This corresponds to co-occurrence analysis, one of the text mining techniques, and follows the methodology of assuming that gene symbols appearing together in one sentence are related to each other.
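A minimal sketch of sentence-level co-occurrence detection follows. It is illustrative only: the function name `detect_combination_symbols`, the regex-based sentence splitter, and exact whole-token matching against a supplied symbol list are assumptions made here, and the sample text is a made-up sentence rather than a real abstract.

```python
import re
from itertools import combinations

def detect_combination_symbols(text, gene_symbols):
    """Return gene-symbol pairs that occur together in one sentence."""
    known = set(gene_symbols)
    pairs = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(tokens & known)          # symbols present in this sentence
        pairs.extend(combinations(found, 2))    # every pair co-occurs
    return pairs

# Made-up sample text for illustration.
text = "ALK and BCL2 were both mentioned in this sentence. EGFR appears alone here."
pairs = detect_combination_symbols(text, {"ALK", "BCL2", "EGFR"})
```

The first sentence yields the pair ALK-BCL2; the second contains only one symbol and yields nothing. A production system would need a more robust sentence splitter and handling of gene-name synonyms, neither of which is attempted here.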

The preprocessing unit (220) can preprocess the retrieved text data to extract at least part of it. For example, the preprocessing unit (220) can use keywords (MeSH terms) to extract only the abstract text from the retrieved text data. Through the preprocessing performed by the preprocessing unit (220), the gene relationship analysis apparatus (10) can reduce the amount of data to be processed.

The symbol extraction unit (240) divides the retrieved text data into sentences and analyzes each sentence to extract gene symbols from it. For example, when preprocessing has been performed by the preprocessing unit (220), the symbol extraction unit (240) divides the preprocessed text data into sentences and analyzes each sentence to extract the gene symbols it contains. When the symbol extraction unit (240) extracts two or more gene symbols from one sentence, the combination symbol detection unit (200) detects the extracted gene symbols as a gene combination symbol. The description continues with reference to FIG. 3.

The integrated network construction unit (300) includes a local network construction unit (320) and a network integration unit (340). Based on the detected gene combination symbols, the integrated network construction unit (300) builds local gene networks in which gene symbols are nodes and relationships between genes are edges, and merges the local gene networks into an integrated gene network. In the present invention, the integrated gene network is a network containing gene symbols related to a specific disease, and it can be built by taking each gene symbol as a node and connecting the nodes with edges.

The local network construction unit (320) builds local gene networks, one per seed gene symbol, whose nodes are the gene symbols contained in the gene combination symbols detected using that seed gene symbol. In other words, a local gene network is a network built from the gene combination symbols detected in the text data retrieved with an acquired seed gene symbol, with the seed gene symbol as the unit of input. For example, the local network built with the seed gene symbol ALK contains only the network whose nodes are the gene symbols detected in the text data retrieved with ALK. The local network construction unit (320) can thus build 32 local gene networks from 32 seed genes.
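The per-seed construction can be sketched as follows. The data layout is an assumption made for illustration: `pairs_by_seed` is a hypothetical mapping from each seed gene symbol to the gene combination symbols (pairs) detected in the text retrieved with that seed, and each local network keeps a count of how often each pair was detected. The gene pairs in the example are placeholders, not results from the patent.

```python
from collections import Counter

def build_local_networks(pairs_by_seed):
    """Build one local network per seed gene symbol.

    Each network records its node set and a Counter of undirected edges,
    so the detection frequency of each pair is preserved.
    """
    networks = {}
    for seed, pairs in pairs_by_seed.items():
        nodes = set()
        edges = Counter()
        for a, b in pairs:
            nodes.update((a, b))
            edges[tuple(sorted((a, b)))] += 1  # undirected, order-independent key
        networks[seed] = {"nodes": nodes, "edges": edges}
    return networks

# Hypothetical input: pairs detected per seed (illustrative values only).
pairs_by_seed = {
    "ALK": [("ALK", "BCL2"), ("BCL2", "ALK")],
    "EGFR": [("EGFR", "KRAS")],
}
networks = build_local_networks(pairs_by_seed)
```

With this layout, the number of local networks equals the number of seed gene symbols, which matches the statement above that 32 seeds give 32 local networks.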

The network integration unit 340 integrates the constructed local gene networks according to a predetermined method, taking into account the kinds of gene symbols contained in the gene combination symbols or the frequency with which the gene combination symbols were detected. For example, the network integration unit 340 integrates the local gene networks, each built from one seed gene symbol, to generate an integrated gene network. This is described with reference to FIG. 4.

The predetermined integration method of the network integration unit 340 can be described in three cases. First, when the gene symbols contained in the gene combination symbols match (326), the network integration unit 340 can integrate the networks by taking each gene symbol of the combination symbol as a node and increasing the weight of the edge connecting those nodes. As shown in FIG. 4, when all gene symbols of a gene combination symbol contained in network 1 (322) and network 2 (324) match, the integrated network 342 can be generated by increasing the weight of that gene combination symbol.

Second, when only one gene symbol contained in the gene combination symbols matches, the network integration unit 340 can integrate the networks by connecting the gene combination symbols through the matching gene symbol as a common node. As shown in FIG. 4, when only one gene symbol matches between the gene combination symbols of network 1 (322) and network 2 (324) (328, gene symbol b), the networks can be integrated by taking gene symbol b as the common node and connecting the nodes of the remaining gene symbols to it on either side.

Third, when none of the gene symbols contained in the gene combination symbols match (330), the network integration unit 340 can integrate the networks by keeping the non-matching gene symbols as separate nodes and leaving the gene combination symbols unconnected. For example, as shown in FIG. 4, when no gene symbol is shared between network 1 (322) and network 2 (324) (330), the gene combination symbols of network 1 and network 2 are integrated as they are, without modification. This is described with reference to FIGS. 5 and 6.
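The three integration cases above can be illustrated with a single merge over edge-weight maps. This is a sketch under assumptions: each local network is represented as a mapping from an unordered gene pair to a weight, and the names `integrate_networks`, `network_1`, and `network_2` are hypothetical. Summing weights per pair covers the fully matching case (326), a shared node naturally joins edges in the one-symbol case (328), and unrelated pairs remain separate (330).

```python
from collections import Counter


def integrate_networks(local_networks):
    """Merge local networks by summing the weights of identical gene pairs."""
    integrated = Counter()
    for edges in local_networks:
        for pair, weight in edges.items():
            # Sort the pair so (a, b) and (b, a) refer to the same edge.
            integrated[tuple(sorted(pair))] += weight
    return integrated


# Hypothetical local networks using the symbol names a..g from the figure description.
network_1 = Counter({("a", "b"): 1, ("d", "e"): 1})
network_2 = Counter({("a", "b"): 1, ("b", "c"): 1, ("f", "g"): 1})

integrated = integrate_networks([network_1, network_2])
# ("a", "b") appears in both networks, so its weight increases (case 326).
# ("b", "c") attaches to the existing node b, which becomes a common node (case 328).
# ("d", "e") and ("f", "g") share no symbol and stay unconnected (case 330).
```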

The integrated gene network constructed by the integrated network construction unit 300 has a structure in which numerous gene symbols are nodes connected by edges, and a node corresponding to a frequently detected gene symbol may be displayed larger than the other gene symbol nodes. When the integrated gene network contains a large number of gene symbols, individual gene symbols may be difficult to observe; in that case, the integrated gene network can be re-scaled so that genes related to a specific disease are easier to find. This is described with reference to FIG. 7.

The analysis unit 400 includes a priority determination unit 420 and a verification unit 440. For example, the analysis unit 400 uses the constructed integrated gene network to determine the priority of gene symbols related to the disease to be analyzed, and analyzes the relationships between the disease-related genes according to the determined priority. The analysis unit 400 ranks the gene symbols corresponding to the nodes of the integrated gene network according to specific reference values, and analyzes the relationships between the disease-related genes according to that ranking. The analysis unit 400 can also use the integrated gene network to derive candidate genes that are inferred to be related to the disease. This is described with reference to FIG. 8.

The priority determination unit 420 includes a first reference value calculation unit 422, a second reference value calculation unit 424, and a third reference value calculation unit 426. For example, the priority determination unit 420 determines the priority of gene symbols related to the disease to be analyzed based on the constructed integrated gene network. The way the priority determination unit 420 determines priority may vary depending on which reference values are used for ranking.

The first reference value calculation unit 422 calculates, based on the constructed integrated gene network, a first reference value relating to the number of other nodes connected to one node. For example, the first reference value is set higher when a node on the integrated gene network is connected to many other nodes. The second reference value calculation unit 424 calculates a second reference value relating to the weights of the edges connecting the one node to the other nodes. As described above, when all gene symbols contained in gene combination symbols match (326), the network integration unit 340 may increase the weight of the edge connecting the matching gene symbols. The second reference value calculation unit 424 thus calculates a second reference value representing the weights of the edges that connect one node to other nodes on the integrated gene network.
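The first and second reference values can be sketched as a neighbor count and a sum of incident edge weights. The function name `degree_metrics`, the edge weights in the example, and the choice to aggregate the second reference value as a sum are assumptions made for illustration.

```python
from collections import defaultdict


def degree_metrics(edges):
    """Return (degree, weighted_degree) dictionaries for an undirected network.

    `edges` maps an unordered gene pair to its weight; each pair is assumed
    to appear only once in the mapping.
    """
    degree = defaultdict(int)
    weighted_degree = defaultdict(int)
    for (a, b), weight in edges.items():
        for node in (a, b):
            degree[node] += 1             # first reference value: number of neighbors
            weighted_degree[node] += weight  # second reference value: incident edge weight
    return dict(degree), dict(weighted_degree)


# Hypothetical edge weights, chosen only for the example.
edges = {("EGFR", "KRAS"): 3, ("BRAF", "EGFR"): 1}
degree, weighted_degree = degree_metrics(edges)
```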

The third reference value calculation unit 426 calculates a third reference value relating to how frequently the one node appears on the shortest paths connecting any two nodes of the integrated gene network. For example, any two nodes on the integrated gene network have a shortest path connecting them, and an integrated gene path map can be generated by measuring the shortest path between every pair of nodes on the integrated gene network. The third reference value calculation unit 426 can calculate the third reference value by measuring how frequently each node appears in the generated integrated gene path map. This is described with reference to FIG. 9.
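The path-frequency idea above can be sketched with a breadth-first search. Several assumptions are made here: path length is measured in number of edges (edge weights are ignored), exactly one shortest path is taken per node pair, and only interior nodes of a path are counted. The names `shortest_path` and `path_frequency` are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def shortest_path(adjacency, source, target):
    """Return one shortest path (list of nodes) found by BFS, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def path_frequency(adjacency):
    """Count how often each node lies strictly inside a pairwise shortest path."""
    counts = Counter({node: 0 for node in adjacency})
    for source, target in combinations(sorted(adjacency), 2):
        path = shortest_path(adjacency, source, target)
        if path:
            counts.update(path[1:-1])  # skip the two endpoints
    return counts


# Hypothetical chain network A - B - C - D.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
frequency = path_frequency(adjacency)
```

In the chain, B and C each lie inside two shortest paths, while the end nodes A and D lie inside none.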

The priority determination unit 420 uses the calculated first, second, and third reference values to determine the priority of disease-related gene symbols on the integrated gene network, and the analysis unit 400 can arrange the top 20 gene symbols of the integrated gene network according to the determined priority. The priority measures used by the priority determination unit 420 include degree centrality, weighted degree centrality, betweenness centrality, and eigenvector centrality.
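Ranking by any one of these measures reduces to sorting gene symbols by score and keeping the top entries. The sketch below assumes a precomputed score dictionary; the function name `rank_genes`, the alphabetical tie-break, and the score values are hypothetical.

```python
def rank_genes(scores, top_k=20):
    """Return up to `top_k` gene symbols ordered by descending score.

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    return sorted(scores, key=lambda gene: (-scores[gene], gene))[:top_k]


# Hypothetical centrality scores, chosen only for the example.
scores = {"EGFR": 5, "BRAF": 4, "KRAS": 4, "TP53": 1}
top_three = rank_genes(scores, top_k=3)
```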

For example, degree centrality, one of the priority measures used by the priority determination unit 420, reflects the number of other nodes connected to one node on the integrated gene network and can be determined using the first reference value. The analysis unit 400 can list the gene symbols on the integrated gene network by degree centrality; in that case, the EGFR gene symbol is found to have the most connected nodes on the integrated gene network for lung cancer. As shown in FIG. 9, when the analysis unit 400 lists the gene symbols on the integrated gene network by degree centrality, the BRAF gene symbol (516) has the second largest number of connected nodes. Weighted degree centrality, another of the priority measures used by the priority determination unit 420, considers both the number of other nodes connected to one node and the weights of the edges connected to that node, and can be determined using the first and second reference values.

Betweenness centrality, another of the priority measures used by the priority determination unit 420, considers how frequently each node appears in the shortest path map and can be determined using the third reference value. Eigenvector centrality considers the number of other nodes connected to one node on the integrated gene network, the weights of the edges connecting that node to the other nodes, and the number of further nodes connected to those other nodes, and can be determined using the first and second reference values. Eigenvector centrality can be expressed by the following equation.

$$x_i = \frac{1}{\lambda} \sum_{j=1}^{n} A_{ij}\, x_j$$

Here, i and j are integers from 1 to n, λ is a constant, x_i is the eigenvector centrality of node i, which reflects the average of the eigenvector centralities of its neighboring nodes, A_ij is the weight of the edge connecting node i and node j, and x_j is the eigenvector centrality of a neighboring node j connected to node i. Using eigenvector centrality, the priority determination unit 420 can determine priority while also taking into account the influence of the other nodes connected to one node on the integrated gene network.
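Eigenvector centrality as defined above, where a node's score is proportional to the weighted sum of its neighbors' scores, can be approximated by power iteration. The sketch below is an illustration under assumptions: the function name `eigenvector_centrality`, the normalization by the largest score, and the example network are hypothetical, and each step adds the previous score to the weighted neighbor sum (a common shift that keeps the iteration from oscillating on bipartite networks without changing the resulting vector).

```python
def eigenvector_centrality(weights, iterations=200, tol=1e-9):
    """Approximate eigenvector centrality for an undirected weighted network.

    `weights` maps an unordered gene pair to its edge weight A_ij.
    Scores are scaled so that the largest score equals 1.
    """
    nodes = sorted({node for pair in weights for node in pair})
    x = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Start from the previous scores (shifted iteration), then add A_ij * x_j.
        new = dict(x)
        for (a, b), w in weights.items():
            new[a] += w * x[b]
            new[b] += w * x[a]
        norm = max(new.values()) or 1.0
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - x[node]) for node in nodes) < tol:
            return new
        x = new
    return x


# Hypothetical star-shaped network with EGFR at the center.
weights = {("EGFR", "KRAS"): 1, ("BRAF", "EGFR"): 1, ("EGFR", "TP53"): 1}
centrality = eigenvector_centrality(weights)
```

In this star-shaped example the center node receives the highest score and the three outer nodes receive equal scores, which matches the intuition that a node connected to influential neighbors ranks higher.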

The verification unit 440 arranges the disease-related gene symbols according to the determined priority and verifies the arranged gene symbols against verification data related to the disease. For example, the verification unit 440 can verify at least some of the gene symbols on the integrated gene network using databases of gene information other than the database used to search for seed gene symbols and the database from which gene-related text data was extracted using the obtained seed gene symbols as input keywords.

Table 2 above lists the genes identified as related to lung cancer in the verification databases used by the verification unit 440. GHR, KEGG, LuGend, and IGDB.NSCLC are the verification databases, Gene is the number of gene symbols identified as lung-cancer-related in each verification database, and Total is the sum of the gene symbol lists derived from the verification databases.

For example, the verification unit 440 can verify at least some of the gene symbols on the integrated gene network using gene information from the GHR, KEGG, LuGend, and IGDB.NSCLC databases. These databases are distinct from the OMIM database, which is used to search for seed gene symbols already known to be associated with a specific disease, and from the PubMed database, which is used to extract gene-related text data with each of the obtained seed gene symbols.

Referring to FIG. 9, when the gene symbols on the integrated gene network for lung cancer are arranged by the degree centrality determined by the priority determination unit 420, they are ordered as EGFR-BRAF-EGF-KRAS-PIK3CA-ERBB2-PTEN-TP53-CASP8-NARS-PARK2-CDKN2A-APC-AKT1-IRF1-CCND1-STAT3-FAS-MET-MGMT. Among the arranged gene symbols, a gene symbol (502) in a cell hatched from upper right to lower left is a seed gene symbol used as an input keyword, and a gene symbol (510) in a cell that is additionally hatched from upper left to lower right is a gene symbol verified with the verification data. A gene symbol in an unhatched cell is a candidate gene symbol, that is, a gene symbol that is neither a seed gene symbol nor present in the verification databases.
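The three categories described above (seed, verified, candidate) can be assigned with simple set membership checks. This is a sketch with placeholder gene names; the function name `classify_ranked_genes` and the inputs are hypothetical and do not reflect the actual contents of FIG. 9.

```python
def classify_ranked_genes(ranked, seed_symbols, verification_symbols):
    """Label each ranked gene symbol as 'seed', 'verified', or 'candidate'."""
    labels = {}
    for gene in ranked:
        if gene in seed_symbols:
            labels[gene] = "seed"        # used as an input keyword
        elif gene in verification_symbols:
            labels[gene] = "verified"    # found in a verification database
        else:
            labels[gene] = "candidate"   # neither seed nor in a verification database
    return labels


# Placeholder symbols, not taken from the patent's figures.
ranked = ["GENE_A", "GENE_B", "GENE_C"]
labels = classify_ranked_genes(
    ranked,
    seed_symbols={"GENE_A"},
    verification_symbols={"GENE_A", "GENE_B"},
)
```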

For example, the targets verified by the verification unit 440 may be the candidate genes among the gene symbols arranged according to the priority determined by the priority determination unit 420. To confirm the disease relevance of a candidate gene, that is, a gene symbol that is neither a seed gene symbol nor present in the verification databases, the verification unit 440 further includes a verification text detection unit that detects sentences describing diseases, experimental results, and the like associated with the candidate gene, and can verify the disease relevance of the candidate gene based on the detected verification texts.

FIG. 2 is an enlarged block diagram of the combination symbol detection unit 200 in the embodiment of FIG. 1.

The combination symbol detection unit 200 includes a preprocessing unit 220 and a symbol extraction unit 240. For example, the combination symbol detection unit 200 searches for text data containing gene information using at least one gene symbol, and detects gene combination symbols, in which gene symbols are combined, in the retrieved text data. The combination symbol detection unit 200 detects two or more gene symbols occurring in one sentence as a combination symbol. The specific method by which the combination symbol detection unit 200 detects gene symbols is as described above and is therefore omitted.

FIG. 3 is an enlarged block diagram of the integrated network construction unit 300 in the embodiment of FIG. 1.

The integrated network construction unit 300 includes a local network construction unit 320 and a network integration unit 340. For example, based on the detected gene combination symbols, the integrated network construction unit 300 constructs an integrated gene network that represents relationships between genes by taking gene symbols as nodes and connecting the nodes with edges. The specific method of constructing the integrated gene network is as described above and is therefore omitted.

FIG. 4 is an exemplary diagram illustrating the process of constructing an integrated network in the embodiment of FIG. 1.

The method by which the network integration unit 340 combines the local gene networks, each built from the gene combination symbols detected in the text data retrieved for one seed gene, can be broadly described in three cases. The integration method that the network integration unit 340 uses to combine local gene networks into an integrated gene network can also be used by the local network construction unit 320 when it builds a local gene network from the gene combination symbols detected in the text data retrieved for each seed gene. The method of combining network 1 (322) and network 2 (324) into the integrated network 342 is as described above and is therefore omitted.

FIG. 5 is an exemplary diagram illustrating an integrated gene network constructed by the integrated network construction unit in the embodiment of FIG. 1.

The integrated gene network constructed by the integrated network construction unit 300 has gene symbols as nodes and includes edges that connect the nodes with specific weights. The gene relationship analysis apparatus 10 can use the constructed integrated gene network to obtain the gene symbols ranked highest according to a priority measure, and can analyze the relationships between gene symbols related to the disease to be analyzed. In one embodiment, the integrated gene network constructed in the present invention includes 1339 nodes and 4092 edges, and when the gene symbols are ranked by degree centrality, EGFR, BRAF, and KRAS rank highest and may be displayed as larger nodes than the other gene symbols.

FIG. 6 is an exemplary diagram illustrating a re-scaled integrated gene network.

When the integrated gene network is built from too many gene symbols, it becomes excessively complex and obtaining disease-related candidate genes may be difficult. The integrated network construction unit 300 can therefore re-scale the edge weights of the integrated gene network and rebuild the network. Referring to FIG. 6, the re-scaled integrated gene network includes 60 nodes and 100 edges.
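The text does not specify how the re-scaling is carried out. One possible reading, shown here purely as an assumption, is to keep only edges whose weight reaches a threshold and to drop nodes that are left without edges. The function name `rescale_network`, the `min_weight` parameter, and the example weights are hypothetical.

```python
def rescale_network(edges, min_weight):
    """Keep edges with weight >= min_weight; return (kept_edges, remaining_nodes).

    Assumption: re-scaling is modeled as weight thresholding, which is only
    one way to reduce a dense network to its strongest relationships.
    """
    kept = {pair: w for pair, w in edges.items() if w >= min_weight}
    nodes = {gene for pair in kept for gene in pair}
    return kept, nodes


# Hypothetical edge weights, chosen only for the example.
edges = {("EGFR", "KRAS"): 5, ("BRAF", "EGFR"): 3, ("MGMT", "TP53"): 1}
kept, nodes = rescale_network(edges, min_weight=3)
```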

FIG. 7 is an enlarged block diagram of the analysis unit in the embodiment of FIG. 1.

The analysis unit 400 includes a priority determination unit 420 and a verification unit 440. For example, the analysis unit 400 uses the constructed integrated gene network to determine the priority of gene symbols related to the disease to be analyzed, and analyzes the relationships between the disease-related genes according to the determined priority. The method by which the analysis unit 400 analyzes disease-related gene relationships using the integrated gene network is as described above and is therefore omitted.

FIG. 8 is an enlarged block diagram of the priority determination unit in the embodiment of FIG. 7.

The priority determination unit 420 includes a first reference value calculation unit 422, a second reference value calculation unit 424, and a third reference value calculation unit 426. For example, the priority determination unit 420 determines the priority of gene symbols related to the disease to be analyzed based on the constructed integrated gene network. The specific method by which the priority determination unit 420 determines priority using the calculated first, second, and third reference values is as described above and is therefore omitted.

FIG. 9 is an exemplary diagram showing disease-related genes listed according to gene symbol priority. This is described with reference to FIG. 10.

The verification unit 440 arranges the disease-related gene symbols according to the determined priority and verifies the arranged gene symbols against verification data related to the disease.

For example, when the gene symbols on the integrated gene network for lung cancer constructed by the integrated network construction unit 300 are arranged by degree centrality, they may be ordered as EGFR-BRAF-EGF-KRAS-PIK3CA-ERBB2-PTEN-TP53-CASP8-NARS-PARK2-CDKN2A-APC-AKT1-IRF1-CCND1-STAT3-FAS-MET-MGMT. As shown in FIG. 9, a seed gene symbol 502 is a gene symbol used as an input keyword to construct the integrated gene network for lung cancer and is located in a cell hatched from upper right to lower left, and a verified gene symbol 504 is a lung-cancer-related gene symbol present in the verification databases. Such a symbol was not present in the database used to build the list of lung-cancer-related gene symbols shown in FIG. 9 (for example, PubMed) but is present in the verification databases (GHR, KEGG, LuGend, and IGDB.NSCLC) as a lung-cancer-related gene symbol, so it is a reliable gene symbol in terms of disease relevance. A gene symbol in an unhatched cell, by contrast, is a candidate gene symbol 506, that is, a gene symbol that is neither a seed gene symbol nor present in the verification databases.

An analysis of the chart shown in FIG. 9 by the type of priority measure determined by the priority determination unit 420 is as follows.

In Table 3 above, Number of seed genes is the number of seed genes used as input keywords to construct the integrated gene network for lung cancer; Number of inferred genes is the number of candidate genes that are neither verified genes nor seed genes; Percentage of inferred genes is the proportion of candidate genes among the 20 ranked genes; Number of confirmed inferred genes is the number of verified gene symbols; Percentage of confirmed inferred genes is the ratio of verified gene symbols to candidate genes; and Percentage of confirmed genes is the share of the 20 ranked genes accounted for by seed genes and verified gene symbols.

FIG. 10 is an exemplary diagram showing text detected by the verification unit to verify a candidate gene symbol in the embodiment of FIG. 7.

To verify the disease relevance of a candidate gene symbol, the verification unit 440 may include a verification text detection unit that detects sentences describing diseases, experimental results, and the like associated with the candidate gene, and can verify the disease relevance of the candidate gene based on the detected verification texts. For example, the verification text detection unit can detect sentences describing diseases and experimental results in the PubMed database, the same database from which text data was retrieved with the seed genes as input keywords to construct the integrated gene network for lung cancer.

For example, the verification text detection unit can detect a text such as "Hypermethylation of the APC gene promoter in plasma is a potential diagnostic marker for lung cancer diagnosis", which is part of a study suggesting that the APC gene symbol can serve as a potential diagnostic marker for lung cancer, and use it as evidence of the candidate gene's disease relevance.
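The detection of verification text can be sketched as a search for sentences that mention both the candidate gene symbol and a disease term. This is a simplified illustration: the function name `find_evidence_sentences`, the sentence-splitting rule, and the keyword matching are assumptions, and the second sentence of the sample text is invented filler.

```python
import re


def find_evidence_sentences(text, gene, disease_terms):
    """Return sentences that contain the gene symbol and at least one disease term."""
    hits = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        has_gene = re.search(rf"\b{re.escape(gene)}\b", sentence) is not None
        has_disease = any(term.lower() in sentence.lower() for term in disease_terms)
        if has_gene and has_disease:
            hits.append(sentence)
    return hits


text = (
    "Hypermethylation of the APC gene promoter in plasma is a potential "
    "diagnostic marker for lung cancer diagnosis. "
    "This second sentence is filler and mentions neither topic."
)
evidence = find_evidence_sentences(text, gene="APC", disease_terms=["lung cancer"])
```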

FIG. 11 is a flowchart of a gene relationship analysis method according to an embodiment of the present invention.

In S100, the seed gene symbol acquisition unit 100 obtains seed gene symbols, that is, gene symbols already known to be related to the disease to be analyzed. The specific method by which the seed gene symbol acquisition unit 100 obtains gene symbols already known to be associated with the disease is as described above and is therefore omitted.

In S200, the combination symbol detection unit 200 searches for text data containing gene information using at least one gene symbol, and detects gene combination symbols, in which gene symbols are combined, in the retrieved text data. For example, the combination symbol detection unit 200 can search for text data containing gene information using a seed gene symbol and detect gene combination symbols in the retrieved text data. As described above, the combination symbol detection unit 200 can detect gene combination symbols using different databases, so a further description is omitted.

In S300, based on the detected gene combination symbols, the integrated network construction unit 300 constructs local gene networks in which gene symbols are nodes and relationships between genes are edges, and integrates the constructed local gene networks into an integrated gene network. The specific method by which the integrated network construction unit 300 integrates the networks is as described above and is therefore omitted.

In S400, the analysis unit 400 uses the constructed gene network to determine the priority of gene symbols related to the disease to be analyzed, and analyzes the relationships between the disease-related genes according to the determined priority. The method by which the analysis unit 400 determines priority, and the specific method of using the verification databases to check at least one of the gene symbols arranged according to the determined priority, are as described above and are therefore omitted.

FIG. 12 is a detailed flowchart of the detecting step in the embodiment of FIG. 11.

In step S220, the preprocessing unit 220 may preprocess the retrieved text data to extract at least part of it. For example, as described above, the preprocessing unit 220 may use keywords (MeSH terms) to extract only the abstract text from the retrieved text data.

In step S240, the symbol extraction unit 240 divides the retrieved text data into sentences and analyzes the divided sentences to extract gene symbols from them. The symbol extraction unit may extract the gene symbols within each sentence by using a preset gene symbol list.
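The sentence-level extraction of step S240 can be sketched as follows. This is a minimal illustration, not the patented implementation: the symbol list `GENE_SYMBOLS`, the regex-based sentence splitter, and all function names are assumptions introduced here. The pairing rule (two gene symbols in the same sentence form a gene combination symbol) follows the description of S240 given for the embodiment of FIG. 15.

```python
import re
from itertools import combinations

# Hypothetical preset gene symbol list; the specification does not enumerate one.
GENE_SYMBOLS = {"EGFR", "KRAS", "TP53", "ALK"}


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_symbols(sentence, symbols=GENE_SYMBOLS):
    """Return the sorted listed gene symbols that occur as whole tokens in the sentence."""
    tokens = set(re.findall(r"[A-Za-z0-9\-]+", sentence))
    return sorted(tokens & symbols)


def detect_combination_symbols(text, symbols=GENE_SYMBOLS):
    """Emit one gene pair for every two listed symbols that share a sentence."""
    pairs = []
    for sentence in split_sentences(text):
        found = extract_symbols(sentence, symbols)
        pairs.extend(combinations(found, 2))
    return pairs


abstract = (
    "EGFR mutations co-occur with TP53 loss. "
    "KRAS was not altered. "
    "ALK and EGFR were tested."
)
pairs = detect_combination_symbols(abstract)
print(pairs)
```

A sentence containing only one listed symbol, such as the second sentence of the sample text, contributes no pair.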

FIG. 13 is a detailed flowchart of the step of constructing the integrated gene network in the embodiment of FIG. 11.

In step S320, the local network construction unit 320 constructs a local gene network for each seed gene symbol, using as nodes the gene symbols included in the gene combination symbols detected with that seed gene symbol. The local network construction unit 320 may construct a separate local gene network for each seed gene symbol; the specific method of constructing a local network is as described above.

In step S340, the network integration unit 340 integrates the constructed local gene networks according to a predetermined method, taking into account the types of gene symbols included in the gene combination symbols or the frequency with which the gene combination symbols are detected. For example, under the predetermined method, when all of the gene symbols included in the gene combination symbols match, the network integration unit 340 may treat each gene symbol of the combination symbol as a node and increase the weight of the edge connecting those nodes; when only one gene symbol included in the gene combination symbols matches, it may join the gene combination symbols with the matching gene symbol as a common node. In another embodiment, when the gene symbols included in the gene combination symbols do not match, the non-matching gene symbols may be kept as separate nodes and the gene combination symbols left unconnected.
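The three integration cases of step S340 can be illustrated with a small sketch. It assumes that each local gene network is represented as a list of gene pairs and that a repeated pair raises the edge weight by exactly 1; the specification only says the weight is increased, so the increment, the data layout, and the function names are all hypothetical.

```python
from collections import defaultdict


def integrate_local_networks(local_networks):
    """Merge local networks (lists of gene pairs) into one weighted, undirected edge map."""
    weights = defaultdict(int)
    for pairs in local_networks:
        for a, b in pairs:
            edge = tuple(sorted((a, b)))  # undirected: (A, B) and (B, A) are the same edge
            weights[edge] += 1            # both symbols match -> heavier edge (assumed +1)
    return dict(weights)


def adjacency(weights):
    """Build a neighbor map; pairs sharing one symbol meet at that common node."""
    adj = defaultdict(dict)
    for (a, b), w in weights.items():
        adj[a][b] = w
        adj[b][a] = w
    return dict(adj)


# Hypothetical local networks, one per seed gene symbol.
local_egfr = [("EGFR", "TP53"), ("EGFR", "KRAS")]
local_kras = [("KRAS", "EGFR"), ("KRAS", "BRAF")]
local_other = [("MYC", "MAX")]  # shares no symbol with the others, so it stays unconnected

weights = integrate_local_networks([local_egfr, local_kras, local_other])
adj = adjacency(weights)
print(weights)
```

Keying the edge map on the sorted pair handles the first two cases at once: an identical pair lands on the same key and accumulates weight, while pairs sharing a single symbol automatically attach to the same node in the adjacency map.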

FIG. 14 is a detailed flowchart of the analyzing step in the embodiment of FIG. 11.

In step S420, the priority determination unit 420 determines the priority of gene symbols related to the disease to be analyzed on the basis of the constructed integrated gene network, taking into account the number of other nodes connected to a given node, the weights of the edges connecting that node to the other nodes, and the frequency with which that node appears on the shortest paths connecting arbitrary pairs of nodes in the integrated gene network. For example, the priority determination unit 420 may calculate the number of other nodes connected to a node as a first reference value, the weights of the edges connecting the node to the other nodes as a second reference value, and the frequency with which the node appears on the shortest paths between arbitrary pairs of nodes as a third reference value, and may use the calculated first, second, and third reference values to determine priorities that include degree centrality, weighted degree centrality, betweenness centrality, and eigenvector centrality.
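The centrality measures named for step S420 can be computed on a small weighted network as sketched below. This is an illustration under stated assumptions, not the patented procedure: betweenness is computed with Brandes' algorithm on hop-count shortest paths (edge weights ignored), eigenvector centrality uses a shifted power iteration on the weighted adjacency, and the example network and all function names are hypothetical. The specification does not say how the measures are combined into a single priority, so none is attempted here.

```python
from collections import deque


def degree(adj):
    """First reference value: number of other nodes connected to each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def weighted_degree(adj):
    """Second reference value: sum of the weights of the edges at each node."""
    return {v: sum(nbrs.values()) for v, nbrs in adj.items()}


def betweenness(adj):
    """Third reference value: how often a node lies on shortest paths (Brandes, unweighted)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]  # count shortest paths reaching w through v
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                      # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2 for v, c in score.items()}  # undirected: each pair was counted twice


def eigenvector(adj, iterations=100):
    """Power iteration on (A + I); the shift avoids oscillation on bipartite graphs."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: x[v] + sum(w * x[u] for u, w in adj[v].items()) for v in adj}
        norm = sum(val * val for val in nxt.values()) ** 0.5 or 1.0
        x = {v: val / norm for v, val in nxt.items()}
    return x


# Hypothetical integrated network: TP53 - EGFR = KRAS - BRAF (the middle edge has weight 2).
adj = {
    "TP53": {"EGFR": 1},
    "EGFR": {"TP53": 1, "KRAS": 2},
    "KRAS": {"EGFR": 2, "BRAF": 1},
    "BRAF": {"KRAS": 1},
}
deg, wdeg, btw, eig = degree(adj), weighted_degree(adj), betweenness(adj), eigenvector(adj)
print(deg, wdeg, btw)
```

In this example the two inner nodes score highest on every measure, which matches the intuition that well-connected genes on many shortest paths receive higher priority.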

In step S440, the verification unit 440 arranges the gene symbols related to the disease according to the determined priority and verifies the arranged gene symbols against verification data related to the disease. The specific method by which the verification unit 440 verifies the arranged gene symbols against the verification data is as described above and is omitted here.
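The arrange-then-verify flow of step S440 can be sketched roughly as follows. The details of the verification method are given elsewhere in the specification, so everything specific here is an assumption made for illustration: the `top_k` cutoff, the exclusion of seed gene symbols from the candidates, the alphabetical tie-break, and the representation of the verification data as a plain set of symbols.

```python
def verify_ranked_symbols(scores, seed_symbols, verification_symbols, top_k=3):
    """Rank symbols by priority score, then keep the top candidates found in the verification data."""
    # Highest score first; ties are broken alphabetically (an assumption).
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    # Seed symbols are already known, so only non-seed symbols are candidates (an assumption).
    candidates = [g for g in ranked if g not in seed_symbols][:top_k]
    verified = [g for g in candidates if g in verification_symbols]
    return ranked, verified


# Hypothetical priority scores and verification data.
scores = {"EGFR": 3.0, "KRAS": 3.0, "TP53": 1.0, "BRAF": 1.0, "MYC": 0.5}
ranked, verified = verify_ranked_symbols(
    scores,
    seed_symbols={"EGFR"},
    verification_symbols={"KRAS", "MYC"},
    top_k=3,
)
print(ranked, verified)
```

Here `MYC` appears in the verification data but falls outside the top three candidates, so it is not reported as verified.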

FIG. 15 is a reference diagram showing the flow of a gene relationship analysis method according to another embodiment of the present invention.

In step S100, the seed gene symbol acquisition unit 100 may acquire, from the OMIM database 2000, seed gene symbols already known to be gene symbols related to the disease to be analyzed. In step S220, the combination symbol detection unit 200 searches for text data containing gene information using the acquired seed gene symbols. In step S240, if two gene symbols are present in a single sentence of the retrieved text data, the combination symbol detection unit detects the pair of gene symbols as a gene combination symbol.

In step S320, the local network construction unit 320 constructs a local gene network whose nodes are the gene symbols included in the gene combination symbols detected using the seed gene symbol. In step S340, the network integration unit 340 integrates the constructed local gene networks according to a predetermined method, taking into account the types of gene symbols included in the gene combination symbols or the frequency with which the gene combination symbols are detected.

FIG. 16 is a reference diagram showing the gene analysis performance of the present invention in comparison with other gene analysis methods.

When gene symbols related to lung cancer were analyzed using the gene relationship analysis apparatus 10 according to the present invention, more lung-cancer-related genes (seed gene symbols plus verified gene symbols) were derived than with other gene analysis methods, including the SSL method and RWRHN. The gene relationship analysis apparatus 10 identified nine lung-cancer-related genes, whereas the SSL method identified eight and RWRHN identified four. The gene relationship analysis apparatus 10 can be used not only to analyze the relationships among genes related to a specific disease, but also to explore various biological data or drugs related to a specific disease and to analyze associations between drugs and genes.

All or part of the method of the embodiment of the present invention described above may be implemented in the form of a computer-executable recording medium (or computer program product), such as a program module executed by a computer. Here, the computer-readable medium may include computer storage media (for example, memory, a hard disk, magnetic/optical media, or a solid-state drive (SSD)). The computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media.

In addition, all or part of the method according to an embodiment of the present invention may include computer-executable instructions. The computer program includes programmable machine instructions processed by a processor, and may be implemented in a high-level programming language, an object-oriented programming language, assembly language, machine language, or the like.

In this specification, a unit (means) or module may refer to hardware capable of performing the functions and operations associated with each name described herein, to computer program code capable of performing specific functions and operations, or to an electronic recording medium, for example a processor or microprocessor, loaded with computer program code capable of performing specific functions and operations. In other words, a unit (means) or module may refer to a functional and/or structural combination of hardware for carrying out the technical idea of the present invention and/or software for driving that hardware.

Accordingly, the method according to an embodiment of the present invention may be implemented by executing the computer program described above on a computing device. The computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are connected to one another using various buses and may be mounted on a common motherboard or in another suitable manner.

The foregoing description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications, changes, and substitutions without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments or the accompanying drawings. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the rights of the present invention.

Claims

A gene relationship analysis apparatus comprising:
a seed gene symbol acquisition unit for acquiring a seed gene symbol already known as a gene symbol related to a disease to be analyzed;
a combination symbol detection unit for searching for text data including information about genes using the acquired seed gene symbol, and for detecting, in the retrieved text data, a gene combination symbol in which gene symbols are combined;
an integrated network construction unit for constructing, based on the detected gene combination symbol, a local gene network in which the gene symbols are nodes and the relationships between the genes are edges, and for constructing an integrated gene network by integrating the constructed local gene network; and
an analysis unit for determining gene symbol priorities related to the disease to be analyzed using the integrated gene network, and for analyzing a relationship between genes related to the disease according to the determined priorities,
wherein the analysis unit further comprises a priority determination unit for determining the priority of gene symbols related to the disease to be analyzed based on the constructed integrated gene network,
wherein the priority determination unit further comprises: a first reference value calculation unit for calculating a first reference value related to the number of other nodes connected to one node in the constructed integrated gene network; a second reference value calculation unit for calculating a second reference value related to the weight of an edge connecting the one node with the other nodes; and a third reference value calculation unit for calculating a third reference value related to the frequency with which the one node appears on a shortest path connecting any two nodes in the integrated gene network, and
wherein the gene symbol priority is determined using at least one of the first reference value, the second reference value, and the third reference value.

delete

The apparatus of claim 1, wherein the combination symbol detection unit further comprises
a symbol extraction unit for dividing the retrieved text data into sentences, analyzing the divided sentences, and extracting gene symbols from the sentences,
wherein, when at least two gene symbols are extracted from a sentence, the extracted gene symbols are detected as a gene combination symbol.

The apparatus of claim 1, wherein the integrated network construction unit further comprises
a local network construction unit for constructing a local gene network, generated for each seed gene symbol, in which the gene symbols included in the gene combination symbol detected using the seed gene symbol are the respective nodes,
wherein the integrated gene network is constructed using the constructed local gene network.

The apparatus of claim 4, wherein the integrated network construction unit further comprises
a network integration unit for integrating the constructed local gene network according to a predetermined method in consideration of the type of gene symbol included in the gene combination symbol or the frequency with which the gene combination symbol is detected,
wherein the integrated gene network is constructed using the integrated local gene network.

The apparatus of claim 5, wherein integrating according to the predetermined method comprises,
when all of the gene symbols included in the gene combination symbols match, taking each gene symbol of the combination symbol as a node and increasing the weight of the edge connecting the nodes.

The apparatus of claim 5, wherein integrating according to the predetermined method comprises,
when one gene symbol included in the gene combination symbols matches, connecting the gene combination symbols with the matching gene symbol as a common node.

The apparatus of claim 5, wherein integrating according to the predetermined method comprises,
when the gene symbols included in the gene combination symbols do not match, treating the non-matching gene symbols as separate nodes and leaving the gene combination symbols unconnected.

The apparatus of claim 1, wherein the analysis unit further comprises
a verification unit for arranging gene symbols related to the disease according to the determined gene symbol priority and verifying the arranged gene symbols against verification data related to the disease to check their relation to the disease,
wherein the relationship between genes related to the disease is analyzed using the verified gene symbols.

delete

A gene relationship analysis method comprising:
acquiring a seed gene symbol already known as a gene symbol related to a disease to be analyzed;
searching for text data including information about genes using the acquired seed gene symbol, and detecting, in the retrieved text data, a gene combination symbol in which gene symbols are combined;
constructing, based on the detected gene combination symbol, a local gene network in which the gene symbols are nodes and the relationships between the genes are edges, and constructing an integrated gene network by integrating the constructed local gene network; and
determining gene symbol priorities related to the disease to be analyzed using the constructed integrated gene network, and analyzing a relationship between genes related to the disease according to the determined priorities,
wherein the analyzing further comprises determining the priority of gene symbols related to the disease to be analyzed based on the constructed integrated gene network, and
wherein the determining determines the gene symbol priority using at least one of a first reference value related to the number of other nodes connected to one node in the constructed integrated gene network, a second reference value related to the weight of an edge connecting the one node with the other nodes, and a third reference value related to the frequency with which the one node appears on a shortest path connecting any two nodes in the integrated gene network.

delete

The method of claim 12, wherein the detecting further comprises
dividing the retrieved text data into sentences, analyzing the divided sentences, and extracting gene symbols from the sentences,
wherein, when at least two gene symbols are extracted from a sentence, the extracted gene symbols are detected as a gene combination symbol.

The method of claim 12, wherein constructing the integrated gene network further comprises
constructing a local gene network, generated for each seed gene symbol, in which the gene symbols included in the gene combination symbol detected using the seed gene symbol are the respective nodes,
wherein the integrated gene network is constructed using the constructed local gene network.

The method of claim 15, wherein constructing the integrated gene network further comprises
integrating the constructed local gene network according to a predetermined method in consideration of the type of gene symbol included in the gene combination symbol or the frequency with which the gene combination symbol is detected,
wherein the integrated gene network is constructed using the integrated local gene network.

The method of claim 16, wherein integrating according to the predetermined method comprises:
when all of the gene symbols included in the gene combination symbols match, taking each gene symbol of the combination symbol as a node and increasing the weight of the edge connecting the nodes; and
when one gene symbol included in the gene combination symbols matches, connecting the gene combination symbols with the matching gene symbol as a common node.

The method of claim 16, wherein integrating according to the predetermined method comprises,
when the gene symbols included in the gene combination symbols do not match, treating the non-matching gene symbols as separate nodes and leaving the gene combination symbols unconnected.

The method of claim 12, wherein the analyzing further comprises
arranging gene symbols related to the disease according to the determined gene symbol priority, and verifying the arranged gene symbols against verification data related to the disease,
wherein the relationship between genes related to the disease is analyzed using the verified gene symbols.

A program stored in a computer-readable recording medium that, when executed by a processor, implements the gene relationship analysis method according to any one of claims 12 to 14.