KR20100065949A

KR20100065949A - The method to identify the multipurpose potential gene using cross-talk mapping

Info

Publication number: KR20100065949A
Application number: KR1020080124551A
Authority: KR
Inventors: 이동희; 최윤희; 주혜준
Original assignee: 이화여자대학교 산학협력단
Priority date: 2008-12-09
Filing date: 2008-12-09
Publication date: 2010-06-17
Also published as: KR101003175B1

Abstract

PURPOSE: A method for identifying multipurpose potential genes using cross-talk mapping is provided to discover, in large numbers, the genes that are expressed under a specific condition. CONSTITUTION: Genes obtained on the basis of gene expression similarity in microarray data are input together with data giving the relative position of each gene as 2D coordinates. A boundary size for classifying the 2D coordinates into gene sets is determined. New gene set catalogs with four different origins are created. Microarray data for normal cells and stimulated cells are obtained, and the data are compared and analyzed to display gene expression patterns.

Description

The method to identify the multipurpose potential gene using cross-talk mapping

The present invention relates to a method of discovering, from microarray data, genes involved in regulating cell function.

The microarray is a technique mainly used to observe simultaneously the expression changes of genes in settings where large-scale gene variation is expected, such as cancer or stress stimulation. The technique has the advantage that the expression changes of thousands of genes in a cell in response to internal or external stimuli can be analyzed at once, but it also brings the difficulty that the large volume of numerical data it generates must be analyzed and the results linked to biological phenomena. Many numerical analysis methods, and many tools implementing them, have been proposed for interpreting gene expression profiles obtained with microarray technology. Most of them have been used to examine the biological response mechanism to a stimulus by comparing the relative expression differences of genes, selecting the genes associated with the stimulus, and predicting the functions of the newly selected genes from the functions of the already known genes among them. More recently, the field has progressed to inferring the interrelationships among these genes using more sophisticated data analysis methods. Several of these analysis tools are considered very useful for microarray data analysis.

The first is VxInsight, data-mining software developed by Sandia National Laboratories (www.sandia.gov) in the United States. VxInsight helps users discover the information implicit in data by displaying the relationships among large amounts of data stored in a database as two- or three-dimensional pictures. The tool was used as early as 2001 at the Stuart Kim Lab at Stanford University to analyze microarray data for the nematode Caenorhabditis elegans and select co-expressed genes (Science, vol. 293, 2001, 2087-2092). TAIR (The Arabidopsis Information Resource, www.arabidopsis.org) also uses this tool to analyze microarray data obtained through large projects such as the Arabidopsis Functional Genomics Consortium (AFGC) and the AtGenExpress consortium, and provides the related information.

VxInsight consists of two parts. One is an ordination routine called VxOrd, which uses the similarities between data objects to compute their layout on a two-dimensional plane (FIG. 1); the other is a visualization engine that visualizes the computed x, y coordinates of the genes. Because VxOrd places genes with similar expression patterns close together, based on the similarity of every gene pair in the microarray expression profile under analysis, the two-dimensional coordinates of co-expressed genes tend to be concentrated in very similar regions. A dense region therefore means that that many genes are frequently expressed together. VxInsight shows the clustering result not only in two dimensions but also as a three-dimensional mountain terrain map, in which the height of a mountain reflects the number of co-expressed genes. That is, if more genes are concentrated at position A than at position B, the mountain at A is higher than the one at B. VxInsight's visualization engine thus makes it easy to grasp the information implicit in a large amount of data. However, when the analysis yields genes whose positions form a very long continuous chain, it is difficult to treat the genes at the two ends as a single group with similar expression patterns, so a further criterion is needed for separating them into sub-clusters when interpreting the data.

The second is a statistical analysis method called Gene Set Enrichment Analysis (GSEA). It classifies genes on the basis of known information such as biological function or physical location on the genome, groups the genes belonging to the same category into a single group called a gene set, and uses these sets to analyze gene expression data and determine which sets of genes are involved in the difference between phenotypes. The method was developed in the early 2000s by the Broad group at MIT in the United States (Nature Genetics 2003, vol. 34, 267-273; PNAS 2005, 102, 15545-15550). Genes can be classified not only by the function of the gene product or by genomic location but by many other criteria, and when gene sets are constructed from these various viewpoints, knowing which set is activated in which situation greatly deepens the understanding of biological phenomena. However, the construction of the gene sets, which is the core of the analysis, must come first, and when the number of gene sets used and of groups to be analyzed grows, a great deal of time is again spent integrating the results to understand the biological mechanism.

The last is ARACNe (Algorithm for the Reverse Engineering of Accurate Cellular Networks) (Nature Genetics vol. 37, 382-390), a reverse engineering algorithm developed by the Califano lab at Columbia University in the United States that reconstructs cellular networks from microarray data. The algorithm computes a mutual information measure that expresses the association between genes. Mutual information (hereinafter MI) is, in probability theory and information theory, a value expressing the degree to which two variables occur dependently on each other. It compares how often A occurs, how often B occurs, and how often A and B occur together; if A and B always appear together whenever either appears, this is grounds for judging that a necessary relationship exists between A and B. By finding gene pairs whose MI value exceeds a given confidence level, the algorithm therefore helps build a more concrete understanding of the interactions between genes. However, because of concerns about the reliability of the results, an analysis using this algorithm in principle requires a large amount of microarray data, at least 100 arrays.
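The mutual information idea described above can be illustrated with a minimal sketch. It assumes that expression values have already been discretized into labels (for example "low" and "high"); this is a simplification chosen for illustration and is not the estimator used by ARACNe itself. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in bits) between two equal-length label sequences.

    Assumption: values are already discretized; this is a plain
    plug-in estimate from observed frequencies.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    count_a = Counter(a)          # how often each label of A occurs
    count_b = Counter(b)          # how often each label of B occurs
    count_ab = Counter(zip(a, b))  # how often each (A, B) pair occurs together
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written with raw counts
        ratio = c * n / (count_a[x] * count_b[y])
        mi += p_xy * math.log2(ratio)
    return mi
```

Two genes whose labels always change together give the maximum value for their label distribution, while two genes whose labels are statistically independent give a value of zero.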

All of the analysis methods reviewed so far help in understanding the response mechanisms of cells from gene expression data, but none of them offers a way to satisfy the preconditions that must hold for the method to be used to full effect. As noted above, effective analysis with GSEA first requires the construction of gene sets that are both very precise and rich in information, and implementing a reverse engineering algorithm requires a large number of microarray data sets because of the reliability issue. Although a variety of analysis methods for microarray data have been proposed, many researchers still find it very difficult to turn the numerical data in a gene expression profile into a gene-level understanding of cellular responses to internal and external stimuli, and these problems are thought to be one cause.

The present inventors therefore recognized the need to develop a method of extracting biologically useful information. They modified and supplemented the three microarray data analysis methods mentioned above as appropriate, combined them so as to produce a synergistic effect, and developed a method for rapidly discovering, from gene expression profiles, the relationship between external stimuli and the groups of genes involved in regulating cell function.

The inventors completed the present invention by showing, through comparison with information in the published literature, that the method can be applied to plants, for which large amounts of microarray data obtained under similar conditions are hard to come by. For example, when the expression profiles of Arabidopsis genes exposed to environmental stress are analyzed, the method not only enables the discovery of large numbers of Arabidopsis genes involved in the environmental stress response but also makes the interactions among these genes easy to observe.

An object of the present invention is to provide a method of discovering in large numbers the genes commonly expressed under a specific condition such as stress by combining microarray data analysis methods, and of selecting genes that are response-specific or that respond to multiple stimuli by analyzing the interactions among these genes.

A further object is to provide an apparatus that combines the operation units associated with the microarray data analysis methods to discover in large numbers the genes commonly expressed under a specific condition such as stress, and that selects response-specific or multi-stimulus-responsive genes by analyzing the interactions among these genes.

To achieve the above objects, the present invention provides a method in which groups of genes showing similar expression patterns are formed into gene sets on the basis of the clustering result of a large amount of microarray data analyzed by a data mining program, and the gene sets expressed under a specific condition are rapidly discovered in large numbers by applying a statistical analysis method to those gene sets.

The present invention also provides, in addition to the above method, a method of selecting correlated genes by applying a reverse engineering program to the genes in the discovered gene sets.

The present invention also provides an apparatus for discovering, from microarray data, large numbers of genes involved in regulating cell function.

The present invention also provides an apparatus for selecting correlated genes which, in addition to the above, includes an operation unit that reconstructs the intracellular network.

By effectively combining the methods used for microarray data analysis, the present invention makes it possible, within a short time, to discover responsive genes in large numbers and to select genes that are specific to a particular response or stress or that are involved in cross-talk.

The present invention also introduces the concept of overlap to cluster continuous data that is difficult to partition, thereby avoiding the loss of opportunity in which an item, once assigned to one cluster, can no longer belong to a cluster formed in a different way.

In addition, by applying a reverse engineering algorithm to the highly significant genes selected through statistical analysis and analyzing the interactions among them, the present invention overcomes the limitation of the existing method, which required a large amount of microarray data for reasons of reliability.

The present invention also makes it easy to select the group of genes associated with a specific gene and to identify their interactions.

The analysis method of the present invention makes it easy to derive hypotheses that explain, at the gene level, the response mechanism of cells to external stimuli, and helps in designing experiments to verify them.

To solve the above problems, the present invention provides a method of discovering, from microarray data, large numbers of genes involved in regulating cell function, the method comprising:

a first step of receiving as input the genes obtained on the basis of gene expression similarity in microarray data using a data mining program for determining the relative positions of genes, together with data in which the relative position of each gene is calculated as two-dimensional coordinates;

a second step of determining a boundary size for classifying the two-dimensional coordinates of the genes input in the first step into gene sets;

a third step of creating new gene set catalogs having four different origins, based on the boundary size determined in the second step;

a fourth step in which a statistical analysis method is applied to the gene set catalogs created in the third step, microarray data for normal cells (control group) and stimulated cells (experimental group) are obtained, and the data are compared and analyzed to display gene expression patterns; and

a fifth step in which, based on the data analyzed in the fourth step, a visualization program loaded into the microprocessor and memory of a computer is used to visualize the data linking all of the gene sets analyzed in the fourth step, and cross-talk gene sets are selected.

Here, the microarray data used to construct the gene sets and for the comparative analysis are experimental data obtained with Affymetrix ATH1 (25K) arrays and stored in the microarray database provided by TAIR (The Arabidopsis Information Resource, //www.arabidopsis.org). The data are not limited to these, however: experimental data from the various kinds of microarrays provided by NCBI's GEO (Gene Expression Omnibus, //www.ncbi.nlm.nih.gov/geo/) or EBI's ArrayExpress (//www.ebi.ac.uk/microarray-as/ae/) can be used in the same way to construct co-expressed gene sets in organisms other than Arabidopsis.

In addition, microarray data obtained in the laboratory can also be used to discover the genes commonly expressed across particular stimuli in a particular organism.

The data mining program is a data analysis technique for analyzing correlations and patterns in large volumes of microarray data. The present invention is based on VxOrd, part of VxInsight provided by Sandia National Lab. (//www.sandia.gov), but the program STN AnaVist (//www.stn-international.de/stninterfaces/stnanavist/stn_anavist.html) can also be used, and the invention is not limited to a particular program as long as analyzed data on correlations and expression patterns can be obtained from the microarray data.

Because the relative position data calculated in the first step, which reflect the expression similarity of the genes, are by nature continuous, a further criterion is needed to separate them into sub-clusters when interpreting the data. To solve this problem, the inventors created four types of gene set catalogs with different origins. That is, the starting point of the square of side length 15 that serves as the unit for dividing the genes on the two-dimensional plane into sets was varied, giving four types of gene set catalogs named 1) UL (Up-Left), 2) UR (Up-Right), 3) DL (Down-Left), and 4) DR (Down-Right).

The boundary size in the second step is the distance between genes within which they are regarded as co-expressed and can be accepted as a single co-expressed gene set. It is preferably, but not necessarily, chosen so that one gene set can contain 4 to 20 genes. The boundary size may accordingly be 10 to 20 and is preferably 15, but is not limited thereto.
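The construction of the four catalogs can be sketched as follows. The text does not state exactly how the four origins are placed, so this sketch assumes that each catalog tiles the plane with squares of the boundary size starting from one corner of the bounding box of all gene coordinates. The function name `build_catalogs` and the cell indexing scheme are hypothetical.

```python
import math
from collections import defaultdict


def build_catalogs(coords, size=15.0):
    """Group genes into square cells for four differently anchored grids.

    coords: dict mapping gene name -> (x, y) layout coordinates.
    size:   boundary size (side length of one square cell).
    Returns a dict: catalog name -> {cell index: [gene names]}.
    Assumption: each grid is anchored at one corner of the bounding box.
    """
    xs = [x for x, _ in coords.values()]
    ys = [y for _, y in coords.values()]
    corners = {
        "UL": (min(xs), max(ys)),  # Up-Left
        "UR": (max(xs), max(ys)),  # Up-Right
        "DL": (min(xs), min(ys)),  # Down-Left
        "DR": (max(xs), min(ys)),  # Down-Right
    }
    catalogs = {}
    for name, (ox, oy) in corners.items():
        cells = defaultdict(list)
        for gene, (x, y) in coords.items():
            # Cell index counted outward from the chosen corner.
            cell = (math.floor(abs(x - ox) / size),
                    math.floor(abs(y - oy) / size))
            cells[cell].append(gene)
        catalogs[name] = dict(cells)
    return catalogs
```

With three genes at x = 0, 10, and 20 on one line, the UL grid groups the first two together while the UR grid groups the last two, so the middle gene is not forced into a single partition. This is the overlap effect the four origins are meant to provide.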

When the plane is divided using the four types of origin in this way, the data are integrated through overlap at the visualization stage after GSEA statistical processing. Introducing the concept of overlap makes it possible to cluster continuous data that is difficult to partition, and avoids the loss of opportunity in which an item assigned to one cluster cannot belong to another. In addition, because the calculation is done on clusters, the computational burden is greatly reduced, which makes for an efficient analysis.

The statistical analysis method of the fourth step is preferably the gene set enrichment analysis (GSEA) method developed by the Broad group at MIT, but methods such as PAGE (parametric analysis of gene set enrichment) (Ruan J. et al., BMC Bioinformatics. 2005; 6:114) can also be used, and the method is not limited to these as long as it performs a similar calculation.
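As an illustration of the kind of calculation the fourth step performs, the following is a minimal sketch of a GSEA-style running-sum enrichment score for one gene set. It assumes the genes are already ranked by a score comparing the experimental group with the control group; significance testing by permutation, which the full GSEA method includes, is omitted. The function name `enrichment_score` and its parameters are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene names ordered by their ranking score.
    scores:       ranking score of each gene, in the same order.
    gene_set:     set of gene names to test.
    p:            exponent weighting the hits by |score|.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all hit weights are zero")
    running = 0.0
    best = 0.0
    for w, hit in zip(weights, in_set):
        # Step up at genes in the set, step down at genes outside it.
        running += w / total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 indicates that the gene set is concentrated at the top of the ranking, and a score near -1 that it is concentrated at the bottom.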

The visualization program is a program that visualizes the data obtained through the fourth step. Using such a program, key genes can be found immediately. Programs that can be used include the graph visualization program Graphviz (http://www.graphviz.org/), Cytoscape (//www.cytoscape.org/), developed by the Institute for Systems Biology in the United States, and Pajek (//vlado.fmf.uni-lj.si/pub/networks/pajek/), developed jointly by Vladimir Batagelj and Andrej Mrvar. The specific embodiments of the present invention use Graphviz, but the invention is not limited thereto.
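One way to prepare input for such a visualization program is sketched below. It assumes that the fourth step yields, for each stimulus, a list of significant gene set names, and it writes an undirected graph in the Graphviz DOT language linking each stimulus to its gene sets. The data layout, the function names `to_dot` and `shared_sets`, and the example gene set names are all hypothetical.

```python
def to_dot(significant):
    """Return DOT text linking each stimulus to its significant gene sets.

    significant: dict mapping stimulus name -> iterable of gene set names.
    """
    lines = ["graph crosstalk {"]
    for stimulus in sorted(significant):
        lines.append(f'  "{stimulus}" [shape=box];')
        for gene_set in sorted(significant[stimulus]):
            lines.append(f'  "{stimulus}" -- "{gene_set}";')
    lines.append("}")
    return "\n".join(lines)


def shared_sets(significant, min_stimuli=2):
    """Gene sets that are significant under at least min_stimuli stimuli."""
    counts = {}
    for sets in significant.values():
        for name in set(sets):
            counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, c in counts.items() if c >= min_stimuli)
```

For example, with `{"cold": ["UL_3_4", "DR_1_1"], "salt": ["UL_3_4"]}` the gene set `UL_3_4` is connected to both stimuli in the graph and is returned by `shared_sets` as a cross-talk candidate.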

The genes belonging to the gene sets selected in the fifth step can be judged to be genes that respond to stress and stimuli and are involved in regulating cell function.

That is, the method of the present invention makes it possible to discover in large numbers the genes involved in regulating cell function, and the visualization program makes it possible to find specific or cross-talking gene sets and genes quickly and easily.

The present invention also provides a method of selecting correlated genes which, in addition to the above steps, further comprises:

a sixth step of characterizing, for the gene sets selected in the fifth step, the distribution of core genes within each set under each stimulus in terms of cohesion and contribution, subdividing the gene sets according to how closely they are related to the response to the stimuli into four kinds, namely i) gene sets with high cohesion and high contribution that are cross-talking for all stimuli, ii) gene sets with low cohesion that nevertheless show very high contribution and cohesion for some of the stimuli judged to be cross-talking, iii) pseudo cross-talk gene sets with high cohesion but low contribution, and iv) false-positive gene sets with both low cohesion and low contribution, and then finding the real cross-talk gene sets; and

a seventh step of reconstructing the intracellular network for the genes included in the real cross-talk gene sets found in the sixth step, using a program loaded into a microprocessor and memory that implements a reverse engineering algorithm, and selecting the genes predicted to be correlated within the cell.

Here, 'contribution' means the ratio of the core genes, those that played a decisive role in the selection of a gene set under a particular stimulus, to the total number of genes in that set. The closer this value is to 1, the more of the genes in the set respond to that stimulus.

'Cohesion' means, when a gene set has been selected as responding to multiple stimuli, the proportion of the total number of genes in the set accounted for by genes identified as core genes under all of the stimuli A, B, and C. The closer this value is to 1, the more likely it is that the genes form not only a gene set that is well grouped as a functional unit but also a biologically useful real cross-talk gene set (FIG. 13).
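The two ratios defined above can be written out as a minimal sketch. It assumes that the core genes of a gene set under each stimulus are already known, for example from the statistical analysis of the fourth step. The function names `contribution` and `cohesion` and the data layout are hypothetical.

```python
def contribution(gene_set, core_genes):
    """Fraction of the genes in a set that are core genes under one stimulus."""
    members = set(gene_set)
    if not members:
        raise ValueError("gene set is empty")
    return len(members & set(core_genes)) / len(members)


def cohesion(gene_set, core_by_stimulus):
    """Fraction of the genes in a set that are core genes under every stimulus.

    core_by_stimulus: dict mapping stimulus name -> iterable of core genes.
    """
    members = set(gene_set)
    if not members or not core_by_stimulus:
        raise ValueError("gene set and stimulus table must be non-empty")
    shared = set(members)
    for core in core_by_stimulus.values():
        shared &= set(core)  # keep only genes that are core everywhere
    return len(shared) / len(members)
```

For a set of four genes of which three are core genes under stimulus A, the contribution for A is 0.75; if only two of the four are core genes under all of A, B, and C, the cohesion is 0.5.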

The reverse engineering program used here is a program that computes a mutual information measure expressing the association between genes and thereby helps predict the structure of the network among the genes. The present invention uses ARACNe (Nature Genetics vol. 37, 382-390), but is not limited to it; a linear model (BMC Bioinformatics. 2004 10;5:108), a Bayesian model (BMC Syst Biol. 2008 4;2:57), or a Gaussian network model (Genome Res. 2007 17(11):1614-25) can all be applied. Using such a program, it was possible to confirm which mechanisms in the intracellular network the 'cross-talk genes' selected by the present invention actually take part in.

Through the combination of analysis methods described above, the present invention provides a method that uses the many microarray results already available, analyzes their expression patterns, discovers in large numbers the responsive genes that interact in response to particular stimuli, and selects response-specific genes by observing interactions at the gene level.

To achieve the above objects, the present invention also provides an apparatus for discovering in large numbers the genes involved in regulating cell function. The apparatus consists of three parts: an input unit that receives data, an operation unit that processes the input data, and an output unit that delivers the processed results. The apparatus is described in more detail below.

The apparatus provides an input unit that receives the genes obtained on the basis of gene expression similarity in microarray data, together with the relative position data of those genes expressed as two-dimensional coordinates.

Here, the microarray data used to construct the gene sets and for the comparative analysis are experimental data obtained with Affymetrix ATH1 (25K) arrays and stored in the microarray database provided by TAIR (The Arabidopsis Information Resource, //www.arabidopsis.org). The data are not limited to these, however: experimental data from the various kinds of microarrays provided by NCBI's GEO (Gene Expression Omnibus, //www.ncbi.nlm.nih.gov/geo/) or EBI's ArrayExpress (//www.ebi.ac.uk/microarray-as/ae/) can be used. Various microarray data obtained in the laboratory can also be used.

In the present invention, the coordinate values that place the genes on a two-dimensional plane according to the expression similarity seen in the data are computed with VxInsight, provided by Sandia National Lab. (//www.sandia.gov). A program called STN AnaVist (//www.shinwon.co.kr/cas/products/anavist/index.html) may also be used, and the choice of program is not limited so long as it can provide position data by analyzing associations and expression patterns in microarray data.

In addition, the following operation units are provided:

a first operation unit that classifies the input two-dimensional gene coordinates into gene bundles according to a fixed boundary size;

a second operation unit that, based on the boundary size of the gene bundles, classifies them into new gene bundle catalogs having four different origins;

a third operation unit that obtains microarray data for normal cells (control group) and stimulated cells (experimental group), compares the values by statistical analysis using the gene bundle catalogs, and displays the expression pattern of the genes;

a fourth operation unit that, based on the data obtained from the third operation unit, visualizes the gene bundle expression map using a visualization program loaded into the microprocessor and memory of a computer, and selects the cross-talk gene bundles; and

output means for outputting the operation results; together these constitute the apparatus for large-scale discovery of genes involved in regulating cell function.

Because the relative position data obtained by the first operation unit, which reflect the expression similarity of the genes, are continuous in nature, a further criterion is needed to separate them into sub-clusters when the data are interpreted. To address this, the inventors constructed coordinate grids with different origins. That is, the gene bundles obtained in the second operation unit are compiled into four types of gene bundle catalog, each with a different origin, named 1) UL (Up-Left), 2) UR (Up-Right), 3) DL (Down-Left), and 4) DR (Down-Right).

In addition, the statistical analysis method of the fourth operation unit is preferably gene set enrichment analysis (GSEA), developed by the Broad group at MIT. Methods such as PAGE (parametric analysis of gene set enrichment; Ruan J et al., BMC Bioinformatics. 2005; 6:114) may also be used, but the method is not limited to these.

In addition, the present invention provides, in addition to the fourth operation unit among the operation units of the apparatus for large-scale discovery of genes involved in regulating cell function,

a fifth operation unit that classifies the gene bundles responding to multiple stimuli, on the basis of cohesion and contribution, into i) gene bundles with high cohesion and high contribution that are cross-linked to all of the stimuli, ii) gene bundles with low cohesion that nevertheless show very high contribution and cohesion for some of the stimuli judged to be cross-linked, iii) pseudo cross-talk gene bundles with high cohesion but low contribution, and iv) false-positive gene bundles with low cohesion and low contribution, and then identifies the real cross-talk gene bundles;

a sixth operation unit that reconstructs the intracellular network for the genes contained in the real cross-talk gene bundles found by the fifth operation unit, and selects the genes predicted to be interrelated within the cell; and

output means for outputting the operation results, thereby providing an apparatus for selecting interrelated genes.

The reverse engineering program used here computes mutual information values, which express the association between genes, and thereby helps predict the structure of the network among the genes. 'ARACNe' (Nature Genetics vol 37, 382-390), linear models, and Bayesian or Gaussian networks may be used. ARACNe is preferred, but the program is not limited to it.

By combining the operation units described above, the present invention provides an apparatus that draws on the many previously published microarray results and analyzes their expression profiles, so that genes responding interactively to a specific stimulus can be discovered on a large scale and response-specific genes can be selected by observing interactions at the gene level.

Hereinafter, the present invention is described in detail by way of examples.

However, the following examples merely illustrate the present invention, and the content of the present invention is not limited by them.

<Example 1> Construction of co-expressed gene bundles from Arabidopsis genes

Using VxOrd from VxInsight, data-mining software developed by Sandia National Laboratories (www.sandia.gov) in the United States, position data on the XY plane were downloaded that reflect the expression similarity of about 28,000 Arabidopsis genes across 1,436 microarray data sets in the TAIR microarray database that used the Affymetrix ATH1 (25K) genome array as the platform. These data served as the basis for constructing co-expressed gene bundles (FIG. 2). Several points must be considered, however, when gene bundles are built from the relative two-dimensional positions of the genes. First, the distance between genes that qualifies as co-expression must be defined; that is, how close genes must be to count as members of the same co-expressed gene bundle. Second, no gene should have a position that qualifies as co-expression with another gene and yet be excluded from that gene's bundle. This problem arises because the gene positions form a continuum. For example, gene 'a1' may become a member of bundle A because it is close to genes 'a2' and 'a3', yet it may also be close enough to gene 'b1' to be called co-expressed with it; because the criterion has already placed 'a1' in bundle A and 'b1' in bundle B, the two cannot be members of the same bundle. Finally, redundancy among the gene bundles in one catalog should be avoided as far as possible, because the false discovery rate (FDR), meaning the proportion of false positives among the genes in bundles judged to be statistically significantly enriched, is affected by the full collection of gene bundles entering the analysis. If gene bundles are generated so that no possible grouping is missed, the bundles come to differ very little in their membership, which can affect the results of GSEA (gene set enrichment analysis).

Taking these three points into account, the distance between genes that qualifies as co-expression can be about 10 to 20, but 15 is preferred because it yields 4 to 20 genes per gene bundle, a number suitable for interpretation. In the present invention the boundary size was set to 15 so that the square frame defining one gene bundle always has the same size.

In other words, the x/y plane was divided into small square blocks with a side length of 15, and the genes located within each block were grouped into the same gene bundle. In addition, to exclude overlap between bundles while solving the problem that a gene assigned to one bundle cannot be grouped with a sufficiently close gene in another bundle, four gene bundle catalogs (UL, UR, DL, DR) with different origins were constructed from these square blocks (FIG. 3). That is, when the XY coordinates are divided into blocks with a boundary size of 15, gene bundle catalogs can be built with the origin at (0,0), (7.5,0), (0,7.5), and (7.5,7.5). Each catalog, with its own origin, is thereafter used in the analysis as a single unit.
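
The gridding described above can be sketched in Python as follows. This is a minimal illustration, not the inventors' implementation: the function name `build_catalogs`, the pairing of each catalog code with a particular origin, and the `CODE_<column>x<row>y` naming of blocks by block index are assumptions made for the example. The source gives only the boundary size of 15 and the four origins.

```python
from collections import defaultdict
from math import floor

BOUNDARY = 15.0  # boundary size stated in the text

# ASSUMPTION: the text lists four codes and four origins but does not say
# which code belongs to which origin; this pairing is illustrative only.
ORIGINS = {"UL": (0.0, 0.0), "UR": (7.5, 0.0), "DL": (0.0, 7.5), "DR": (7.5, 7.5)}


def build_catalogs(positions, boundary=BOUNDARY, origins=ORIGINS):
    """Group genes into square blocks, once per origin.

    positions: dict mapping gene name -> (x, y) coordinates.
    Returns: dict mapping catalog code -> {block name: [gene, ...]}.
    """
    catalogs = {code: defaultdict(list) for code in origins}
    for gene, (x, y) in positions.items():
        for code, (ox, oy) in origins.items():
            # Index of the block that contains the gene in this shifted grid.
            col = floor((x - ox) / boundary)
            row = floor((y - oy) / boundary)
            # ASSUMPTION: hypothetical naming scheme based on block indices.
            catalogs[code][f"{code}_{col}x{row}y"].append(gene)
    return {code: dict(blocks) for code, blocks in catalogs.items()}


# Toy positions: a1 and b1 are close together but straddle a block edge at x = 15.
example = build_catalogs({"a1": (14.0, 10.0), "a2": (9.0, 10.0), "b1": (16.0, 10.0)})
```

With the (0,0) origin, the genes at x = 14 and x = 16 fall into different blocks; in the grid shifted by half the boundary size they share a block, which is the situation the four catalogs are meant to capture.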

When microarray data are analyzed by GSEA, the blocks of each catalog are used as distinct gene bundles. This not only solves the two problems noted above, overlapping bundles and missing bundle combinations, but also has the advantage that, when the results are interpreted, the enriched gene bundles obtained with the four catalogs can be combined to identify co-expressed gene bundles more accurately. As noted earlier, the two-dimensional coordinates that VxInsight computed from the expression changes of Arabidopsis genes across 1,436 gene expression profiles obtained with ATH1 (25K) microarrays can be regarded as representing the relative relationships among these genes. The co-expressed gene bundles derived from them therefore have the advantage of allowing predictions about the gene modules that an organism uses to respond to a particular stimulus or signal.

<Example 2> Analysis of gene bundles using GSEA and visualization of the results

<2-1> GSEA analysis using co-expressed gene bundles

To improve the molecular-biological understanding of how a plant's limited genetic resources are deployed to produce tailored responses, the co-expressed gene bundles constructed above and the GSEA method were applied to microarray data from environmental stress experiments. The microarray data downloaded from TAIR for this analysis are time-series data, all obtained with the same Affymetrix ATH1 array: the roots and stems of Arabidopsis plants 18 days after sowing were each subjected to nine abiotic stresses, and microarray experiments were performed on the plants after set intervals so that changes in gene expression over time could be observed (FIG. 4). Of these data, collected independently for roots and stems at six time points, we ran GSEA on the 30-minute, 1-hour, and 3-hour data to compare the early responses to each stress. The genes whose expression increased or decreased under each condition were then marked (FIG. 5). As a result, GSEA made it possible to select the gene bundles that appear to be associated with the early response of Arabidopsis to each stress environment. When the number of gene bundles showing contrasting expression under each stress treatment was examined separately for roots and stems, the early changes in Arabidopsis gene expression were found to differ not only by stress type but also by tissue. For example, comparison between stresses showed that the early response to cold stress, unlike the response to salt or heat stress, involves a large number of gene bundles, and comparison between tissues showed that, on exposure to cold stress, more gene bundles show an early increase in expression in stems than in roots.

<2-2> Analysis and visualization of co-expressed genes using GSEA

As the graph above shows, the GSEA results obtained with the co-expressed gene bundles for the early gene expression data of Arabidopsis under nine abiotic stresses can themselves be viewed as another multidimensional data set, allowing comparison by tissue, by stress, and so on. It is therefore very important to organize, integrate, and compare these results in a way that helps explain the plant's early response. However, the gene bundle names, which consist of a type indicating the origin and a pair of position coordinates (for example, 23x34y), make it hard to grasp intuitively how the whole set of Arabidopsis genes changes in response to internal and external stimuli, even after the results have been tidied. The inventors therefore developed a way to visualize the results so that they can be interpreted more easily.

The two-dimensional coordinates computed by VxOrd make it simple to draw a two-dimensional graph. Building on this, we created an overlapped co-expression map that combines the x/y position coordinates given in each gene bundle name with the starting point implied by its type code, and marked each enriched gene set at its corresponding position, so that expression changes across all gene bundles can be grasped easily (FIG. 6).
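
One way to place a gene bundle on such a map is sketched below. It assumes, hypothetically, that a name such as `DL_38x19y` encodes a catalog code followed by column and row block indices, that blocks are 15 units on a side, and that the codes pair with the origins as listed in `ORIGINS`. None of these conventions is confirmed by the source, and the function name `bundle_rectangle` is illustrative.

```python
import re

BOUNDARY = 15.0  # boundary size stated in the text

# ASSUMPTION: illustrative pairing of type codes with the four origins.
ORIGINS = {"UL": (0.0, 0.0), "UR": (7.5, 0.0), "DL": (0.0, 7.5), "DR": (7.5, 7.5)}

# Accepts both 'DL_38x19y' and 'DR55x57y' spellings seen in the text.
NAME_PATTERN = re.compile(r"^(UL|UR|DL|DR)_?(\d+)x(\d+)y$")


def bundle_rectangle(name, boundary=BOUNDARY, origins=ORIGINS):
    """Return (x0, y0, x1, y1) of the block a gene bundle name refers to.

    ASSUMPTION: the numbers in the name are block indices, not raw coordinates.
    """
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized gene bundle name: {name!r}")
    code = match.group(1)
    col, row = int(match.group(2)), int(match.group(3))
    ox, oy = origins[code]
    # Shift by the catalog origin so all four catalogs share one coordinate system.
    x0, y0 = ox + col * boundary, oy + row * boundary
    return (x0, y0, x0 + boundary, y0 + boundary)
```

Because every rectangle is expressed in the same coordinate system, blocks from the four catalogs can be drawn on top of one another, which is what makes the map "overlapped".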

FIG. 7 uses this co-expression map to show, for each stress, the Arabidopsis gene bundles enriched at 30 minutes, 1 hour, and 3 hours after stress treatment, the period in which the early response is expected to occur. In the figure, blue indicates activation in the A-group and red activation in the B-group. The figure shows that different gene bundles are activated under different stresses, and the overall distribution of the marked co-expressed gene bundles gives a rough picture of how similar the responses are. For example, for ultraviolet stress and physical wounding the overall patterns are readily seen to be similar, so the map alone shows that the highly enriched gene bundles are very much alike (FIG. 7).

<Example 3> Identification of stress-specific and stress-general gene bundles by integrating GSEA results

To compare in more detail the gene bundles selected by GSEA as activated in each stress environment, the results were arranged in a matrix, as in FIG. 8, with the stress types across the top and the names of the enriched gene bundles down the side. The table made it possible to check whether a gene bundle is stress-specific or appears generally across stresses. A gene bundle that is activated in common under several stresses is likely to contain genes involved in cross-talk between stress signaling pathways, and is hereinafter called a cross-talk gene bundle. Among these apparent cross-talk gene bundles, some were enriched in only one of the A-group or B-group depending on their association with the stimulus, while others were enriched in a different group depending on the stress environment.
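
The table described above can be sketched as a small data structure. The example below is illustrative only: the input layout (stress name mapped to enriched bundle names and their group), the `-` placeholder for "not enriched", and the function name `summarize_enrichment` are assumptions, and the toy entries are not results from the source.

```python
def summarize_enrichment(enriched):
    """Build a bundle-by-stress matrix and split bundles by how many stresses hit them.

    enriched: dict mapping stress -> {bundle name: "A" or "B"}.
    Returns (stresses, matrix, specific, shared), where matrix maps each
    bundle to one cell per stress ("A", "B", or "-" when not enriched).
    """
    stresses = sorted(enriched)
    bundles = sorted({name for hits in enriched.values() for name in hits})
    matrix = {b: [enriched[s].get(b, "-") for s in stresses] for b in bundles}
    hit_counts = {b: sum(cell != "-" for cell in row) for b, row in matrix.items()}
    specific = [b for b in bundles if hit_counts[b] == 1]  # one stress only
    shared = [b for b in bundles if hit_counts[b] > 1]     # cross-talk candidates
    return stresses, matrix, specific, shared


# Hypothetical toy input, not data from the patent.
toy = {
    "cold": {"DL_38x19y": "A", "UL_1x2y": "B"},
    "heat": {"DL_38x19y": "A"},
    "salt": {"DL_38x19y": "A"},
}
stresses, matrix, specific, shared = summarize_enrichment(toy)
```

Keeping the group label in each cell, rather than a plain yes/no, preserves the distinction the text draws between bundles that are always enriched in one group and bundles whose group changes with the stress.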

When GSEA was run on the gene expression profile for each stress, the control data were designated the A-group and the experimental data the B-group. Enrichment of a gene bundle in the B-group therefore means that at least a certain number of the genes in that bundle were expressed more strongly than in Arabidopsis in the normal state, and enrichment in the A-group means the opposite, that expression was reduced. FIG. 9 shows, as an image of signal intensity on the gene expression profile, the expression profiles at the early time points (30 min, 1 h, 3 h) of the genes belonging to 'DL_38x19y', one of the gene bundles activated in the A-group when root tissue was exposed to cold, genotoxic, heat, oxidative, salt, or wound stress, and to 'DL_55x55y', one of the gene bundles enriched in the B-group under drought, genotoxic, osmotic, salt, or UVB stress. The image was produced with the visualization program 'tree-view', developed by the Eisen lab at UC Berkeley (//rana.lbl.gov). It readily confirms that most genes in the bundle enriched in the A-group are expressed less than in the normal state and, conversely, that most genes in the bundle enriched in the B-group are expressed more. The X axis of FIG. 8 represents the experimental conditions, namely 30 minutes, 1 hour, and 3 hours. On the Y axis each line is one gene and shows the ratio of gene expression between the normal and stressed states: the higher the expression in the normal state, the deeper the green, and in the opposite case the deeper the red.

With a visualization program of this kind, stress-specific gene bundles and stress-general gene bundles expressed under all stresses can be readily discovered.

<Example 4> Integration of GSEA results across conditions using a visualization program

The matrix-style table described above makes it easy to see whether an individual gene bundle is associated with each of the nine stress environments. It remains difficult, however, to grasp comprehensively the relationships between the stresses and the whole collection of activated gene bundles, that is, which bundles are involved in cross-talk and which are stress-specific. To address this, the graph visualization program 'Graphviz' was used: all enriched gene bundles were classified into three groups, namely bundles that change only toward increased expression, bundles that change only toward decreased expression, and bundles that increase or decrease depending on the stress, and a diagram linking them to the corresponding stresses was drawn (FIG. 10).
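
A diagram of this kind can be described to Graphviz in its DOT language. The sketch below only generates DOT text and does not call Graphviz itself. The input layout, the function names `direction_class` and `to_dot`, the graph name, and the colour chosen for each of the three groups are assumptions made for illustration, and the toy entries are not results from the source.

```python
def direction_class(groups):
    """Classify a bundle from the set of groups ("A"/"B") it was enriched in.

    B-group enrichment means increased expression, A-group means decreased.
    """
    if groups == {"B"}:
        return "up"
    if groups == {"A"}:
        return "down"
    return "mixed"


def to_dot(enriched):
    """Return DOT text linking each stress to the bundles enriched under it.

    enriched: dict mapping stress -> {bundle name: "A" or "B"}.
    """
    by_bundle = {}
    for stress, hits in enriched.items():
        for bundle, group in hits.items():
            by_bundle.setdefault(bundle, {})[stress] = group
    # ASSUMPTION: colour coding is illustrative, not taken from FIG. 10.
    colors = {"up": "red", "down": "blue", "mixed": "gray"}
    lines = ["graph crosstalk {"]
    for bundle in sorted(by_bundle):
        cls = direction_class(set(by_bundle[bundle].values()))
        lines.append(f'  "{bundle}" [shape=box, color={colors[cls]}];')
        for stress in sorted(by_bundle[bundle]):
            lines.append(f'  "{stress}" -- "{bundle}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical toy input, not data from the patent.
dot_text = to_dot({
    "cold": {"DL_38x19y": "A"},
    "salt": {"DL_38x19y": "A", "DL_55x55y": "B"},
    "drought": {"DL_55x55y": "B"},
})
```

The resulting text can be saved to a file and rendered with a Graphviz layout program such as `dot` or `neato`. Bundles connected to several stress nodes then stand out as cross-talk candidates.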

This figure, which integrates the nine overlapped co-expression gene bundle maps above, has the advantage of showing more specifically not only whether each stress is associated with an individual gene bundle, but also which cross-talk gene bundles are shared between individual stresses and which early-response gene bundles are specific to one stress. It is therefore referred to here as a cross-talk map.

Because the gene bundles from the four different origins are merged into one at this stage, the cross-talk relationships can be identified more accurately.

<Example 5> Method for judging whether a selected cross-talk gene bundle is genuine

The calculation devised by Subramanian A et al. (PNAS October 25, 2005, vol. 102, No. 43, 15545-15550) was used to select the gene bundles that respond under a given condition. The genes that played the decisive role in a bundle receiving an enrichment score above a set level, and hence being judged to show an expression pattern different from the normal state (hereinafter, core genes), are listed by stress in FIG. 11. It is readily apparent that even for a gene bundle activated under all five stress conditions, the core genes considered to be actually associated with each stress are not identical. Consequently, the links between gene bundles and their associated stresses in the cross-talk map are not sufficient on their own to judge whether a bundle is a cross-talk bundle, and a more precise selection method is needed to identify real cross-talk gene bundles.
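
The enrichment score and the core genes can be illustrated with a simplified running-sum calculation in the style of GSEA. This is a sketch under stated assumptions, not the GSEA software: it takes a list of genes already ranked by some score (highest first), uses exponent p = 1, omits permutation testing and normalization entirely, and treats the core genes as the set members up to the peak of the running sum. The function name `enrichment_score` and the toy ranking are hypothetical.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Weighted running-sum enrichment score and the genes driving it.

    ranked: list of (gene, score) pairs sorted by score, highest first.
    gene_set: set of gene names forming one gene bundle.
    Returns (es, core_genes).
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_weight = sum(abs(score) ** p for (_, score), h in zip(ranked, hits) if h)
    if hit_weight == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, es, peak = 0.0, 0.0, -1
    for i, ((_, score), hit) in enumerate(zip(ranked, hits)):
        # Step up at set members (weighted by score), down at other genes.
        running += abs(score) ** p / hit_weight if hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es, peak = running, i

    if es >= 0:
        # Positive score: set members at or before the peak.
        core = [g for (g, _), h in zip(ranked[: peak + 1], hits) if h]
    else:
        # Negative score: set members after the most negative point.
        core = [g for (g, _), h in zip(ranked[peak + 1:], hits[peak + 1:]) if h]
    return es, core


# Hypothetical toy ranking, not data from the patent.
toy_ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
              ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
es, core = enrichment_score(toy_ranked, {"g1", "g2", "g5"})
```

In this toy ranking the running sum peaks after the second gene, so `g1` and `g2` are returned as core genes while `g5`, although a member of the set, is not. This is how two stresses can enrich the same bundle through different core genes.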

A gene bundle that is enriched under five or more stress conditions, and is therefore judged to be a cross-talk gene bundle candidate, needs a verification step to determine whether it is genuine at the gene level. For this verification, focusing on the core genes among the genes in the bundle, the proportion of core genes for each stress is defined as the contribution of the gene bundle to that stress response, and the proportion of core genes that recur across the stresses concerned is defined as the cohesion among the genes in the bundle (FIG. 12). When each candidate is characterized by whether these values are high or low, every cross-talk candidate gene bundle obtained from the selection falls into one of the following four cases. The first is a gene bundle with high cohesion and high contribution that is cross-linked to all of the stimuli. The second is a gene bundle whose overall cohesion is low but which shows very high contribution and cohesion for some of the stimuli judged to be cross-linked. The third is a pseudo cross-talk gene bundle with high cohesion but low contribution. The fourth is a false-positive gene bundle with low cohesion and low contribution.
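
The two measures and the four-way classification can be sketched as follows. The sketch rests on several assumptions that the source does not state: both measures use the total number of genes in the bundle as the denominator, a gene counts toward cohesion when it is a core gene in at least `min_shared` stresses, a single hypothetical cut-off of 0.5 separates "high" from "low", and the second case is approximated as low cohesion combined with a high contribution for at least one stress. Function names and the toy data are illustrative.

```python
def contribution_and_cohesion(bundle_genes, core_by_stress, min_shared=4):
    """Compute per-stress contribution and overall cohesion of one gene bundle.

    bundle_genes: iterable of gene names in the bundle.
    core_by_stress: dict mapping stress -> collection of core genes.
    """
    genes = set(bundle_genes)
    core_sets = {s: set(core) & genes for s, core in core_by_stress.items()}
    # Contribution: fraction of the bundle that is core for each stress.
    contribution = {s: len(core) / len(genes) for s, core in core_sets.items()}
    # Cohesion: fraction of the bundle that is core in >= min_shared stresses.
    shared = [g for g in genes
              if sum(g in core for core in core_sets.values()) >= min_shared]
    cohesion = len(shared) / len(genes)
    return contribution, cohesion


def classify_bundle(contribution, cohesion, threshold=0.5):
    """Assign one of the four cases. ASSUMPTION: threshold is hypothetical."""
    mean_contribution = sum(contribution.values()) / len(contribution)
    if cohesion >= threshold and mean_contribution >= threshold:
        return "real cross-talk"
    if cohesion < threshold and max(contribution.values()) >= threshold:
        return "partial cross-talk"
    if cohesion >= threshold:
        return "pseudo cross-talk"
    return "false positive"


# Hypothetical toy bundle: 8 of 10 genes are core genes in all five stresses.
toy_bundle = [f"g{i}" for i in range(10)]
toy_core = {s: set(toy_bundle[:8]) for s in ("cold", "salt", "heat", "drought", "uvb")}
toy_contribution, toy_cohesion = contribution_and_cohesion(toy_bundle, toy_core)
toy_label = classify_bundle(toy_contribution, toy_cohesion)
```

The cut-off and the handling of the second case are the parts most open to choice; a real analysis would need to set them from the data or from the criteria behind FIG. 12.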

Applying this method to the gene bundles judged to be enriched gene bundle candidates under all of five or more stress conditions gives the summary in FIG. 11A. Among them, 'DR55x57y' showed a high average contribution of 77% to its associated stimuli, and 70% of all the genes in the bundle were commonly involved in four or more environmental stress responses (FIG. 11B). It was therefore placed in the first category and, on this basis, judged to be a real cross-talk gene bundle.

<Example 6> Determining the relationships among genes in a gene bundle

Once the co-expressed gene bundle map and the integration map of enriched gene bundles have revealed the trends in expression change at the gene bundle level, interpretation at the level of individual genes is essential to understand how the genes in a given bundle actually relate to the associated stresses. To examine the actual relationships among genes in a co-expressed gene bundle, the expression profiles of the genes in 'DR55x57y', which was judged in Example 5 to be a bundle of genes highly important to the stress response, were analyzed with a reverse engineering algorithm. We applied 'ARACNe' (Nature Genetics 37:382-390), a reverse engineering algorithm developed by the Califano lab at Columbia University in the United States that reconstructs cellular networks from microarray data, to build a sub-regulatory network consisting only of the genes in the enriched gene bundle, and marked the core genes for each stress on it to obtain the result in FIG. 14. As FIG. 14 shows, the majority of the genes making up the sub-regulatory network correspond to core genes in each stress response in vivo.
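
The idea behind this kind of network reconstruction, scoring gene pairs by mutual information and then discarding the weakest edge of each fully connected gene triplet, can be illustrated with a toy sketch. This is not the ARACNe program: it substitutes a simple plug-in estimate on already-discretized expression levels for ARACNe's own estimator, applies no significance test or tolerance, and uses hypothetical function names and toy profiles.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(a, b):
    """Plug-in mutual information (in nats) between two discrete profiles."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )


def prune_network(profiles, mi_threshold=0.0):
    """Keep gene pairs whose MI exceeds a threshold, then prune triplets.

    profiles: dict mapping gene -> list of discretized expression levels.
    In every triplet whose three edges all pass the threshold, the edge with
    strictly the smallest MI is treated as indirect and removed.
    """
    genes = sorted(profiles)
    mi = {
        frozenset(pair): mutual_information(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(genes, 2)
    }
    edges = {edge for edge, value in mi.items() if value > mi_threshold}
    removed = set()
    for trio in combinations(genes, 3):
        trio_edges = [frozenset(pair) for pair in combinations(trio, 2)]
        if all(edge in edges for edge in trio_edges):
            weakest = min(trio_edges, key=mi.get)
            if all(mi[weakest] < mi[e] for e in trio_edges if e != weakest):
                removed.add(weakest)
    return {tuple(sorted(edge)) for edge in edges - removed}


# Hypothetical toy profiles: y tracks x closely, z tracks y closely,
# so the x-z association is the weakest of the three.
toy_profiles = {
    "x": [0, 0, 0, 0, 1, 1, 1, 1],
    "y": [0, 0, 0, 1, 1, 1, 1, 1],
    "z": [0, 0, 1, 1, 1, 1, 1, 1],
}
toy_edges = prune_network(toy_profiles)
```

In the toy data the x-z pair has the lowest mutual information of the three pairs, so it is dropped and only the x-y and y-z edges remain. Real expression data are continuous and noisy, which is why dedicated estimators and significance thresholds are used in practice.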

These results show that the genes obtained by the method for discovering cross-talk genes do perform the predicted functions, demonstrating that the data analysis method of the present invention can be used effectively.

FIG. 1 illustrates, in an easily understood form, the basic concept of the VxOrd algorithm, the formula that calculates the relative positions of genes on the basis of gene expression similarity.

FIG. 2 shows the co-expressed gene bundles constructed from the result of mega-clustering about 28,000 Arabidopsis genes with the data mining software VxInsight, using 1,436 data sets stored in the microarray database provided by TAIR.

FIG. 3 is a graph showing how to create four kinds of gene bundle catalogs with different origins. Depending on the position of the origin, the catalogs are designated UL, UR, DL, and DR.
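As an illustration of how one gene position could fall into four different bundles, the following sketch assigns a two-dimensional coordinate to one grid cell per catalog. It assumes a boundary size of 15 (the claims give 10 to 20), assumes which type code corresponds to which half-boundary shift of the origin, and assumes that a bundle name is the type code followed by grid indices; none of these details is confirmed by the text.

```python
import math

BOUNDARY = 15  # assumed default; the claims state a range of 10 to 20

# Hypothetical mapping of type codes to origin shifts, expressed as
# fractions of the boundary size along the X and Y axes.
ORIGIN_SHIFTS = {
    "UL": (0.0, 0.0),
    "UR": (0.5, 0.0),
    "DL": (0.0, 0.5),
    "DR": (0.5, 0.5),
}


def bundle_names(x, y, boundary=BOUNDARY):
    """Return the bundle a gene at (x, y) belongs to in each of the four catalogs."""
    names = {}
    for code, (fx, fy) in ORIGIN_SHIFTS.items():
        # Shift the origin, then find the grid cell containing the point.
        ix = math.floor((x - fx * boundary) / boundary)
        iy = math.floor((y - fy * boundary) / boundary)
        names[code] = f"{code}{ix}x{iy}y"
    return names
```

Because the four grids are offset from one another, two genes that sit on opposite sides of a cell border in one catalog can still share a bundle in another, which is the apparent motivation for building four catalogs rather than one.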

FIG. 4 shows the time-series data, which were designed so that changes in gene expression over time can be observed: the roots and stems of Arabidopsis were each treated with nine abiotic stresses, and microarray experiments were performed on the plants after set periods of time.

FIG. 5 shows the number of gene bundles with contrasting expression differences under each stress treatment, counted separately for roots and stems. It shows that the early gene expression changes of Arabidopsis in response to stress differ not only by stress type but also by tissue.

FIG. 6 visualizes the expression changes of gene bundles. From the data analyzed with VxInsight, an overlapped co-expression map was built by combining the x/y position coordinates given in each gene bundle name with the starting point implied by its type code, and the enriched gene bundles (enriched gene sets) were marked at their corresponding positions.

FIG. 7 shows gene bundles enriched in the A-group in blue and those enriched in the B-group in red. It shows that different gene bundles are enriched under different stresses, and it indicates the degree of similarity between responses through the overall trend in the distribution of the marked co-expressed gene bundles.

FIG. 8 compares in more detail the gene bundles (gene sets) enriched in each stress environment, as selected by GSEA. The horizontal axis lists the stress types and the vertical axis lists the names of the enriched gene bundles.

FIG. 9 shows, as images produced with the visualization program 'tree-view', the signal intensities in the gene expression profiles at the early time points (30 min, 1 h, 3 h) for the genes belonging to 'DL_38x19y', one of the gene bundles enriched in the A-group when root tissue is subjected to cold, genotoxic, heat, oxidative, salt, or wound stress, and to 'DL_55x55y', one of the gene bundles enriched in the B-group when root tissue is subjected to drought, genotoxic, osmotic, salt, or UVB stress.

FIG. 10 shows, using the graph visualization program 'Graphviz', all enriched gene bundles classified into three groups (bundles whose expression changes only upward, bundles whose expression changes only downward, and bundles whose expression changes upward or downward depending on the stress) and linked to the corresponding stresses.

FIG. 11(a) is a table listing the core genes for each stress that enriches a given gene bundle. Core genes are the genes that played a decisive role in the co-expressed gene bundle (co-expression gene set) receiving an enrichment score above a set threshold in GSEA and therefore being selected as an enriched gene bundle. FIG. 11(b) is a table of the GSEA results for DR55x57y, one of the gene bundles.

FIG. 12 defines and illustrates two measures. The proportion of core genes for each stress is defined as the contribution of the gene bundle to the stress response, and the proportion of genes that appear as core genes under more than one of the stresses is defined as the cohesion among the genes in the bundle.
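The two measures can be sketched as follows. The text does not state the denominators, so this sketch assumes that contribution is taken relative to the total number of genes in the bundle and that cohesion is taken relative to the number of distinct core genes in the bundle; both function names and both denominators are assumptions.

```python
from collections import Counter


def contribution(bundle_genes, core_by_stress):
    """Per-stress share of the bundle's genes that are core genes (assumed denominator)."""
    bundle = set(bundle_genes)
    if not bundle:
        raise ValueError("bundle_genes must not be empty")
    return {
        stress: len(bundle & set(core)) / len(bundle)
        for stress, core in core_by_stress.items()
    }


def cohesion(bundle_genes, core_by_stress):
    """Share of the bundle's core genes that are core under two or more stresses."""
    bundle = set(bundle_genes)
    counts = Counter(
        gene for core in core_by_stress.values() for gene in bundle & set(core)
    )
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= 2)
    return shared / len(counts)


# Hypothetical bundle of four genes with core genes under two stresses.
example_bundle = ["g1", "g2", "g3", "g4"]
example_core = {"cold": ["g1", "g2"], "salt": ["g2", "g3"]}
example_contribution = contribution(example_bundle, example_core)
example_cohesion = cohesion(example_bundle, example_core)
```

In this toy case each stress has two core genes out of four, and one of the three distinct core genes (g2) is shared by both stresses.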

FIG. 13 illustrates the division of gene bundles that are commonly and strongly enriched under different stress conditions into four types according to contribution and cohesion.

FIG. 14 shows the sub-regulatory network built, by applying the reverse engineering algorithm 'ARACNe', from only the genes belonging to the enriched gene bundle (enriched gene set), with the core genes for each stress marked.

FIG. 15 is a schematic diagram showing the flow of the present invention.

Claims

A first step of inputting genes obtained on the basis of gene expression similarity in microarray data, together with relative position data in which the genes are represented by two-dimensional coordinates, using a data mining program that determines the relative positions of genes;

A second step of determining a boundary size for classifying the two-dimensional coordinates of the gene input in the first step into a gene bundle;

A third step of creating a new gene bundle catalog having four different origins based on the boundary size of the gene bundle determined in the second step;

A fourth step of applying a statistical analysis method to the gene bundle catalog prepared in the third step, obtaining microarray data for normal cells (control group) and stimulated cells (experimental group), and comparing and analyzing the data to display gene expression patterns; and

A fifth step of visualizing, on the basis of the data analyzed in the fourth step and using a visualization program loaded into the microprocessor and memory of a computer, the data associated with all the gene bundles analyzed in the fourth step, and selecting the correlated gene bundles. A method for discovering large numbers of genes involved in regulating cell function from microarray data, comprising the above five steps.

The method of claim 1, wherein the microarray data are selected from the group consisting of the TAIR database, the Gene Expression Omnibus (GEO) of NCBI, and ArrayExpress of EBI.

The method of claim 1, wherein the data mining program is VxOrd or STN AnaVist.

The method of claim 1, wherein the size of the boundary of the second step is 10 to 20.

5. The method of claim 4, wherein the size of the boundary is 15.

The method of claim 1, wherein the origins are located at the 1/2 position of the X axis, the 1/2 position of the Y axis, and the 1/2 position of both the X and Y axes, based on the boundary size of each gene bundle.

The method of claim 1, wherein the statistical analysis method is gene set enrichment analysis (GSEA) or parametric analysis of gene set enrichment (PAGE).

The method of claim 1, wherein the visualization program is Graphviz or Pajek.

A fifth step of visualizing, on the basis of the data analyzed in the fourth step and using a visualization program loaded into the microprocessor and memory of a computer, the data associated with all the gene bundles analyzed in the fourth step, and selecting the correlated gene bundles;

A sixth step of characterizing, for the gene bundles selected in the fifth step, the distribution of core genes within each bundle across the stimuli in terms of cohesion and contribution, subdividing the gene bundles according to their degree of response to the stimuli into four types, namely ⅰ) gene bundles with high cohesion and high contribution that are correlated with all stimuli, ⅱ) gene bundles that show cohesion and contribution for only some stimuli, ⅲ) gene bundles with high cohesion but low contribution, and ⅳ) false positive gene bundles with low cohesion and low contribution, and then finding the true cross-talk gene bundles; and

A seventh step of reconstructing the intracellular network for the genes included in the true cross-talk gene bundles found in the sixth step, using a program loaded in a microprocessor and memory that implements a reverse engineering algorithm, and selecting the genes predicted to be correlated. A method for screening correlated genes comprising the above steps.

10. The method of claim 9, wherein the program implementing the reverse engineering algorithm is ARACNe.

An input unit for receiving genes obtained on the basis of gene expression similarity in microarray data, together with relative position data in which the genes are represented by two-dimensional coordinates;

A first calculating unit classifying the two-dimensional coordinates of the input gene into gene bundles according to a predetermined boundary size;

A second calculating unit creating a new gene bundle catalog having four different origins based on the boundary size of the gene bundle;

A third calculating unit that applies a statistical analysis method to the gene bundle catalog, obtains microarray data of normal cells (control group) and stimulated cells (experimental group), and compares the values to display gene expression patterns;

A fourth calculating unit which visualizes the gene bundle expression map by using a visualization program loaded in a microprocessor and a memory of a computer based on the data obtained by the third calculating unit and selects the correlated gene bundles; And

An output means for outputting the calculation results. An apparatus for mass discovery of correlated genes involved in regulating cell function, comprising the above units.

A third calculating unit that applies a statistical analysis method to the gene bundle catalog, obtains microarray data of normal cells (control group) and stimulated cells (experimental group), and compares the values to display gene expression patterns;

A fourth calculating unit which visualizes the gene bundle expression map by using a visualization program loaded in a microprocessor and a memory of a computer based on the data obtained by the third calculating unit and selects the correlated gene bundles;

A fifth calculating unit that classifies the gene bundles responding to multiple stimuli, on the basis of cohesion and contribution, into ⅰ) gene bundles with high cohesion and high contribution that are correlated with all stimuli, ⅱ) gene bundles that show cohesion and contribution for only some stimuli, ⅲ) gene bundles with high cohesion but low contribution, and ⅳ) false positive gene bundles with low cohesion and low contribution; and

A sixth calculating unit that reconstructs the intracellular network for the genes included in the true cross-talk gene bundles found by the fifth calculating unit and selects the genes predicted to be correlated within the cell. An apparatus for screening correlated genes comprising the above units.