KR20120014512A

KR20120014512A - Method and apparatus for generating model for prognostic prediction using snps

Info

Publication number: KR20120014512A
Application number: KR1020100076635A
Authority: KR
Inventors: 손대순; 이규상; 안태진; 김진석; 박창이; 손인석; 정신호
Original assignee: 삼성전자주식회사
Priority date: 2010-08-09
Filing date: 2010-08-09
Publication date: 2012-02-17

Abstract

PURPOSE: A method and apparatus for generating a prognostic prediction model using SNPs (single nucleotide polymorphisms) are provided, which generate one prognostic prediction model from SNPs that follow the dominant, recessive, and additive models. CONSTITUTION: An SNP determination unit (1102) tests each SNP obtained from subjects with genetic models that statistically test the genetic characteristics of an SNP, and determines the SNPs that significantly classify the subjects. A genetic model determination unit (1103) determines the most significant genetic model for each determined SNP. A standardization unit (120) standardizes the coded values of the SNP genotypes based on the determined genetic model. A model generation unit (130) generates a prognostic prediction model that represents the correlation between genotype and prognosis by analyzing the standardized results with a statistical prediction algorithm.

Description

Method and apparatus for generating model for prognostic prediction using SNPs

A method and apparatus are provided for generating a model for prognostic prediction using subjects' genotype data for SNPs.

As technology for analyzing an individual's genes has advanced since the discovery of DNA, research on analyzing the genotypes of mutations and identifying their polymorphisms has advanced with it. Among the types of polymorphism, the one found most often in the human genome is the single nucleotide polymorphism (SNP). Human genetic factors are associated with all human diseases, and resistance to disease, susceptibility to disease, and disease severity differ from person to person according to those factors. In particular, SNPs correlate with the manifestation of human disease: studies have shown that the base sequences at specific SNP positions in a patient group with a particular disease differ from the base sequences at the same positions in a control or normal group. Diagnosis, prescription, and prevention of disease are therefore possible based on the base differences identified through DNA sequencing. Recently, research has been under way to build models that accurately predict disease prognosis, such as an individual's susceptibility to a disease, from SNP-related base sequences.

The technical problem to be solved by at least one embodiment of the present invention is to provide a method and apparatus for generating a prognostic prediction model using SNPs. The technical problem to be solved by this embodiment is not limited to the one described above, and other technical problems may exist.

According to one aspect, a method of generating a prognostic prediction model includes: testing each of the SNPs obtained from subjects with a plurality of genetic models that statistically test the genetic characteristics of an SNP, thereby determining, among those SNPs, the SNPs that significantly classify the subjects into a patient group and a control group; determining, for each of the determined SNPs, the most significant genetic model among the genetic models used in the test; standardizing the coded values of the genotypes of each of the determined SNPs based on the determined genetic model; and generating a prognostic prediction model that represents the correlation between the genotypes of the determined SNPs and the prognosis of the subjects by analyzing the standardized results with a statistical prediction algorithm.

According to another aspect, a computer-readable recording medium is provided on which a program for executing the above method of generating a prognostic prediction model on a computer is recorded.

According to still another aspect, an apparatus for generating a prognostic prediction model includes: an SNP determination unit that tests each of the SNPs obtained from subjects with a plurality of genetic models that statistically test the genetic characteristics of an SNP, and determines, among those SNPs, the SNPs that significantly classify the subjects into a patient group and a control group; a genetic model determination unit that determines, for each of the determined SNPs, the most significant genetic model among the genetic models used in the test; a standardization unit that standardizes the coded values of the genotypes of each of the determined SNPs based on the determined genetic model; and a model generation unit that generates a prognostic prediction model representing the correlation between the genotypes of the determined SNPs and the prognosis of the subjects by analyzing the standardized results with a statistical prediction algorithm.

According to the above, one prognostic prediction model is generated using all of the SNPs that correspond to a plurality of genetic models, such as the dominant model, the recessive model, and the additive model. This makes it possible to generate a more accurate prognostic prediction model than one built from only the subset of SNPs that correspond to a single genetic model, for example the dominant model. In other words, it is possible to generate one prognostic prediction model that reflects all of the different genetic characteristics of the SNPs: dominant, recessive, and additive.

FIG. 1 is a block diagram of an apparatus for generating a prognostic prediction model according to an embodiment of the present invention.
FIG. 2 is a table showing genotype data for the subjects' SNPs received by a data receiving unit according to an embodiment of the present invention.
FIG. 3 is a table showing data entered according to genotype in an SNP analysis unit according to an embodiment of the present invention.
FIG. 4 is a table showing the result of standardizing the coded genotype values in a standardization unit according to an embodiment of the present invention.
FIGS. 5A and 5B are tables showing results simulated in the apparatus for generating a prognostic prediction model according to an embodiment of the present invention.
FIG. 6 is a flowchart of a method of generating a prognostic prediction model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 is a block diagram of a prognostic prediction model generating apparatus (1) according to an embodiment of the present invention. Referring to FIG. 1, the apparatus (1) of this embodiment includes a processor (10), a data receiving unit (20), a storage unit (30), and an output unit (40). The processor (10) includes an SNP analysis unit (110), a standardization unit (120), a model generation unit (130), and a prediction unit (140). The SNP analysis unit (110) in turn includes a test unit (1101), an SNP determination unit (1102), and a genetic model determination unit (1103). The processor (10) may be implemented as an array of logic gates, or as a combination of a general-purpose microprocessor and a memory storing a program executable on that microprocessor. Those of ordinary skill in the art will understand that it may also be implemented in other forms of hardware. To avoid obscuring the features of this embodiment, this specification describes only the hardware components related to it. Those of ordinary skill in the art will nevertheless understand that other general-purpose hardware components may be included in addition to those shown in FIG. 1.

Referring to FIG. 1, the prognostic prediction model generating apparatus (1) of this embodiment generates a prognostic prediction model using genetic information produced by a gene analysis device (not shown), which analyzes subjects' gene samples that have reacted on a DNA chip such as a microarray. The genetic information used by the apparatus (1) is genotype data for a plurality of SNPs of a plurality of subjects. Those of ordinary skill in the art will therefore understand that any gene analysis device capable of producing such genetic information may be used.

An SNP (single nucleotide polymorphism) is a genetic change or variation in which the base (A, T, G, or C) at a given position in a DNA sequence differs; it is a form of single-nucleotide variation between individuals of the same species. In humans, SNPs are generally known to occur about once every 1,000 bp (base pairs).

When an SNP occurs in the base sequence of a coding region, a defective or mutated protein may be expressed and cause disease, or the SNP may have no effect at all. SNPs may also occur in the base sequence of a non-coding region. SNPs are thus genetic factors associated with human disease, and differences in SNPs cause resistance to disease, susceptibility to disease, and disease severity to differ from person to person. Disease can therefore be diagnosed, prescribed for, and prevented through the correlation between SNPs and, for example, disease susceptibility. To support such diagnosis, prescription, and prevention, the prognostic prediction model generating apparatus (1) of this embodiment uses SNPs to generate a prognostic prediction model that accurately predicts prognosis, such as an individual's susceptibility to a disease.

The data receiving unit (20) receives genotype data for a plurality of SNPs of the subjects from the gene analysis device (not shown). FIG. 1 assumes that the gene analysis device and the prognostic prediction model generating apparatus (1) of this embodiment are separate devices. The embodiment is not limited to this, however, and the apparatus (1) may also be implemented as a device built into the gene analysis device. The description below assumes, as in FIG. 1, that the two are separate devices.

The processor (10) includes the SNP analysis unit (110), the standardization unit (120), the model generation unit (130), and the prediction unit (140).

The SNP analysis unit (110) analyzes the genotype data for the SNPs obtained from the subjects, selects the SNPs that are significant by a given criterion, and determines the genetic model that best fits each selected SNP. The genotype data are received through the data receiving unit (20).

FIG. 2 is a table showing genotype data for the subjects' SNPs received by the data receiving unit (20) according to an embodiment of the present invention.

Referring to FIG. 2, genotype data are shown for 300 subjects at SNP positions No. 1 through No. 100,000. Because subjects differ in their responsiveness to factors such as disease susceptibility, appearance, and likelihood of longevity, they cannot show perfectly identical genotypes at every SNP. A prognostic prediction model can therefore be built by analyzing and modeling the correlation between a specific factor and the SNPs that indicate a normal or abnormal state for that factor.

Referring again to FIG. 1, the SNP analysis unit (110) analyzes the SNPs using the Max Test method. The process by which the SNP analysis unit (110) analyzes the SNPs with the Max Test method is described below; those of ordinary skill in the art will understand any details of the Max Test method that are omitted here.

The test unit (1101) tests the significance of each of the subjects' SNPs using each of the genetic models. The genetic models are models that statistically test the genetic characteristics of an SNP, and they include the dominant model, the recessive model, and the additive model. The significances tested include whether each SNP significantly classifies subjects into a patient group and a control group for a specific condition, and the significance of each genetic model used to verify the genetic characteristics of each SNP.

The test unit (1101) tests each SNP with the genetic models, based on the Max Test method, as follows.

As described for the table of FIG. 2, tens of thousands or hundreds of thousands of SNPs may be obtained from the subjects. Among them, however, there may be SNPs that do not act as genetic factors for a disease or similar condition. That is, some of the subjects' SNPs may have no relation at all to the prognosis for a specific factor, or only a negligible influence on it. Such SNPs need not be reflected in the prognostic prediction model. They can be screened out by arranging the genotype data for each SNP as in Table 1 below and analyzing it statistically.

SNP No. 1      AA         AB         BB         Total
Response       x₀         x₁         x₂         x
No Response    n₀ - x₀    n₁ - x₁    n₂ - x₂    n - x
Total          n₀         n₁         n₂         n

In Table 1, Response and No Response denote, for the genotypes of SNP No. 1, the cases in which there is and is not a response to a specific condition. More precisely, the Response and No Response rows classify the genotypes of SNP No. 1 according to whether the subject belongs to the patient group or the control group for the specific condition, following an ordinary case-control study. x₀, x₁, and x₂ are the numbers of AA, AB, and BB genotypes, respectively, among the genotype data of subjects in the patient group (Response). n₀, n₁, and n₂ are the total numbers of AA, AB, and BB genotypes, respectively, among the genotype data of all subjects. The numbers of AA, AB, and BB genotypes among the genotype data of subjects in the control group (No Response) are therefore n₀ - x₀, n₁ - x₁, and n₂ - x₂, respectively.
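The tally behind Table 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `tally_genotypes`, the input layout (one genotype string and one case flag per subject), and the sample data are all assumptions made for the example.

```python
from collections import Counter

GENOTYPES = ("AA", "AB", "BB")

def tally_genotypes(genotypes, is_case):
    """Return (x, n) for one SNP.

    x[i]: number of patient-group (Response) subjects with genotype i.
    n[i]: number of all subjects with genotype i.
    Control-group (No Response) counts are n[i] - x[i].
    """
    case_counts = Counter(g for g, c in zip(genotypes, is_case) if c)
    total_counts = Counter(genotypes)
    x = [case_counts[g] for g in GENOTYPES]
    n = [total_counts[g] for g in GENOTYPES]
    return x, n

# Hypothetical data for eight subjects at one SNP.
genotypes = ["AA", "AB", "BB", "AB", "AA", "BB", "AB", "AA"]
is_case = [False, True, True, True, False, True, False, False]

x, n = tally_genotypes(genotypes, is_case)
controls = [ni - xi for xi, ni in zip(x, n)]
print(x, n, controls)
```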

However, to test based on the Max Test method, the test unit (1101) uses a slightly modified form of Table 1.

In more detail, the genotypes of an SNP are first coded under different rules for each genetic model: dominant, recessive, and additive. When significance is tested with the dominant model, the SNP genotypes AA, AB, and BB are coded as 0, 1, and 1, respectively, so AB and BB take the same value. When significance is tested with the recessive model, AA, AB, and BB are coded as 0, 0, and 1, respectively, so AA and AB take the same value. When significance is tested with the additive model, AA, AB, and BB are coded as 0, 1, and 2, respectively, so every genotype takes a different value. Table 2 below summarizes the coding.

Genotype of SNP    Genetic model    Code
(AA, AB, BB)       Dominant         (0, 1, 1)
(AA, AB, BB)       Recessive        (0, 0, 1)
(AA, AB, BB)       Additive         (0, 1, 2)

The dominant, recessive, and additive genetic models are obtained by modifying Table 1 with the coding shown in Table 2.
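The coding of Table 2 maps directly onto a lookup table. The sketch below assumes genotypes are given as the strings "AA", "AB", and "BB"; the names `CODING` and `encode` are hypothetical and chosen only for illustration.

```python
# Genotype codes per genetic model, as listed in Table 2.
CODING = {
    "dominant":  {"AA": 0, "AB": 1, "BB": 1},
    "recessive": {"AA": 0, "AB": 0, "BB": 1},
    "additive":  {"AA": 0, "AB": 1, "BB": 2},
}

def encode(genotypes, model):
    """Code a sequence of genotype strings under the given genetic model."""
    table = CODING[model]
    return [table[g] for g in genotypes]

subjects = ["AA", "AB", "BB", "AB"]
for model in CODING:
    print(model, encode(subjects, model))
```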

Taking SNP No. 1 as an example, the dominant, recessive, and additive models applied to SNP No. 1 are shown in Tables 3, 4, and 5 below, respectively.

SNP No. 1      0     1
Response       x₀    x₁
No Response    y₀    y₁

Table 3 is the table used to test the significance of the genotypes of SNP No. 1 with the dominant model. As described above, the dominant model codes the AB and BB genotypes as 1 and the AA genotype as 0. As in Table 1, x₀ and x₁ for SNP No. 1 are the numbers of patient-group (Response) subjects for the specific condition whose coded value is 0 and 1, respectively. Likewise, y₀ and y₁ for SNP No. 1 are the numbers of control-group (No Response) subjects for the specific condition whose coded value is 0 and 1, respectively.

Table 4 is the table used to test the significance of the genotypes of SNP No. 1 with the recessive model. As described above, the recessive model codes the BB genotype as 1 and the AA and AB genotypes as 0. x₀ and x₁ for SNP No. 1 are the numbers of patient-group (Response) subjects for the specific condition whose coded value is 0 and 1, respectively. y₀ and y₁ for SNP No. 1 are the numbers of control-group (No Response) subjects for the specific condition whose coded value is 0 and 1, respectively.

SNP No. 1      0     1     2
Response       x₀    x₁    x₂
No Response    y₀    y₁    y₂

Table 5 is the table used to test the significance of the genotypes of SNP No. 1 with the additive model. As described above, the additive model codes the AA genotype as 0, the AB genotype as 1, and the BB genotype as 2. x₀, x₁, and x₂ for SNP No. 1 are the numbers of patient-group (Response) subjects for the specific condition whose coded value is 0, 1, and 2, respectively. y₀, y₁, and y₂ for SNP No. 1 are the numbers of control-group (No Response) subjects for the specific condition whose coded value is 0, 1, and 2, respectively.

For the example SNP No. 1, the test unit (1101) tests significance with the dominant model as in Table 3, with the recessive model as in Table 4, and with the additive model as in Table 5. That is, each SNP is tested separately with the three genetic models.
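How Table 1 collapses into Tables 3 to 5 can be sketched as follows. The function name `collapse_counts` and the counts are hypothetical; the only thing taken from the text is that genotypes sharing a code are merged into one column.

```python
def collapse_counts(x, n, codes):
    """Merge per-genotype counts for (AA, AB, BB) into per-code counts.

    x: patient-group (Response) counts, n: total counts, both per genotype.
    codes: the code assigned to each genotype, e.g. (0, 1, 1) for dominant.
    Returns (cases, controls), each indexed by ascending code value.
    """
    levels = sorted(set(codes))
    cases = [0] * len(levels)
    controls = [0] * len(levels)
    for xi, ni, code in zip(x, n, codes):
        idx = levels.index(code)
        cases[idx] += xi
        controls[idx] += ni - xi
    return cases, controls

# Hypothetical Table 1 counts for one SNP.
x, n = [10, 30, 20], [50, 60, 30]

dominant = collapse_counts(x, n, (0, 1, 1))   # layout of Table 3
recessive = collapse_counts(x, n, (0, 0, 1))  # layout of Table 4
additive = collapse_counts(x, n, (0, 1, 2))   # layout of Table 5
print(dominant, recessive, additive)
```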

The test unit (1101) tests the data tabulated as in Tables 3 to 5 using a significance test such as the chi-squared test.

The chi-squared test is a widely known statistical test. Applied to Table 3, the dominant model above, it yields a p-value indicating whether Response and No Response are significantly separated at SNP No. 1. A p-value can be obtained in the same way for Table 4, the recessive model, and for Table 5, the additive model.

Here, the p-value indicates how significant the result of the chi-squared test is. The user may set in advance a threshold that serves as the criterion for judging significance, and significance is tested by comparing each p-value with this threshold.
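A Pearson chi-squared test on a table shaped like Tables 3 to 5 can be sketched with the standard library alone. This is an illustrative assumption about how the test might be computed, not the patent's code: `chi_square_2xk` is a hypothetical name, no continuity correction is applied, and the p-value uses the closed-form chi-squared tail for 1 and 2 degrees of freedom, so only two- and three-column tables with no empty column are supported.

```python
import math

def chi_square_2xk(cases, controls):
    """Pearson chi-squared statistic and p-value for a 2 x k table (k = 2 or 3)."""
    k = len(cases)
    if k not in (2, 3) or len(controls) != k:
        raise ValueError("expected two rows with 2 or 3 columns each")
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    stat = 0.0
    for a, b in zip(cases, controls):
        column = a + b
        for observed, row in ((a, n_cases), (b, n_controls)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    # Upper tail of the chi-squared distribution: df = 1 for k = 2, df = 2 for k = 3.
    p = math.erfc(math.sqrt(stat / 2)) if k == 2 else math.exp(-stat / 2)
    return stat, p

# Hypothetical dominant-model counts (Table 3 layout).
stat, p = chi_square_2xk([10, 50], [40, 40])
threshold = 0.001  # hypothetical user-defined threshold
print(round(stat, 4), p < threshold)
```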

The test unit (1101) of this embodiment may perform the test using the Cochran-Armitage trend test, which is one of the chi-squared test methods. Because the Cochran-Armitage trend test is well known to those of ordinary skill in the art, a detailed description is omitted.
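Since the text leaves the Cochran-Armitage trend test to the reader, here is a minimal sketch of one common textbook form, in which the genotype codes of Table 2 serve as the trend weights. The function name, the counts, and the choice of the variance form with N (rather than N - 1) in the denominator are assumptions; the patent does not specify which variant it uses.

```python
import math

def cochran_armitage(x, n, weights):
    """Cochran-Armitage trend test for one SNP.

    x: patient-group counts for (AA, AB, BB); n: total counts for (AA, AB, BB).
    weights: genotype codes, e.g. (0, 1, 2) for the additive model.
    Returns (z, p) with a two-sided p-value from the standard normal.
    """
    total = sum(n)
    p_bar = sum(x) / total  # overall proportion of patients
    u = sum(w * (xi - ni * p_bar) for w, xi, ni in zip(weights, x, n))
    spread = sum(ni * w * w for w, ni in zip(weights, n)) \
        - sum(ni * w for w, ni in zip(weights, n)) ** 2 / total
    z = u / math.sqrt(p_bar * (1 - p_bar) * spread)
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical Table 1 counts for one SNP.
x, n = [10, 30, 20], [50, 60, 30]
z_dom, p_dom = cochran_armitage(x, n, (0, 1, 1))
z_rec, p_rec = cochran_armitage(x, n, (0, 0, 1))
z_add, p_add = cochran_armitage(x, n, (0, 1, 2))
print(p_dom, p_rec, p_add)
```

With binary weights (the dominant and recessive codes), the squared statistic coincides with the uncorrected Pearson chi-squared statistic on the corresponding 2 x 2 table, which is why a single routine can serve all three models.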

The test unit (1101) applies the same test to the other SNPs. That is, genetic models like those of Tables 3 to 5 are applied in the same way to the other SNPs, such as SNP No. 2 through No. 100,000 in FIG. 2.

The three p-values obtained by testing each SNP with the three genetic models in the test unit (1101) are sent to the SNP determination unit (1102).

The SNP determination unit (1102) determines which of the subjects' SNPs significantly classify the subjects into a patient group and a control group. In more detail, it determines the significant SNPs by comparing the p-values obtained by testing each SNP with each genetic model against the threshold set in advance by the user.

For example, the smallest of the three p-values obtained for SNP No. 1 is compared with the threshold; if the threshold is greater than that smallest p-value, SNP No. 1 is determined to be an SNP that significantly classifies the subjects into a patient group and a control group. The p-values of the other SNPs are compared with the threshold in the same way, and the SNPs with a p-value below the threshold are determined to be SNPs that significantly classify the subjects into a patient group and a control group.
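The selection rule just described, keeping an SNP when its smallest p-value falls below the threshold, can be sketched in a few lines. The SNP names, p-values, and threshold below are invented for illustration, and `select_significant` is a hypothetical helper name.

```python
def select_significant(p_values, threshold):
    """Return the SNPs whose smallest per-model p-value is below the threshold."""
    return [snp for snp, by_model in p_values.items()
            if min(by_model.values()) < threshold]

# Hypothetical p-values from testing each SNP with the three genetic models.
p_values = {
    "SNP1": {"dominant": 0.0004, "recessive": 0.03, "additive": 0.002},
    "SNP2": {"dominant": 0.4, "recessive": 0.6, "additive": 0.5},
    "SNP3": {"dominant": 0.2, "recessive": 0.00001, "additive": 0.01},
}

selected = select_significant(p_values, threshold=0.001)
print(selected)
```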

By selecting only the significant SNPs in this way, a more accurate prognostic prediction model can be generated from only those SNPs that correlate strongly with the prognosis.

The genetic model determination unit (1103) determines, for each of the determined SNPs, the most significant genetic model among the genetic models used in the test. That is, the genetic model determination unit (1103) determines a genetic model only for the SNPs determined by the SNP determination unit (1102).

In more detail, among the p-value for the dominant model, the p-value for the recessive model, and the p-value for the additive model obtained for each determined SNP, the genetic model with the lowest p-value is determined to be the genetic model of that SNP, because a lower p-value indicates greater significance. The genetic model determined for an SNP reflects the genetic characteristics of that SNP; that is, the SNP has a dominant, recessive, or additive genetic characteristic.

For example, suppose SNP No. 1 and SNP No. 2 are among the SNPs determined by the SNP determination unit (1102). If the p-value of SNP No. 1 for the dominant model is the lowest of its p-values for the genetic models, the genetic model determination unit (1103) determines that SNP No. 1 corresponds to the dominant model. If the p-value of SNP No. 2 for the recessive model is the lowest of its p-values, the unit determines that SNP No. 2 corresponds to the recessive model. In this way, the genetic model determination unit (1103) determines a genetic model for every SNP determined by the SNP determination unit (1102).
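Choosing the genetic model with the lowest p-value for each selected SNP can be sketched as follows. The p-values are invented so that SNP1 comes out dominant and SNP2 recessive, mirroring the example in the text; `best_model` is a hypothetical helper name, and ties are broken by dictionary order, which the text does not address.

```python
def best_model(by_model):
    """Return the genetic model with the lowest p-value for one SNP."""
    return min(by_model, key=by_model.get)

# Hypothetical p-values for SNPs already selected as significant.
selected_p_values = {
    "SNP1": {"dominant": 0.0004, "recessive": 0.03, "additive": 0.002},
    "SNP2": {"dominant": 0.02, "recessive": 0.00003, "additive": 0.001},
}

models = {snp: best_model(ps) for snp, ps in selected_p_values.items()}
print(models)
```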

FIG. 3 is a table showing the results of the analysis by the SNP analysis unit (110) according to an embodiment of the present invention. Referring to FIG. 3, after the test unit (1101) has tested the significance of each SNP, the table shows the SNPs determined by the SNP determination unit (1102) and the genetic model determined for each of them by the genetic model determination unit (1103). Each genotype of a determined SNP is shown as a value coded according to the determined genetic model. The coded values are based on Table 2 above.

As seen in Table 2 above, the conditions for encoding genotypes differ from one genetic model to another. For an SNP corresponding to the dominant model, only AA has a value of 0, while AB and BB have a value of 1. For an SNP corresponding to the recessive model, AA and AB have a value of 0, and only BB has a value of 1. Furthermore, for an SNP corresponding to the additive model, BB has a value of 2, which does not occur in the other genetic models.
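As a minimal sketch, the three encoding schemes can be written as a lookup table. The values follow the description of Table 2 above and the additive coding (0, 1, 2) used in the worked example later in the text; the names `ENCODING` and `encode` are hypothetical.

```python
# Genotype codes per genetic model, as described for Table 2.
ENCODING = {
    "dominant":  {"AA": 0, "AB": 1, "BB": 1},
    "recessive": {"AA": 0, "AB": 0, "BB": 1},
    "additive":  {"AA": 0, "AB": 1, "BB": 2},
}


def encode(genotype: str, model: str) -> int:
    """Return the code of one genotype under the given genetic model."""
    return ENCODING[model][genotype]


# Show how the same genotypes receive different codes under each model.
for model in ENCODING:
    print(model, [encode(g, model) for g in ("AA", "AB", "BB")])
```

The same genotype AB is coded 1 under the dominant and additive models but 0 under the recessive model, which is why raw codes from different SNPs are not directly comparable.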

That is, the encoding conditions differ from one determined SNP to another according to the determined genetic model. The values 0, 1, and 2 are therefore meaningful only within a single SNP. Across the determined SNPs as a whole, the criteria differ from SNP to SNP, so the values are not comparable and cannot be used as they are to generate a prognostic prediction model. A method is therefore needed that applies the same criterion to every SNP so that all values are expressed uniformly on a common basis.

Conventionally, this problem could not be solved, so a prognostic prediction model was generated by selecting only the SNPs corresponding to a single genetic model. For example, a prognostic prediction model was generated using only the genotypes of SNPs corresponding to the dominant model. However, the prognosis for a specific condition is affected not only by SNPs corresponding to the dominant model but also by SNPs corresponding to the other genetic models, namely the recessive model and the additive model. A prognostic prediction model generated from only one genetic model therefore has reduced accuracy.

Accordingly, the prognostic prediction model generation apparatus 1 of the present embodiment standardizes, for each SNP, the encoded values of the genotypes of that SNP, so that all values are expressed uniformly on a common basis. As a result, the prognostic prediction model of the present embodiment does not reflect the result of using only one genetic model; it is generated so as to reflect the results of using all of the determined genetic models, that is, the dominant model, the recessive model, and the additive model.

The standardization unit 120 standardizes the encoded values of the genotypes of each of the determined SNPs based on the determined genetic model. That is, based on the types of the determined genetic models, the standardization unit 120 converts the values, which were encoded under conditions that differ from SNP to SNP, into values that share the same criterion across all of the determined SNPs.

The standardization unit 120 uses a generally well-known standardization method such as Equation 1 below. Equation 1 is applied to each of the determined SNPs in order to standardize it.

[Equation 1]
$$z_i = \frac{Z_i - \mu}{\sigma}$$

Referring to Equation 1, z_i is the standardized value, Z_i is the encoded value of the current genotype, μ is the mean of the encoded values of the current genotypes, and σ is the standard deviation of the encoded values of the current genotypes.

More specifically, when the genotypes of an SNP are expressed as (AA, AB, BB), the genotypes are encoded as the values (Z₁, Z₂, Z₃), and the probabilities that the respective genotypes occur at this SNP are (f₁, f₂, f₃). The mean μ of the encoded values of the current genotypes is then calculated using

$$\mu = E(Z) = \sum_{i=1}^{3} Z_i f_i$$

and the standard deviation σ of the encoded values of the current genotypes is calculated using

$$\sigma = \sqrt{E(Z^2) - \mu^2} = \sqrt{\sum_{i=1}^{3} Z_i^2 f_i - \mu^2}$$

Using these equations, the standardized values (z₁, z₂, z₃) of the values (Z₁, Z₂, Z₃) into which (AA, AB, BB) are encoded can be obtained.

For example, when a determined SNP corresponds to the additive model, (AA, AB, BB) of this SNP is encoded as (0, 1, 2). Assume that the probabilities that the respective genotypes occur are (0.25, 0.5, 0.25). The mean μ of the encoded values of the current genotypes is then μ = E(Z) = 0×0.25 + 1×0.5 + 2×0.25 = 1, and E(Z²) = 0²×0.25 + 1²×0.5 + 2²×0.25 = 1.5. The standard deviation σ of the encoded values of the current genotypes is therefore calculated as

$$\sigma = \sqrt{E(Z^2) - \mu^2} = \sqrt{1.5 - 1^2} = \sqrt{0.5} \approx 0.707$$

Substituting these values into Equation 1, the encoded values (Z₁, Z₂, Z₃) = (0, 1, 2) of this SNP are standardized to

$$(z_1, z_2, z_3) = \left(\frac{0 - 1}{\sqrt{0.5}},\ \frac{1 - 1}{\sqrt{0.5}},\ \frac{2 - 1}{\sqrt{0.5}}\right) \approx (-1.414,\ 0,\ 1.414)$$
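The worked example for the additive coding (0, 1, 2) with genotype probabilities (0.25, 0.5, 0.25) can be checked numerically. The sketch below assumes the standard identity σ = √(E(Z²) − μ²) for the standard deviation; the function name `standardize` is hypothetical.

```python
import math


def standardize(codes, freqs):
    """Standardize genotype codes using frequency-weighted mean and std."""
    # Mean of the encoded values: mu = sum(Z_i * f_i).
    mu = sum(z * f for z, f in zip(codes, freqs))
    # Second moment: E(Z^2) = sum(Z_i^2 * f_i).
    second_moment = sum(z * z * f for z, f in zip(codes, freqs))
    # Standard deviation: sigma = sqrt(E(Z^2) - mu^2).
    sigma = math.sqrt(second_moment - mu ** 2)
    # Standardized values: z_i = (Z_i - mu) / sigma.
    return [(z - mu) / sigma for z in codes]


# Additive model, codes (0, 1, 2), probabilities (0.25, 0.5, 0.25).
z = standardize([0, 1, 2], [0.25, 0.5, 0.25])
print([round(v, 3) for v in z])
```

Note that this sketch does not guard against σ = 0, which would occur if every subject carried a genotype with the same code; a practical implementation would need to handle that case separately.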

The standardization unit 120 standardizes the encoded values of the genotypes of the determined SNPs using Equation 1 as described above. Those of ordinary skill in the art will understand that the standardization unit 120 may also standardize the encoded values of the genotypes using a standardization method other than the one described above.

FIG. 4 is a table showing the result of standardizing the encoded values of the genotypes in the standardization unit 120 according to an embodiment of the present invention. Referring to FIG. 4, all of the encoded genotype values in the table of FIG. 3 have been standardized by the standardization unit 120 using Equation 1. Values that were encoded under different conditions according to their genetic models are thereby standardized on a common basis and expressed as values with respect to the same criterion.

Referring again to FIG. 1, the model generation unit 130 generates a prognostic prediction model representing the correlation between the genotypes of the determined SNPs and the prognosis of the subjects by analyzing the standardized results with a statistical prediction algorithm. Here, the prediction algorithm includes at least one of a classification algorithm, a machine learning algorithm, and a regression analysis algorithm.

Classification algorithms include the G-LASSO (gradient lasso) algorithm and the LASSO algorithm, and machine learning algorithms include the support vector machine (SVM). Those of ordinary skill in the art will understand that the model generation unit 130 of the present embodiment may use any algorithm capable of generating a prediction model statistically or mathematically from the standardized results.

The model generation unit 130 analyzes the standardized values statistically or mathematically using a prediction algorithm. As a result of the analysis, the model generation unit 130 generates a prognostic prediction model expressed as an equation, such as Equation 2.

In a prognostic prediction model such as Equation 2, SNP 1, SNP 2, SNP 3, (omitted), SNP 99,876, and so on correspond to the determined SNPs, and each α_i is a coefficient representing, for the corresponding determined SNP, its correlation with the prognosis for a specific condition.

However, the prediction algorithm used by the model generation unit 130 to generate the prognostic prediction model is not limited to any one algorithm. Those of ordinary skill in the art will therefore understand that prognostic prediction models of various forms other than that of Equation 2 may be generated.
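To make the model generation step concrete, the sketch below fits one coefficient per SNP from standardized genotype values. It is only an illustration under stated assumptions: it uses plain logistic regression trained by gradient descent as a stand-in for the "regression analysis algorithm" category, not the G-LASSO, LASSO, or SVM algorithms named in the text, and the toy data, the function name `fit_logistic`, and the hyperparameters are all hypothetical.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features  # one coefficient per SNP
    b = 0.0                 # intercept
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            s = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-s))  # predicted probability of a case
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data, invented for illustration: rows are subjects, columns are
# standardized values of two SNPs; y is 1 for patient, 0 for control.
X = [[-1.414, 0.5], [-1.414, -0.5], [0.0, 0.5], [0.0, -0.5],
     [1.414, 0.5], [1.414, -0.5], [0.0, -0.5], [0.0, 0.5]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

w, b = fit_logistic(X, y)
print("coefficients:", w, "intercept:", b)
```

In this toy data set only the first SNP is associated with the outcome, so its fitted coefficient comes out positive while the second stays near zero. A sparsity-inducing method such as LASSO would serve a similar purpose at the scale of many thousands of SNPs, but that is not what this sketch implements.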

A prognostic prediction model generated in this way is not generated using only one genetic model; it reflects the results of using several genetic models. Its accuracy is therefore much higher than that of a prognostic prediction model generated using only one genetic model.

Referring again to FIG. 1, the prediction unit 140 predicts the prognosis for a subject using the generated prognostic prediction model, that is, a prognostic prediction model such as Equation 2. More specifically, when a subject wishes to have the prognosis for a specific condition predicted, the genotypes of the subject for the SNPs used in the prognostic prediction model such as Equation 2 are obtained from a gene analysis apparatus (not shown). Since SNP 1, SNP 2, SNP 4, (omitted), and SNP 99,876 are used in the prognostic prediction model such as Equation 2, the genotypes of the subject for SNP 1, SNP 2, SNP 4, (omitted), and SNP 99,876 are obtained. The prediction unit 140 then substitutes the standardized values corresponding to each of the subject's genotypes into the prognostic prediction model such as Equation 2.

FIGS. 5A and 5B are tables showing results obtained with the prognostic prediction model generation apparatus 1 according to an embodiment of the present invention. The results of FIGS. 5A and 5B were obtained for 1,000 SNPs of 250 subjects, assuming three prognostic SNPs that exhibit the dominant, recessive, and additive genetic characteristics, respectively. Resampling was performed 50 times.

The prediction unit 140 predicts the prognosis by substituting the standardized values corresponding to each of the subject's genotypes into the generated prognostic prediction model and comparing the calculated result with a predetermined threshold. The result calculated by substitution into the prognostic prediction model such as Equation 2 is Y in Equation 2. That is, when the value calculated by substitution into the prognostic prediction model (Y in Equation 2) exceeds the predetermined threshold, the prognosis is predicted to be a high likelihood of developing the specific condition, for example cancer; when the value does not exceed the threshold, the prognosis is predicted to be a low likelihood of developing cancer.
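The thresholding step can be sketched as follows. Because Equation 2 itself is not reproduced here, this example assumes that Y is a weighted sum of the subject's standardized values with the coefficients α_i; that assumption, the function name `predict_prognosis`, the coefficient values, and the threshold are all hypothetical.

```python
def predict_prognosis(z_values, coefficients, threshold):
    """Return (Y, high_risk) for one subject.

    Assumes Y is a weighted sum of standardized genotype values; the
    actual form of Equation 2 may differ.
    """
    y_score = sum(a * z for a, z in zip(coefficients, z_values))
    # High likelihood only when Y strictly exceeds the threshold.
    return y_score, y_score > threshold


# Hypothetical coefficients and standardized values for three SNPs.
alpha = [0.8, 0.5, -0.3]
subject_z = [1.0, -1.0, 0.0]

y_score, high_risk = predict_prognosis(subject_z, alpha, threshold=0.0)
print(y_score, "high likelihood" if high_risk else "low likelihood")
```

The comparison is strict, matching the description: a value that merely equals the threshold does not count as exceeding it.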

The storage unit 30 stores data such as the genotype data received by the data reception unit 20 and the results processed by the processor 10, for example the generated model (502).

The output unit 40 outputs to the subject the result of the prediction of the subject's prognosis by the prediction unit 140. To report information to a user, the output unit 40 includes devices for presenting visual information (for example, a display, an LCD screen, an LED, or a scale indicator) and devices for presenting auditory information (for example, a speaker).

FIG. 6 is a flowchart of a method of generating a prognostic prediction model according to an embodiment of the present invention. Referring to FIG. 6, the method of generating a prognostic prediction model according to the present embodiment consists of operations processed in time series in the prognostic prediction model generation apparatus 1 shown in FIG. 1. Accordingly, even where omitted below, the description given above with respect to the prognostic prediction model generation apparatus 1 shown in FIG. 1 also applies to the method of generating a prognostic prediction model according to the present embodiment.

In operation 601, the SNP determination unit 1102 tests each of the SNPs obtained from the subjects with a plurality of genetic models that statistically test the genetic characteristics of an SNP, and determines, among the SNPs, the SNPs that significantly classify the subjects into a patient group and a control group.

In operation 602, the genetic model determination unit 1103 determines, for each of the determined SNPs, the most significant genetic model among the genetic models used in the testing.

In operation 603, the standardization unit 120 standardizes the encoded values of the genotypes of each of the determined SNPs based on the determined genetic model.

In operation 604, the model generation unit 130 generates a prognostic prediction model representing the correlation between the genotypes of the determined SNPs and the prognosis of the subjects by analyzing the standardized results with a statistical prediction algorithm.

Meanwhile, the above-described embodiments of the present invention can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. In addition, the structure of the data used in the above-described embodiments of the present invention can be recorded on a computer-readable recording medium through various means. The computer-readable recording medium includes storage media such as magnetic storage media (for example, ROM, floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and DVDs).

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in a descriptive sense and not for purposes of limitation. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

1: prognostic prediction model generation apparatus 10: processor
20: data reception unit 30: storage unit
40: output unit 110: SNP analysis unit
120: standardization unit 130: model generation unit
140: prediction unit 1101: test unit
1102: SNP determination unit 1103: genetic model determination unit

Claims

A method of generating a prognostic prediction model, the method comprising:
testing each of SNPs obtained from subjects with a plurality of genetic models that statistically test the genetic characteristics of an SNP, and determining, among the SNPs, SNPs that significantly classify the subjects into a patient group and a control group;
determining, for each of the determined SNPs, the most significant genetic model among the genetic models used in the testing;
standardizing encoded values of the genotypes of each of the determined SNPs based on the determined genetic model; and
generating a prognostic prediction model representing a correlation between the genotypes of the determined SNPs and the prognosis of the subjects by analyzing the standardized results with a statistical prediction algorithm.

The method of claim 1,
wherein the standardizing comprises converting, based on the types of the determined genetic models, values encoded under conditions that differ from SNP to SNP into values having the same criterion for each of the determined SNPs.

The method of claim 1,
wherein the determined genetic models include at least one of a dominant model, a recessive model, and an additive model, and
the prognostic prediction model is generated so as to reflect the results of using the determined genetic models.

The method of claim 1,
further comprising, before the determining of the SNPs, testing the significance of each of the SNPs using each of the genetic models,
wherein the determining of the SNPs and the determining of the genetic model are based on results of the significance testing.

The method of claim 4,
wherein the testing, the determining of the SNPs, and the determining of the genetic model use a Max Test method.

The method of claim 1,
further comprising predicting a prognosis for a subject using the generated prognostic prediction model.

The method of claim 1,
wherein the prediction algorithm comprises at least one of a classification algorithm, a machine learning algorithm, and a regression analysis algorithm.

The method of claim 1,
wherein the method of generating a prognostic prediction model is performed by a computing device including a processor.

A computer-readable recording medium storing a program for causing a computer to execute the method according to any one of claims 1 to 7.

An apparatus for generating a prognostic prediction model, the apparatus comprising:
an SNP determination unit that tests each of SNPs obtained from subjects with a plurality of genetic models that statistically test the genetic characteristics of an SNP, and determines, among the SNPs, SNPs that significantly classify the subjects into a patient group and a control group;
a genetic model determination unit that determines, for each of the determined SNPs, the most significant genetic model among the genetic models used in the testing;
a standardization unit that standardizes encoded values of the genotypes of each of the determined SNPs based on the determined genetic model; and
a model generation unit that generates a prognostic prediction model representing a correlation between the genotypes of the determined SNPs and the prognosis of the subjects by analyzing the standardized results with a statistical prediction algorithm.

The apparatus of claim 10,
wherein the standardization unit converts, based on the types of the determined genetic models, values encoded under conditions that differ from SNP to SNP into values having the same criterion for each of the determined SNPs.

The apparatus of claim 10,
wherein the determined genetic models include at least one of a dominant model, a recessive model, and an additive model, and
the prognostic prediction model is generated so as to reflect the results of using the determined genetic models.

The apparatus of claim 10,
further comprising a test unit that tests the significance of each of the SNPs using each of the genetic models before the SNPs are determined,
wherein the SNP determination unit and the genetic model determination unit each make their determination based on results of the significance testing.

The apparatus of claim 13,
wherein the test unit, the SNP determination unit, and the genetic model determination unit use a Max Test method.

The apparatus of claim 10,
further comprising a prediction unit that predicts a prognosis for a subject using the generated prognostic prediction model.

The apparatus of claim 10,
wherein the prediction algorithm comprises at least one of a classification algorithm, a machine learning algorithm, and a regression analysis algorithm.