KR101489536B1

KR101489536B1 - A Gene Expression Data Marker Identification Method for Two Groups

Info

Publication number: KR101489536B1
Application number: KR20100139605A
Authority: KR
Inventors: 이건명; 이찬희
Original assignee: 충북대학교 산학협력단
Priority date: 2010-12-30
Filing date: 2010-12-30
Publication date: 2015-02-04
Also published as: KR20120077593A

Abstract

The present invention discloses a method for finding, without analyst intervention, gene markers that can distinguish two sample groups using gene expression data, together with accuracy information for those markers. Given gene expression data for two sample groups, the method for identifying gene markers that separate the two groups normalizes the expression level of each gene so that its mean is zero and converts the expression level values into symbols. At each stage it selects, as that stage's marker, the gene that minimizes the complexity of partitioning the sample group, and it repeats the process of partitioning the sample group on that marker. Finally, it constructs a marker set from the gene patterns that characterize the remaining unpartitioned sample groups and determines the accuracy.

Description

A Gene Expression Data Marker Identification Method for Two Groups

The present invention relates to a data marker identification method and, more specifically, to a method for identifying markers that distinguish two groups from gene expression data for two sample groups.

The present invention relates to a method for identifying markers that can be used to distinguish two groups from gene expression data for two sample groups, and it belongs to the field of data analysis technology.

In the conventional data marker identification method, genes that show a difference in expression between the two sample groups are identified in numerically expressed gene expression data, and the analyst directly selects the genes judged to be meaningful and designates them as markers. However, this approach is both inaccurate and uneconomical.

The present invention is intended to solve the above problems of the prior art. Its object is to provide a method that finds, without analyst intervention, gene markers capable of distinguishing two sample groups using gene expression data, together with accuracy information for those markers.

To achieve this object, the present invention provides a method for identifying gene markers that distinguish two groups, given gene expression data for two sample groups. The method normalizes the expression level of each gene so that its mean is zero and converts the expression level values into symbols. At each stage it selects, as that stage's marker, the gene that minimizes the complexity of partitioning the sample group, and it repeats the process of partitioning the sample group on that marker. Finally, it constructs a marker set from the gene patterns that characterize the remaining unpartitioned sample groups and determines the accuracy.

According to the present invention, given gene expression data for two sample groups, a set of gene markers that distinguishes the two groups can be found, including markers of complex form. The method can therefore be applied effectively, for example, when searching for biomarkers at the level of gene expression data.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows the relationship between the input and the output of the method according to the present invention.
FIG. 2 shows, in algorithm form, the method for identifying markers that distinguish two given groups from gene expression data according to the present invention.
FIG. 3 shows the process by which the entire sample group is partitioned, stage by stage, by the symbol values of the selected genes according to the present invention.
FIG. 4 summarizes the symbols used in describing the present invention.

The present invention is described in more detail below with reference to the accompanying drawings.

First, FIG. 1 represents the components of the method according to the present invention: gene expression data (10), a method (20) that applies to that data the marker identification algorithm for distinguishing two groups according to the present invention, and a part (30) that provides the gene markers thus identified.

FIG. 2 expresses, as a concrete algorithm, the method (20) for identifying markers that distinguish two sample groups from gene expression data according to the present invention. The method proceeds as follows.
The gene expression data is first transformed with Equation 1 so that the mean value for each gene is zero, producing normalized data (201).
The normalized data is converted with Equation 2 into symbol-valued gene expression data (202).
For each gene in the gene set, the partition complexity over the entire sample set is calculated with Equation 3 (203).
The gene that minimizes the partition complexity is selected as the first partition gene (204).
The sample set is partitioned on that gene using Equation 4 (205).
For each of the resulting sample sets, the classification complexity is calculated with Equation 5 (206).
A sample set whose classification complexity is below a specified threshold, or whose number of samples is below a predefined threshold, needs no further partitioning. It is added as an element to the set of finished sample sets, and the marker information corresponding to it is added to the marker set as in Equation 6 (207).
The sample sets that were not added to the set of finished sample sets are combined into a sample set using Equation 7 (208).
For that sample set, the partition complexity is calculated for each gene in the gene set with Equation 3 (209).
The gene that minimizes the partition complexity is selected as the second partition gene (210).
The sample sets not included in the set of finished sample sets are partitioned on that gene as in Equation 8 (211).
For the partitioned sample sets, the process from step (206) onward is repeated until no sample set remains to be partitioned or the predefined number of marker genes is reached (212).
The final marker set contains the genes that can be used as markers for distinguishing the two sample groups, the corresponding symbol values, the name of the group to which a sample is assigned when the conditions are satisfied, and accuracy information. The accuracy of the marker set is calculated with Equation 9 (213).
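
The loop of steps (203) through (212) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented formulas: Equations 3 and 5 are not spelled out in this text, so Shannon entropy stands in for both complexities, and the names `find_markers`, `theta`, `min_size` and `max_genes` are hypothetical. Excluding already selected genes and reporting sample sets that are still open when the gene budget runs out are likewise assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy of a list of group labels (assumed complexity measure)."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def split(samples, gene, symbols):
    """Group sample indices by the symbol that `gene` takes in each sample."""
    parts = defaultdict(list)
    for s in samples:
        parts[symbols[gene][s]].append(s)
    return parts

def partition_complexity(samples, gene, symbols, labels):
    """Size-weighted entropy of the parts obtained by splitting on `gene`."""
    parts = split(samples, gene, symbols).values()
    return sum(len(p) / len(samples) * entropy([labels[s] for s in p]) for p in parts)

def find_markers(symbols, labels, max_genes=3, theta=0.2, min_size=2):
    # Each open set is (symbol pattern so far, sample indices still mixed).
    open_sets = [((), list(range(len(labels))))]
    genes, markers = [], []

    def emit(pattern, part):
        # Marker = (pattern, majority group, fraction of the part in that group).
        group, hits = Counter(labels[s] for s in part).most_common(1)[0]
        markers.append((pattern, group, hits / len(part)))

    while open_sets and len(genes) < min(max_genes, len(symbols)):
        pool = [s for _, part in open_sets for s in part]
        # Assumption: a gene is selected at most once.
        candidates = [g for g in range(len(symbols)) if g not in genes]
        gene = min(candidates,
                   key=lambda g: partition_complexity(pool, g, symbols, labels))
        genes.append(gene)
        next_open = []
        for pattern, part in open_sets:
            for sym, sub in split(part, gene, symbols).items():
                sub_labels = [labels[s] for s in sub]
                if entropy(sub_labels) < theta or len(sub) < min_size:
                    emit(pattern + (sym,), sub)       # finished set: record marker
                else:
                    next_open.append((pattern + (sym,), sub))
        open_sets = next_open
    for pattern, part in open_sets:  # assumption: report sets left at the gene limit
        emit(pattern, part)
    return genes, markers

# Toy data: rows are genes, columns are samples; symbols U/N/D are hypothetical.
labels = ["A", "A", "A", "B", "B", "B"]
symbols = [
    ["U", "U", "N", "N", "D", "D"],
    ["U", "D", "U", "D", "U", "D"],
    ["N", "N", "U", "D", "N", "N"],
]
genes, markers = find_markers(symbols, labels)
```

In this toy run, gene 0 is selected first because it leaves only the two `N` samples mixed, and a second gene then separates those two samples.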

FIG. 4 explains the meaning of the notation used in Equations 1 through 9.

Equation 1 transforms the gene expression level values so that the mean value for each gene is zero.
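
As a minimal illustration of this normalization, the sketch below subtracts each gene's mean from its expression values. It assumes that Equation 1 is plain mean-centering with no further scaling and that the data is laid out with one row per gene; the function name `center_genes` is hypothetical.

```python
from statistics import fmean

def center_genes(expr):
    """Subtract each gene's mean so that every gene (row) averages to zero.

    expr: list of rows, one row per gene, one column per sample (assumed layout).
    """
    centered = []
    for row in expr:
        mu = fmean(row)
        centered.append([x - mu for x in row])
    return centered

expr = [[2.0, 4.0, 6.0], [1.0, 1.0, 4.0]]
centered = center_genes(expr)
```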

Equation 2 gives the conversion rule used to transform data expressed as expression level values into symbols.
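
The conversion rule itself is not spelled out in this text, so the following sketch is purely illustrative. It assumes a three-symbol alphabet (`U` for up, `N` for neutral, `D` for down) and a symmetric cutoff `delta`; both are hypothetical choices rather than the rule of Equation 2.

```python
def to_symbol(value, delta=1.0):
    """Map a mean-centered expression value to a symbol (assumed rule)."""
    if value > delta:
        return "U"   # clearly above the gene's mean
    if value < -delta:
        return "D"   # clearly below the gene's mean
    return "N"       # within the neutral band

def symbolize(centered, delta=1.0):
    """Apply to_symbol to every value of a genes-by-samples table."""
    return [[to_symbol(x, delta) for x in row] for row in centered]

symbols = symbolize([[-2.0, 0.0, 2.0], [-1.0, -1.0, 2.0]], delta=1.0)
```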

Equation 3 calculates a complexity that measures, when a sample set is partitioned according to the symbol values of a gene, how the samples of the two sample sets are mixed within each partition.
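
One way to make such a measure concrete is a size-weighted Shannon entropy over the partitions, sketched below. The choice of entropy is an assumption, since the formula for Equation 3 is not spelled out in this text, and the names `partition_complexity` and `best_gene` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy of a list of group labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def partition_complexity(gene_symbols, labels):
    """Weighted entropy of the group labels after splitting on one gene."""
    parts = defaultdict(list)
    for sym, lab in zip(gene_symbols, labels):
        parts[sym].append(lab)
    n = len(labels)
    return sum(len(p) / n * entropy(p) for p in parts.values())

def best_gene(symbols, labels):
    """Index of the gene whose split mixes the two groups the least."""
    return min(range(len(symbols)),
               key=lambda g: partition_complexity(symbols[g], labels))

labels = ["A", "A", "B", "B"]
symbols = [
    ["U", "D", "U", "D"],  # gene 0: each part is half A, half B
    ["U", "U", "D", "D"],  # gene 1: each part contains a single group
]
chosen = best_gene(symbols, labels)
```

A value of 0 means every partition contains samples of one group only, and larger values mean more mixing.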

Equation 4 shows the partitioning of the entire sample set into subsets according to the symbol values of a gene.

Equation 5 calculates a complexity that measures how the samples of the two sample sets are mixed within a given sample set.

Equation 6 represents the set of marker information. For each sample set contained in the set of finished sample sets, the corresponding marker consists of the symbol values of the genes from the first selected gene through the most recently selected gene, the name of the original sample group to which a sample belongs when this condition is satisfied, and the accuracy obtained when judging with this marker.
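
A marker of this kind can be pictured as a small record, as in the sketch below. The class name `Marker`, its field names and the gene identifiers in the example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Marker:
    genes: tuple      # selected genes, in selection order
    symbols: tuple    # required symbol value for each selected gene
    group: str        # sample group assigned when the pattern matches
    accuracy: float   # accuracy of judging with this marker

    def matches(self, sample_symbols):
        """True when the sample shows the required symbol for every gene."""
        return all(sample_symbols[g] == s
                   for g, s in zip(self.genes, self.symbols))

marker = Marker(genes=("g7", "g2"), symbols=("N", "U"), group="A", accuracy=0.9)
sample = {"g7": "N", "g2": "U", "g5": "D"}
hit = marker.matches(sample)
```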

Equation 7 represents the set made up of the elements of the sample sets that still require further partitioning at a given stage.

Equation 8 shows that the partitioned sample sets of one stage are partitioned on the value of the corresponding marker gene to produce the sample sets of the following stage.

Equation 9 calculates the accuracy obtained when the marker set is used to determine the group to which a sample belongs.
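
The sketch below shows one plausible reading of such an accuracy: the fraction of samples whose matching marker names their true group. The formula for Equation 9 is not spelled out in this text, so this definition is an assumption, and the names `classify` and `marker_set_accuracy` are hypothetical.

```python
def classify(markers, sample):
    """Return the group of the first marker whose pattern the sample matches."""
    for genes, symbols, group in markers:
        if all(sample[g] == s for g, s in zip(genes, symbols)):
            return group
    return None  # no marker applies

def marker_set_accuracy(markers, samples, labels):
    """Fraction of samples assigned to their true group (assumed definition)."""
    correct = sum(classify(markers, s) == lab for s, lab in zip(samples, labels))
    return correct / len(samples)

# Each marker: (gene indices, required symbols, group); each sample: symbols by gene.
markers = [((0,), ("U",), "A"), ((0,), ("D",), "B")]
samples = [["U"], ["U"], ["D"], ["D"]]
labels = ["A", "B", "B", "B"]
accuracy = marker_set_accuracy(markers, samples, labels)
```

Here three of the four samples are assigned to their true group, so the accuracy is 0.75.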

[Equation 1]

[Equation 2]

[Equation 3]

[Equation 4]

[Equation 5]

[Equation 6]

[Equation 7]

[Equation 8]

[Equation 9]

Claims

A method for identifying gene markers that distinguish two groups, given gene expression data for two sample groups, the method comprising:
normalizing the expression level of each gene so that its mean value is zero;
converting the expression level values into symbols;
selecting, at each stage, the gene that minimizes the complexity of partitioning the sample group as the marker for that stage, and repeating the process of partitioning the sample group on that marker; and
constructing a marker set from the gene patterns that characterize the finally remaining unpartitioned sample groups, and determining the accuracy.