KR101943053B1

KR101943053B1 - System and method for detecting copy number variation

Info

Publication number: KR101943053B1
Application number: KR1020170025113A
Authority: KR
Inventors: 공진화; 신재문; 윤지희; 이은주
Original assignee: 한림대학교 산학협력단
Priority date: 2017-02-24
Filing date: 2017-02-24
Publication date: 2019-01-28
Also published as: KR20180098093A

Abstract

A system and method for detecting copy number variation (CNV) are presented that transform exome sequencing data with a four-step normalization technique, minimizing the noise characteristics of the data so that CNV regions can be extracted accurately. The normalization processing device of the system divides either ratio data computed from test data and control data, or the test data alone, into a plurality of equal-size segments within each target region. It then transforms the coverage data of each segment to follow a normal distribution, based on the standardized mean of the segment means and the ratio data of each segment, and extracts effective ratio data.

Description

System and method for detecting copy number variation

The present invention relates to a technique for detecting copy number variation (CNV), one type of structural variation (SV) in the human genome.

Copy number variation (CNV) is defined as a variation in which, when two different DNA sequences are compared, the copy number of a DNA segment of 50 bp or longer differs between them.

CNV is one of the most important types of variation and is associated with human diseases such as autism, intellectual disability, epilepsy, schizophrenia, childhood obesity, and cancer.

In particular, recent studies have continued to report the identification of causative genes based on the extraction of rare and de novo CNVs from exome sequencing data.

However, whole exome sequencing (WES) data have various noise characteristics, such as the small size of exome regions, the discontinuity of exome regions, and probe dependency.

Representative WES-based CNV extraction algorithms include ExomeCNV, Contra, CoNIFER, XHMM, and Excavator. Each adopts the following approach to correct the noise characteristics of WES data.

ExomeCNV applies the Geary-Hinkley transformation to bring the read coverage data to a normal distribution.

Contra adopts a normalization scheme that performs operations such as base-level log-ratio computation and library size correction.

CoNIFER and XHMM apply singular value decomposition (SVD) and principal component analysis (PCA) to remove the noise contained in read coverage data.

Excavator adopts median normalization to remove noise arising from GC content, mappability, and exon size.

However, because CoNIFER and XHMM use a very large number of samples simultaneously to apply SVD or PCA normalization, they are difficult to apply in constrained settings where many samples are not available.

Contra likewise requires many samples to generate its base control.

In addition, some algorithms apply the circular binary segmentation (CBS) algorithm as the segmentation method for delimiting variant regions.

However, CBS is generally used to segment continuous regions. When it is applied to discontinuous data such as exome coverage data, the probability of missing CNV regions that are very small or very large increases.

Korean Patent Application Publication No. 10-2017-0000744 (Title: Method and apparatus for analyzing copy number variation (CNV) of a gene)

The present invention is proposed to solve the problems described above. Its object is to provide a CNV detection system and method that transform exome sequencing data with a four-step normalization technique, minimizing the noise characteristics of the data so that CNV regions are extracted accurately.

To achieve the above object, a normalization processing device according to an embodiment of the present invention, which normalizes the input data of a CNV detection system, includes: a coverage calculation module that calculates coverage data from each of two inputs consisting of the read alignment results of test data and control data, extracts from each the read coverage data of the exon target regions only, and calculates ratio data of the read coverage data; a segmentation processing module that divides the ratio data for each target region into a plurality of equal-size segments, or divides the test data into a plurality of equal-size segments within each target region; a mean adjustment module that adjusts the mean of each segment produced by the segmentation processing module to a standardized mean; and a normalization transformation module that, based on the standardized mean from the mean adjustment module and the ratio data of each segment, transforms the coverage data of each segment to follow a normal distribution and extracts effective ratio data.

The coverage calculation module may calculate the ratio data from the read coverage data $R_T^e(i,j)$ of the test data and the read coverage data $R_C^e(i,j)$ of the control data, together with a constant $\varepsilon$, and may correct the ratio data by multiplying it by a correction coefficient $\omega$. The correction coefficient $\omega$ compensates for library size imbalance and may be the sum of the coverage values of all target regions in the control data divided by the sum of the coverage values of all target regions in the test data.

The segmentation processing module may divide the ratio data $R_{T|C}^e(i,j)$, $b_f(i) \le j \le b_l(i)$, for each target region $e_i$ ($1 \le i \le N_e$) into $n_s(i) = [N_b(i)/b_s] - 1$ segments of equal size $b_s$, producing equal-size segment ratio data indexed by $1 \le i \le N_e$, $1 \le k \le n_s(i)$, $1 \le j_k \le b_s$. When segmenting, the first segment of the target region starts at $b_f(i) = [(\mathrm{mod}(N_b(i), b_s) + b_s)/2] + 1$, and the data in the leftover regions at both ends, $R_{T|C}^e(i,j)$ for $1 \le j \le b_f(i) - 1$ and for $b_l(i) < j \le N_b(i)$, may be excluded.

The mean adjustment module may calculate the mean of each segment ($1 \le k \le n_s(i)$) and calculate a standardized mean from it using $M(mR_s^e)$ and $SD(mR_s^e)$, where $M(mR_s^e)$ is the mean of the segment means over all segments $S_{i,k}$ ($1 \le i \le N_e$, $1 \le k \le n_s(i)$) and $SD(mR_s^e)$ is the standard deviation of the segment means over all segments. If a calculated standardized mean is negative, the module may correct the standardized means by subtracting the minimum standardized mean from them.

The normalization transformation module may extract the effective ratio data by transforming the data at each position within each segment to follow a normal distribution, based on the standardized mean and the standard deviation of the ratio data of each segment.

To achieve the above object, a CNV detection system according to an embodiment of the present invention includes: a normalization processing device that either calculates ratio data from two inputs comprising test data and control data and normalizes it, or normalizes input data comprising test data alone; and a CNV extraction device that extracts CNVs from the input data normalized by the normalization processing device. The normalization processing device calculates base-level read coverage in each target region of the input data, segments the base-level read coverage to divide each target region into a plurality of equal-size segments, adjusts the mean of the read coverage data of each segment to a standardized mean, and transforms the coverage data of each segment to follow a normal distribution based on the standardized mean and the ratio data of each segment.

To achieve the above object, a normalization processing method according to an embodiment of the present invention normalizes input data with a normalization processing device for CNV detection, and includes the steps of: calculating target-level coverage from input data comprising test data and control data; extracting the read coverage data of each target region from the coverage data; calculating ratio data of the read coverage data; dividing the ratio data for each target region into a plurality of equal-size segments, or dividing the test data into a plurality of equal-size segments; adjusting the mean of each segment to a standardized mean; and transforming the coverage data of each segment to follow a normal distribution, based on the adjusted standardized mean and the ratio data of each segment, to extract effective ratio data.

The step of calculating the ratio data may include calculating the ratio data from the read coverage data $R_T^e(i,j)$ of the test data and the read coverage data $R_C^e(i,j)$ of the control data, together with a constant $\varepsilon$, and correcting the ratio data by multiplying it by a correction coefficient $\omega$. The correction coefficient $\omega$ may be the sum of the coverage values of all target regions in the control data divided by the sum of the coverage values of all target regions in the test data.

The step of dividing into a plurality of segments may divide the ratio data $R_{T|C}^e(i,j)$, $b_f(i) \le j \le b_l(i)$, for each target region $e_i$ ($1 \le i \le N_e$) into $n_s(i) = [N_b(i)/b_s] - 1$ segments of equal size $b_s$, producing equal-size segment ratio data indexed by $1 \le i \le N_e$, $1 \le k \le n_s(i)$, $1 \le j_k \le b_s$. When segmenting, the first segment of the target region starts at $b_f(i) = [(\mathrm{mod}(N_b(i), b_s) + b_s)/2] + 1$, and the data in the leftover regions at both ends, $R_{T|C}^e(i,j)$ for $1 \le j \le b_f(i) - 1$ and for $b_l(i) < j \le N_b(i)$, may be excluded.

The step of adjusting to the standardized mean may include: calculating the mean of each segment ($1 \le k \le n_s(i)$); calculating a standardized mean using $M(mR_s^e)$ and $SD(mR_s^e)$, where $M(mR_s^e)$ is the mean of the segment means over all segments $S_{i,k}$ ($1 \le i \le N_e$, $1 \le k \le n_s(i)$) and $SD(mR_s^e)$ is the standard deviation of the segment means over all segments; and, if a calculated standardized mean is negative, correcting the standardized means by subtracting the minimum standardized mean from them.

The step of extracting the effective ratio data may transform the data at each base position within each segment to follow a normal distribution, based on the standardized mean and the standard deviation of the ratio data of each segment, and thereby extract the effective ratio data.

To achieve the above object, a CNV detection method according to an embodiment of the present invention includes the steps of: normalizing, by a normalization processing device, input data comprising test data and control data, or test data alone; and extracting, by a CNV extraction device, CNVs from the input data normalized in the normalizing step. The normalizing step may include: calculating base-level read coverage in each target region of the input data; segmenting the base-level read coverage to divide each target region into a plurality of equal-size segments; adjusting the mean of the read coverage data of each segment to a standardized mean; and transforming the coverage data of each segment to follow a normal distribution based on the standardized mean and the ratio data of each segment.

According to the present invention, the CNV detection system and method transform exome sequencing data with a four-step normalization technique and can thereby minimize the noise characteristics of the data.

In addition, the CNV detection system and method can accurately extract CNV regions of various shapes and sizes through scale-space filtering.

FIG. 1 is a diagram illustrating a CNV detection system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the normalization processing device of FIG. 1.
FIGS. 3 and 4 are diagrams illustrating the CNV extraction device of FIG. 1.
FIG. 5 is a flowchart illustrating a CNV detection method according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the normalization processing step of FIG. 5.
FIG. 7 is a flowchart illustrating the CNV detection step of FIG. 5.

Hereinafter, the most preferred embodiment of the present invention is described with reference to the accompanying drawings, in sufficient detail that a person of ordinary skill in the art to which the invention pertains can readily practice its technical idea. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention.

The CNV detection system and method according to an embodiment of the present invention normalize and transform exome sequencing data to minimize the noise characteristics of the data, and accurately extract CNV regions of various shapes and sizes through scale-space filtering.

Referring to FIG. 1, the CNV detection system (100) according to an embodiment of the present invention includes a normalization processing device (120) and a CNV extraction device (140).

The normalization processing device (120) normalizes the input data to minimize the effects of various noise sources, such as GC content bias, library size effects, and coverage bias at exome boundary regions.

To this end, the normalization processing device (120) receives at least one of test data and control data as input data.

When only test data is input, the normalization processing device (120) performs normalization using only the read coverage data of the test data. When both test data and control data are input, it performs normalization using the ratio coverage data obtained from the coverage data of the test data and the control data. The following description takes the case in which both test data and control data are used as input as an example.

The normalization processing device (120) calculates base-level read coverage in each target region of the input data, segments the calculated base-level read coverage, adjusts the mean of the coverage data in each segment, and transforms the coverage data of each segment to follow a normal distribution.

Referring to FIG. 2, the normalization processing device (120) includes a coverage calculation module (122), a segmentation processing module (124), a mean adjustment module (126), and a normalization transformation module (128).

The coverage calculation module (122) calculates coverage at the target level using the input data. That is, it calculates coverage data ($R_T$, $R_C$) from the alignment results of the test data and the control data, which are exome sequencing data.

From the calculated coverage data ($R_T$, $R_C$), the coverage calculation module (122) extracts the read coverage data $R_T^e(i,j)$ and $R_C^e(i,j)$ at every position $j$ ($1 \le j \le N_b(i)$) of each target region $e_i$ ($1 \le i \le N_e$). Here, $N_e$ is the total number of target regions and $N_b(i)$ is the length of the $i$-th target region.

The coverage calculation module (122) calculates the ratio data $R_{T|C}^e(i,j)$ of the extracted read coverage data $R_T^e(i,j)$ and $R_C^e(i,j)$.

The ratio data is calculated using Equation 1, in which $\varepsilon$ is $10^{-3}$.

The coverage calculation module (122) then corrects the calculated ratio data for the library size of the two read coverage data sets. That is, as in Equation 2, it corrects the ratio data by multiplying $R_{T|C}^e(i,j)$ by a correction coefficient $\omega$.

Here, the correction coefficient is $\omega = mR_C^e / mR_T^e$, where $mR_C^e = \sum_{i=1}^{N_e} \sum_{j=1}^{N_b(i)} R_C^e(i,j)$ is the sum of the coverage values of all target regions in the control data and $mR_T^e = \sum_{i=1}^{N_e} \sum_{j=1}^{N_b(i)} R_T^e(i,j)$ is the sum of the coverage values of all target regions in the test data.

The segmentation processing module (124) performs segmentation on the target regions. In practice, the boundaries of most exome regions have lower coverage than their central portions. The segmentation processing module (124) therefore segments each target exome region $e_i$ ($1 \le i \le N_e$).

The segmentation processing module (124) divides the ratio data for each target region into a plurality of equal-size segments. That is, it divides the ratio data $R_{T|C}^e(i,j)$, $b_f(i) \le j \le b_l(i)$, for each target region $e_i$ ($1 \le i \le N_e$) into $n_s(i) = [N_b(i)/b_s] - 1$ segments of equal size $b_s$, producing equal-size segment ratio data indexed by $1 \le i \le N_e$, $1 \le k \le n_s(i)$, $1 \le j_k \le b_s$.

When segmenting, the segmentation processing module (124) starts the first segment of the target exome region at $b_f(i) = [(\mathrm{mod}(N_b(i), b_s) + b_s)/2] + 1$ and excludes the data in the leftover regions at both ends, $R_{T|C}^e(i,j)$ for $1 \le j \le b_f(i) - 1$ and for $b_l(i) < j \le N_b(i)$.
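The segmentation rule can be sketched as follows. This is a minimal sketch under stated assumptions: the square brackets in the formulas are read as the floor function, the last used position is taken to be $b_f(i) + n_s(i)\,b_s - 1$ (the text does not give it explicitly), and the function name `segment_target` is hypothetical.

```python
def segment_target(ratio, b_s):
    """Split one target region's ratio data into equal-size segments.

    ratio: per-base ratio values for one target (position 1 is ratio[0]).
    b_s:   segment size in bases.
    Returns (b_f, b_l, segments) with 1-based, inclusive b_f and b_l.
    """
    n_b = len(ratio)
    n_s = n_b // b_s - 1                  # [N_b / b_s] - 1, brackets read as floor
    if n_s < 1:
        return None, None, []             # target too short to segment
    b_f = (n_b % b_s + b_s) // 2 + 1      # first base of the first segment
    b_l = b_f + n_s * b_s - 1             # last base of the last segment (assumed)
    segments = [
        ratio[b_f - 1 + k * b_s : b_f - 1 + (k + 1) * b_s]
        for k in range(n_s)
    ]
    return b_f, b_l, segments


# A 25-base target with 10-base segments keeps bases 8..17 as one segment.
values = [float(j) for j in range(1, 26)]
b_f, b_l, segments = segment_target(values, 10)
```

In this toy case 7 bases are dropped on the left and 8 on the right, so the low-coverage boundary bases of the target never enter a segment.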

The mean adjustment module (126) uses a T-score to adjust the mean of each segment produced by the segmentation processing module (124) to a mean standardized over the entire target exome (hereinafter, the standardized mean).

The mean adjustment module (126) calculates the mean of each segment ($1 \le k \le n_s(i)$) using Equation 3.

The mean adjustment module (126) then calculates the standardized mean of each segment ($1 \le k \le n_s(i)$) using Equation 4.

Here, $M(mR_s^e)$ is the mean of the segment means over all segments $S_{i,k}$ ($1 \le i \le N_e$, $1 \le k \le n_s(i)$), and $SD(mR_s^e)$ is the standard deviation of the segment means over all segments $S_{i,k}$.

The mean adjustment module (126) corrects the standardized means so that they take positive values when a calculated standardized mean is negative. That is, if a calculated standardized mean is negative, it subtracts the minimum standardized mean from the standardized mean.

Here, the minimum standardized mean is $\min(tR_s^e(i,k))$ over $1 \le i \le N_e$, $1 \le k \le n_s(i)$.
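The mean adjustment step can be sketched as below. Equations 3 and 4 are not reproduced in this text, so the sketch makes loud assumptions: the segment mean is the arithmetic mean, the standardization is a plain z-score $(mR - M)/SD$ without any T-score scaling constants, the standard deviation is the population form, and the shift is applied to all segments whenever the minimum is negative. The function name `standardized_means` is hypothetical.

```python
from statistics import mean, pstdev


def standardized_means(segments):
    """Return one standardized mean per segment.

    segments: equal-size lists of ratio values, pooled over all targets.
    """
    seg_means = [mean(seg) for seg in segments]   # per-segment mean
    m = mean(seg_means)                           # M(mR): mean of segment means
    sd = pstdev(seg_means)                        # SD(mR): population SD (assumption)
    if sd == 0:
        scores = [0.0 for _ in seg_means]         # all segment means identical
    else:
        scores = [(x - m) / sd for x in seg_means]  # plain z-score (assumption)
    lowest = min(scores)
    if lowest < 0:
        # Shift so that the smallest standardized mean becomes zero.
        scores = [s - lowest for s in scores]
    return scores


segments = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
scores = standardized_means(segments)
```

After the shift, no standardized mean is negative, which matches the stated purpose of the correction; the exact scale of the values depends on the constants in Equation 4 and would differ from this sketch.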

The normalization transformation module (128) transforms the coverage data of each segment to follow a normal distribution, based on the standardized mean and the ratio data of each segment.

That is, based on the standardized mean $tR_s^e(i,k)$ and the standard deviation of the ratio data of each segment ($1 \le i \le N_e$, $1 \le k \le n_s(i)$, $1 \le j_k \le b_s$), the normalization transformation module (128) transforms the data at each base position $j$ ($b_f(i) + (k-1)\,b_s \le j \le b_f(i) + k\,b_s$) within each segment $S_{i,k}$ ($1 \le i \le N_e$, $1 \le k \le n_s(i)$) to follow a normal distribution.

In this way, the normalization transformation module (128) extracts the effective ratio data.

The CNV extraction device (140) determines CNV regions from the effective ratio data extracted by the normalization processing device (120), using scale-space filtering. Scale-space filtering is a signal processing technique that extracts the events contained in a signal by using a scale-space image, obtained by converting the input data into multi-resolution data, together with a fingerprint map, obtained by extracting inflection point information from the input data.

Using scale-space filtering, the CNV extraction device (140) extracts the CNV shapes contained in the coverage data as events. That is, it tracks changes in the values of the continuous coverage data, interprets a region in which coverage values stay consecutively below or above the average as indicating a loss or a gain in the sequence of the test data, and reports such regions as CNV regions.
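The idea of a run of consecutive values below or above the average can be illustrated with a deliberately simplified sketch. It is a plain threshold-run detector, not the scale-space filtering the patent describes; the `tolerance` and `min_length` parameters and the function name `find_runs` are assumptions introduced only for illustration.

```python
def find_runs(values, baseline, tolerance, min_length):
    """Return (start, end, label) for maximal runs that stay off the baseline.

    start and end are 0-based, inclusive indices into `values`.
    label is "gain" above baseline + tolerance, "loss" below baseline - tolerance.
    """
    def label_of(v):
        if v > baseline + tolerance:
            return "gain"
        if v < baseline - tolerance:
            return "loss"
        return None

    runs = []
    start, current = 0, None
    for idx, v in enumerate(values + [baseline]):   # sentinel closes the last run
        label = label_of(v)
        if label != current:
            if current is not None and idx - start >= min_length:
                runs.append((start, idx - 1, current))
            start, current = idx, label
    return runs


# Made-up signal: one dip and one rise around a baseline of 1.0.
signal = [1.0, 1.0, 0.5, 0.5, 0.5, 1.0, 1.5, 1.5, 1.5, 1.5, 1.0]
runs = find_runs(signal, baseline=1.0, tolerance=0.2, min_length=3)
```

A fixed threshold like this is sensitive to noise and to the choice of `min_length`, which is the kind of limitation that motivates analyzing the signal at several resolutions instead.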

FIG. 3 shows an example of extracting event regions with scale-space filtering. The lower part shows the scale-space image of a given signal (the bottom-most signal, at $\sigma = \sigma_0$), and the upper part shows the fingerprint map of the scale-space image. Here, $\sigma$ is the scale parameter that controls the resolution.

In addition, the vertical dotted lines show how the fingerprint map is constructed from the inflection points of the scale-space image at σ = σ₀, and the horizontal arrows show how event information, each event being bounded by two neighboring inflection points, is extracted at σ = σ_k.

To this end, referring to FIG. 4, the CNV extraction device 140 comprises a layer separation module 142, a fingerprint map generation module 144, a baseline adjustment module 146, and a CNV extraction module 148.

The layer separation module 142 applies Gaussian convolution to the effective ratio data extracted by the normalization processing device 120 to separate it into a plurality of scale-space image layers. The Gaussian convolution is given by Equation 5. To simplify the notation, the effective ratio data is hereafter written as c[i].

Here, c[i, k] is the scale-space image, g[j, σ_k] is the Gaussian kernel, and k (0 ≤ k ≤ L-1) is the index of one layer among the L scale-space image layers. m is the window size of the Gaussian kernel and σ_k is the scale parameter; empirically, they may be set to m = 3σ_k and σ_k = 10³ × (1.1)^k. The value σ_k = 10³ × (1.1)^k may be chosen in consideration of the size of the CNVs to be detected and the time complexity.
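Equation 5 itself is not reproduced in this text. One discrete Gaussian convolution that is consistent with the symbols defined above is given here as an assumed reconstruction, not as the patent's exact formula. It treats m as the half-width of the kernel window, which is also an assumption.

```latex
c[i,k] = \sum_{j=-m}^{m} c[i-j]\, g[j,\sigma_k],
\qquad
g[j,\sigma_k] = \frac{1}{\sqrt{2\pi}\,\sigma_k}
  \exp\!\left(-\frac{j^{2}}{2\sigma_k^{2}}\right),
\qquad 0 \le k \le L-1
```

With m = 3σ_k the window covers three standard deviations on each side of the kernel center, so the discarded tails carry well under 1% of the kernel weight.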

However, since convolution in the time domain is equivalent to multiplication in the frequency domain, each scale-space image layer c[i, k] may be obtained by applying the discrete Fourier transform in order to reduce the computational complexity.
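As a minimal sketch of the frequency-domain shortcut described above, the following NumPy code computes each layer by multiplying FFTs. The function names, the kernel normalization (weights sum to 1), the zero padding at the borders, and the use of m = 3σ_k as the kernel half-width are assumptions and are not details given in the patent.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sampled Gaussian on [-m, m] with m = round(3 * sigma) (assumed half-width)."""
    m = int(round(3 * sigma))
    j = np.arange(-m, m + 1)
    g = np.exp(-(j ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()  # normalize so the weights sum to 1 (assumption)

def scale_space_layers(c, sigmas):
    """Return an array of shape (len(sigmas), len(c)), one smoothed layer per sigma."""
    c = np.asarray(c, dtype=float)
    layers = []
    for sigma in sigmas:
        g = gaussian_kernel(sigma)
        n = len(c) + len(g) - 1  # length of the full linear convolution
        # Multiplying the zero-padded spectra equals linear convolution in position space.
        full = np.fft.irfft(np.fft.rfft(c, n) * np.fft.rfft(g, n), n)
        m = (len(g) - 1) // 2
        layers.append(full[m:m + len(c)])  # keep the part aligned with the input
    return np.vstack(layers)
```

With the schedule from the text, the call would look like `scale_space_layers(c, [1e3 * 1.1 ** k for k in range(L)])`, where `c` is the effective ratio data and `L` is the number of layers.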

The fingerprint map generation module 144 may obtain the zero crossings of the second derivative for each of the layers separated by the layer separation module 142. The second derivative c''[i, k] of the scale-space image c[i, k] of layer k may be approximated as in Equation 6, and the zero-crossing signal z[i, k] of the second derivative may be defined as in Equation 7.

That is, z[i, k] takes the value 1 when the second derivative changes from negative to positive across the zero crossing i, and the value -1 when it changes from positive to negative. If i is not a zero crossing, z[i, k] is 0.
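Equations 6 and 7 are not reproduced in this text. The sketch below assumes the usual central second difference c[i+1, k] - 2c[i, k] + c[i-1, k] for the second derivative, and marks a crossing at the later of two adjacent positions whose second differences have strictly opposite signs. Both choices, and the function name, are assumptions.

```python
import numpy as np

def zero_crossing_signal(layer):
    """Return z[i] in {-1, 0, +1} for one scale-space layer."""
    layer = np.asarray(layer, dtype=float)
    d2 = np.zeros_like(layer)
    # Central second difference; the two end points are left at 0.
    d2[1:-1] = layer[2:] - 2.0 * layer[1:-1] + layer[:-2]
    z = np.zeros(layer.shape[0], dtype=int)
    prev, curr = d2[:-1], d2[1:]
    z[np.flatnonzero((prev < 0) & (curr > 0)) + 1] = 1    # negative -> positive
    z[np.flatnonzero((prev > 0) & (curr < 0)) + 1] = -1   # positive -> negative
    return z
```

Positions where the second difference is exactly 0 are not treated as sign changes in this sketch; a production version would need a rule for such plateaus.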

The fingerprint map generation module 144 may generate a single fingerprint map by marking the zero crossings obtained for each layer. That is, a zero crossing is marked on the fingerprint map where z[i, k] is +1 or -1, and nothing is marked where z[i, k] is 0.
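A fingerprint map of this kind can be held as a simple L × N integer array. The sketch below, with hypothetical function names, stacks the per-layer zero-crossing signals and lists the marked points; how the marks are drawn is left open.

```python
import numpy as np

def fingerprint_map(z_layers):
    """Stack per-layer zero-crossing signals into an L x N integer array."""
    return np.vstack([np.asarray(z, dtype=int) for z in z_layers])

def marked_points(fp):
    """Return (layer k, position i, sign) for every marked zero crossing."""
    ks, idx = np.nonzero(fp)
    return [(int(k), int(i), int(fp[k, i])) for k, i in zip(ks, idx)]
```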

The baseline adjustment module 146 may obtain a baseline for each layer using the scale-space image values that fall within a preset range, where the preset range may be determined from the mean and standard deviation of the scale-space image. If the mean of the scale-space image is m(k) and its standard deviation is δ(k), the preset range may be set to m(k) ± wδ(k). The coefficient w of δ(k) is a weight that may be set to an arbitrary value, and m(k) and δ(k) are obtained from Equations 8 and 9, respectively.

In one embodiment of the present invention, the weight w is set to 2. For each layer, the baseline adjustment module 146 excludes the values outside m(k) ± 2δ(k) and recomputes the mean and standard deviation over the values within m(k) ± 2δ(k). To distinguish them from the mean m(k) and standard deviation δ(k) above, these are called the effective mean m*(k) and the effective standard deviation δ*(k).

Using the effective mean and effective standard deviation of the scale-space image values within the preset range, the baseline adjustment module 146 may obtain the baselines m*(k), m*(k) + dδ*(k), and m*(k) - dδ*(k) for each layer.

Here, the coefficient d of δ*(k) is a weight that may take various values depending on the case.

Furthermore, the baseline adjustment module 146 may obtain baselines for all of the scale-space image layers, or only for those layers that have at least two nonzero values of the zero-crossing signal z[i, k]. In other words, depending on the case, the baseline adjustment module 146 may obtain baselines only for layers with two or more zero crossings.
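The baseline procedure can be sketched as follows. Equations 8 and 9 are not reproduced in this text, so the code assumes the arithmetic mean and the population standard deviation. The function name and the returned tuple layout are hypothetical.

```python
import numpy as np

def layer_baselines(layers, fingerprint, w=2.0, d=3.0):
    """Return {k: (lower, center, upper)} for layers with at least two zero crossings."""
    layers = np.asarray(layers, dtype=float)
    fingerprint = np.asarray(fingerprint)
    result = {}
    for k, (values, z) in enumerate(zip(layers, fingerprint)):
        if np.count_nonzero(z) < 2:
            continue  # optional filter: skip layers with fewer than two zero crossings
        m_k, delta_k = values.mean(), values.std()  # population std (assumption)
        kept = values[np.abs(values - m_k) <= w * delta_k]  # drop values outside m(k) +/- w*delta(k)
        m_eff, delta_eff = kept.mean(), kept.std()  # effective mean and standard deviation
        result[k] = (m_eff - d * delta_eff, m_eff, m_eff + d * delta_eff)
    return result
```

Recomputing the statistics after trimming keeps large CNV excursions from inflating the baseline, which is presumably why the text distinguishes m*(k) and δ*(k) from m(k) and δ(k).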

The CNV extraction module 148 may determine the regions for detecting CNVs from the baselines obtained by the baseline adjustment module 146. These regions may be determined using the effective mean and the effective standard deviation. According to one embodiment of the present invention, the CNV extraction module 148 may determine the regions, only for scale-space image layers with at least two zero crossings, using the baseline m*(k) ± 3δ*(k), that is, with d set to 3. A layer may have several such regions or none at all. The m-th region of layer k for detecting CNVs is denoted [l_{m,k}, u_{m,k}], where l_{m,k} and u_{m,k} bound the index i of the zero-crossing signal z[i, k] marked on the fingerprint map, so that l_{m,k} ≤ i ≤ u_{m,k}. Setting d to 3 is one embodiment of the present invention; the value of d is not limited to 3.

The regions for detecting CNVs may be determined based on the following three conditions.

First, no index i that belongs to a region already designated for CNV detection in a layer above k, that is, in layers k+1 to L-1, may be included in [l_{m,k}, u_{m,k}].

Second, the product of z[l_{m,k}, k] and z[u_{m,k}, k] must be less than 0, and z[i, k] must be 0 for every i strictly between l_{m,k} and u_{m,k}. That is, one of z[l_{m,k}, k] and z[u_{m,k}, k] is +1, the other is -1, and all values between them are 0.

Equivalently, on the fingerprint map, one of the points at i = l_{m,k} and i = u_{m,k} is marked 'o', the other is marked '+', and nothing is marked between them.

Finally, the mean of the scale-space image over [l_{m,k}, u_{m,k}] must fall outside the baseline m*(k) ± 3δ*(k) given above. This mean may be expressed as in Equation 10.
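The three conditions can be checked with a scan from the coarsest layer (L-1) down to layer 0. This is a sketch under stated assumptions: the function name, the `baselines` dictionary mapping k to the (lower, upper) bounds m*(k) ∓ dδ*(k), and the "gain"/"loss" labels are illustrative, and the region mean is taken as the plain average of c[i, k] over the closed interval.

```python
import numpy as np

def extract_regions(layers, fingerprint, baselines):
    """Return (k, l, u, label) for every interval that satisfies the three conditions."""
    layers = np.asarray(layers, dtype=float)
    fingerprint = np.asarray(fingerprint)
    n_layers, n = layers.shape
    taken = np.zeros(n, dtype=bool)  # positions claimed by regions in higher layers
    regions = []
    for k in range(n_layers - 1, -1, -1):
        if k not in baselines:
            continue
        lower, upper = baselines[k]
        marks = np.flatnonzero(fingerprint[k])
        found = []
        # Consecutive marks have only zeros between them by construction.
        for l, u in zip(marks[:-1], marks[1:]):
            if fingerprint[k, l] * fingerprint[k, u] >= 0:
                continue  # condition 2: the end points must have opposite signs
            if taken[l:u + 1].any():
                continue  # condition 1: no overlap with regions from layers above k
            mean = layers[k, l:u + 1].mean()
            if lower <= mean <= upper:
                continue  # condition 3: the mean must fall outside the baseline band
            found.append((k, int(l), int(u), "gain" if mean > upper else "loss"))
        for _, l, u, _ in found:
            taken[l:u + 1] = True  # block these positions for the lower layers only
        regions.extend(found)
    return regions
```

Positions are marked as taken only after a whole layer has been processed, because the first condition refers to regions designated in layers above k, not to neighbors within the same layer.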

Referring to FIG. 5, the method for detecting copy number variation according to an embodiment of the present invention comprises a normalization processing step (S100) and a CNV detection step (S200).

In the normalization processing step (S100), at least one of test data and control data is received as input data. If only test data is input, the normalization processing device 120 performs the normalization using only the read coverage data of the test data.

If both test data and control data are input, the normalization processing step (S100) performs the normalization using ratio coverage data obtained from the coverage data of the test data and the control data.

The normalization processing step (S100) computes the base-level read coverage in each target region of the input data, segments the computed base-level read coverage, adjusts the mean of the coverage data in each segment, and transforms the coverage data of each segment to have a normal distribution.

Referring to FIG. 6, the normalization processing step (S100) comprises a read coverage calculation step (S120), a segmentation step (S140), a step of adjusting the average value of the coverage data in each segment (S160), and a step of transforming the coverage data of each segment to have a normal distribution (S180).

In the read coverage calculation step (S120), coverage is computed per target using the input data. That is, the coverage data (R_T, R_C) are computed from the read alignment results of the test data and the control data, which are exome sequencing data.

In the read coverage calculation step (S120), the read coverage data at every position of each target region is extracted from the computed coverage data, and ratio data of the extracted read coverage data is computed. Here, the ratio data is the ratio between the test data and the control data.

In the read coverage calculation step (S120), the computed ratio data is corrected to account for the library size of the read coverage data. The correction is made by multiplying the ratio data by a correction coefficient.
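The ratio and its library-size correction might look like the sketch below. The correction coefficient is not defined in this passage, so the code assumes it is the total control coverage divided by the total test coverage. It also assumes the ratio is test over control and that positions with zero control coverage are left undefined (NaN). The function name is hypothetical.

```python
import numpy as np

def corrected_ratio(test_cov, control_cov):
    """Per-position test/control ratio scaled by an assumed library-size coefficient."""
    test_cov = np.asarray(test_cov, dtype=float)
    control_cov = np.asarray(control_cov, dtype=float)
    ratio = np.full(test_cov.shape, np.nan)
    ok = control_cov > 0  # avoid division by zero
    ratio[ok] = test_cov[ok] / control_cov[ok]
    factor = control_cov.sum() / test_cov.sum()  # assumed correction coefficient
    return ratio * factor
```

Under these assumptions, a test sample sequenced at half the depth of the control, with no copy number change, yields a corrected ratio of 1 at every position.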

In the segmentation step (S140), segmentation is performed on the target regions. Because the boundaries of most exome regions in practice have lower coverage than their central portions, each target exome region is segmented.

In the segmentation step (S140), the ratio data for each target region is divided into a plurality of segments of equal size. The data in the regions left over at both ends of the target exome region after the division is excluded.
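A minimal sketch of the equal-size segmentation follows. The text does not say how the leftover positions are split between the two ends, so this version assumes they are trimmed as evenly as possible. The function name and the `seg_size` parameter are hypothetical.

```python
import numpy as np

def segment_target(values, seg_size):
    """Split one target region into equal-size segments, dropping leftovers at both ends."""
    values = np.asarray(values, dtype=float)
    n_seg = len(values) // seg_size
    rem = len(values) - n_seg * seg_size
    start = rem // 2  # trim the remainder roughly evenly from both ends (assumption)
    trimmed = values[start:start + n_seg * seg_size]
    return trimmed.reshape(n_seg, seg_size)  # one row per segment
```

For a target of 11 positions and `seg_size=3`, this keeps positions 1 to 9 and discards one position at each end.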

In the average value adjustment step (S160), the mean of each segment produced in the segmentation step (S140) is adjusted, using the T-score, to a mean standardized over all target exomes (hereinafter, the standardized mean). If a standardized mean is negative, it is corrected to be positive: the minimum of the standardized means is subtracted from the standardized means so that the result is no longer negative.
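The adjustment can be sketched as below. The T-score formula is not given in this passage, so the code assumes the conventional form, center + scale × z with center 50 and scale 10, where z is the z-score of each segment mean over all segments. The function name and both parameters are assumptions.

```python
import numpy as np

def standardized_segment_means(segments, center=50.0, scale=10.0):
    """Standardize segment means and shift them if any value is negative."""
    seg_means = np.asarray(segments, dtype=float).mean(axis=1)
    z = (seg_means - seg_means.mean()) / seg_means.std()  # undefined if all means are equal
    t = center + scale * z  # conventional T-score (assumption)
    if (t < 0).any():
        t = t - t.min()  # subtract the minimum so that no value stays negative
    return t
```

Calling it with `center=0.0, scale=1.0` reduces the score to a plain z-score, in which case the shift always applies and the smallest standardized mean becomes 0.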

In the transformation step (S180), the coverage data of each segment is transformed to have a normal distribution based on the standardized mean adjusted in the average value adjustment step (S160) and the ratio data of each segment. That is, the data at each base position within each segment is transformed to have a normal distribution based on the standardized mean and the standard deviation of the ratio data of the segment, and the effective ratio data is thereby extracted.

The CNV detection step (S200) calculates CNV regions based on the effective ratio data extracted in the normalization processing step (S100). It uses scale-space filtering to extract CNV-shaped patterns in the coverage data as events. That is, the CNV detection step (S200) tracks the changes in value along the contiguous coverage data, interprets a region in which the coverage stays consistently below or above the mean as indicating a loss or a gain in the sequence of the test data, and reports such regions as CNV regions.

Referring to FIG. 7, the CNV detection step (S200) comprises a layer separation step (S220), a fingerprint map generation step (S240), a baseline adjustment step (S260), and a CNV extraction step (S280).

In the layer separation step (S220), Gaussian convolution is applied to the effective ratio data extracted in the normalization processing step (S100) to separate it into a plurality of scale-space image layers.

In the fingerprint map generation step (S240), the zero crossings of the second derivative may be obtained for each of the layers separated in the layer separation step (S220), and a single fingerprint map may be generated by marking the zero crossings obtained for each layer. That is, a zero crossing is marked on the fingerprint map where the zero-crossing signal is +1 or -1, and nothing is marked where it is 0.

In the baseline adjustment step (S260), a baseline may be obtained for each layer using the scale-space image values that fall within a preset range, where the preset range may be determined from the mean and standard deviation of the scale-space image.

In the baseline adjustment step (S260), the baseline of each layer is obtained using the effective mean and effective standard deviation of the scale-space image values within the preset range.

In the baseline adjustment step (S260), baselines may be obtained for all of the scale-space image layers, or only for those layers that have at least two nonzero zero-crossing signal values. In other words, depending on the case, baselines may be obtained only for layers with two or more zero crossings.

In the CNV extraction step (S280), the regions for detecting CNVs may be determined from the baselines obtained in the baseline adjustment step (S260). These regions may be determined using the effective mean and the effective standard deviation, and only for scale-space image layers with at least two zero crossings. A layer may have several such regions or none at all.

Although preferred embodiments of the present invention have been described above, various modifications are possible, and it is understood that a person of ordinary skill in the art may make various variations and modifications without departing from the scope of the claims of the present invention.

100: copy number variation detection system
120: normalization processing device 122: coverage calculation module
124: segmentation processing module 126: average value adjustment module
128: normalization conversion module 140: CNV extraction device
142: layer separation module 144: fingerprint map generation module
146: baseline adjustment module 148: CNV extraction module

Claims

A normalization processing device for normalizing input data of a copy number variation detection system, comprising:
a coverage calculation module that calculates coverage data for each of two pieces of input data, namely the read alignment results of test data and of control data, extracts read coverage data of each exome target region from the respective coverage data, and calculates ratio data of the read coverage data;
a segmentation processing module that divides the ratio data calculated by the coverage calculation module into a plurality of segments of equal size for the target region, or divides the test data into a plurality of segments of equal size for the target region;
an average value adjustment module that adjusts the average values of the plurality of segments divided by the segmentation processing module to standardized average values; and
a normalization conversion module that converts the coverage data of each segment to have a normal distribution based on the standardized average values adjusted by the average value adjustment module and the ratio data of each segment, thereby extracting effective ratio data,
wherein the normalization conversion module
converts the data at each base position within each segment to have a normal distribution based on the standardized average value and the standard deviation of the ratio data of each segment, thereby extracting the effective ratio data.

delete

A method for normalizing input data using a normalization processing device for detecting copy number variation, comprising:
calculating coverage per target based on input data including test data and control data;
extracting read coverage data of each target region based on the coverage data;
calculating ratio data of the read coverage data;
dividing the ratio data for the target region into a plurality of segments of equal size, or dividing the test data into a plurality of segments of equal size;
adjusting the average values of the divided plurality of segments to standardized average values; and
converting the coverage data of each segment to have a normal distribution based on the standardized average values and the ratio data of each segment, thereby extracting effective ratio data,
wherein, in the step of extracting the effective ratio data,
the data at each base position within each segment is converted to have a normal distribution based on the standardized average value and the standard deviation of the ratio data of each segment, thereby extracting the effective ratio data.

delete