KR20120044096A

KR20120044096A - Apparatus and system for haplotype phasing

Info

Publication number: KR20120044096A
Application number: KR1020100105500A
Authority: KR
Inventors: 신수용; 이용석; 홍태희; 김판규; 박민서; 권제근
Original assignee: 삼성에스디에스 주식회사
Priority date: 2010-10-27
Filing date: 2010-10-27
Publication date: 2012-05-07
Also published as: KR101173257B1

Abstract

PURPOSE: A haplotype phasing method and apparatus are provided that perform haplotype phasing based on sequence analysis data, thereby increasing phasing accuracy. CONSTITUTION: An SNP (single nucleotide polymorphism) matrix generating unit generates heterozygous SNP matrix data from sequence data (S104). A haplotype data generating unit inputs the heterozygous SNP matrix data into a combinatorial optimization process that uses an evaluation function, and thereby generates haplotype data. A diploid generating unit receives the haplotype data and constructs a diploid based on a consensus sequence and a genotype (S110).

Description

Haplotype phasing method and apparatus {Apparatus and system for haplotype phasing}

The present invention relates to a haplotype phasing method and apparatus. More specifically, it relates to a method and apparatus that increase the accuracy of haplotype phasing by making broad use of data obtained by sequencing the organism to be phased.

A single nucleotide polymorphism (SNP) is a single base, within the nucleotide sequence of a chromosome in the cell nucleus, at which individuals differ. In humans, SNPs are known to occur on average once every 200-300 bp (base pairs). The presence of SNPs is one reason why traits differ slightly even among members of the same species.

A haplotype can be understood as the sequence of the SNPs contained in a single chromosome. Adjacent SNPs are known to tend to be expressed in similar ways, and such a set of SNPs can be understood as a haplotype.

Human chromosomes, except for the male sex chromosomes, come in pairs of homologous chromosomes, and most diseases and genetic traits are typically expressed through a variation arising in one of the two homologs. Knowing the genome sequence is therefore important, but knowing the sequence of each individual chromosome copy is arguably more important. In particular, knowing the haplotype makes it possible to identify the cause of a disease more precisely and can be applied to drug response, helping to make personalized medicine a reality. Common patterns of human genetic variation and evolutionary patterns within families can also be inferred. To this end, the HapMap Project (http://www.hapmap.org), a large-scale project with participants from countries around the world including the United States, Europe, China, and Japan, has been under way since 2002.

When a haplotype is constructed, the same locus may carry different alleles; such a locus is called heterozygous. To determine exactly which chromosome copy each allele at a heterozygous locus came from, the alleles must be assigned using sequence data produced by a DNA sequencer. This process is called haplotype phasing.

Haplotype phasing has been proven to be an NP-complete problem, for which no efficient method of finding the optimal solution is known (G. Lancia, V. Bafna, S. Istrail, R. Lippert, & R. Schwartz, SNPs Problems, Complexity, and Algorithms, Lecture Notes in Computer Science, 2161: 182-193, 2001). Nevertheless, methods with limited ability to solve NP problems, such as dynamic programming and statistical methods, have traditionally been widely used. This is because phasing has so far relied mainly on SNP chip data, whose limited information content keeps the search space small, so that even traditional methods could find solutions close to the optimum.

Recently, however, as the cost of sequencing has fallen, sequencing has begun to be applied in many fields. As a result, there is a growing need for a method that discovers haplotypes by making broad use of diverse sequencing information, rather than analyzing only the limited sequence data generated by conventional SNP chips.

The technical problem to be solved by the present invention is to provide a haplotype phasing method and system that increase accuracy by making broad use of sequencing data.

Another technical problem to be solved by the present invention is to provide a whole-genome haplotype phasing method and system that operate on full-length sequence data.

Yet another technical problem to be solved by the present invention is to provide a method and system that perform haplotype phasing by additionally using sequence data of organisms that are family members of the organism to be phased.

Yet another technical problem to be solved by the present invention is to provide a method and system that construct a complete diploid by applying haplotype phasing analysis to the entire sequence of an individual (the consensus sequence) obtained with a sequencer.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

A haplotype phasing method according to one aspect of the present invention for achieving the above objects includes: generating, from sequence data, heterozygous SNP matrix data, which is data on the heterozygous SNP alleles contained in each fragment; and a combinatorial optimization step of computing an optimal haplotype solution from the set of fragments of the heterozygous SNP matrix on the basis of fitness, which is the value of an evaluation function. Preferably, at least one of a read depth and a quality score value included in the sequence data is reflected in at least one of the generation of the SNP matrix data and the evaluation function.

The haplotype phasing method may further include generating a diploid by combining the consensus sequence of the species of the target organism from which the sequence data were obtained, the genotype, and the optimal haplotype solution.

When the SNP matrix data are generated, each fragment, which is a row of the SNP matrix, may be evaluated using the read depth and the quality score, and the SNP matrix data may be generated with fragments at or below a reference value excluded. In addition, when the value of an SNP in haplotype solution k differs from the value of the corresponding SNP in a fragment of the SNP matrix, the evaluation function may add a penalty that reflects the read depth and quality score of that SNP value in that fragment.

The haplotype phasing method may further include generating the sequence data using a DNA sequencer that produces data on the full-length sequence of the target organism. In this case, generating the heterozygous SNP matrix data may include comparing the sequence data with standard sequence data of the target organism. In addition, generating the sequence data using the DNA sequencer may include generating the sequence data using a mate-pair library or a paired-end library, and generating the heterozygous SNP matrix data may include generating the matrix data using fragments formed by joining reads of the sequence data at their overlapping sites.

Generating the haplotype data may include inputting the heterozygous SNP matrix data into an estimation of distribution (search-point distribution learning) process that uses the evaluation function.

Generating the haplotype data by inputting the heterozygous SNP matrix data into the estimation of distribution process may include: a first step of generating current-generation solutions using sequence data of the family of the target organism; a second step of computing the fitness of the current-generation solutions using the evaluation function; a third step of selecting a subset of the full set of current-generation solutions; a fourth step of learning the distribution of each heterozygous SNP, reflecting the fitness of each solution in the selected subset; a fifth step of evaluating whether a termination condition is satisfied; and a sixth step of, if the termination condition is not satisfied, regenerating the current-generation solutions and performing the second step again.

The termination condition may be at least one of the following: the number of times the second step has been repeated reaches a limit, or the fitness of the current-generation solutions fails to exceed that of the previous generation by at least a preset amount.
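The termination condition described above can be sketched as a small predicate. This is only an illustration: the function name `should_stop`, the parameters `max_generations` and `min_gain`, and their default values are assumptions, not values given in the specification.

```python
def should_stop(generation, fitness_history, max_generations=100, min_gain=1e-3):
    """Return True if either termination condition holds.

    generation       -- how many times the evaluation step has been repeated
    fitness_history  -- best fitness per generation, oldest first
    max_generations  -- hypothetical repetition limit
    min_gain         -- hypothetical minimum improvement between generations
    """
    # Condition 1: the repetition budget is exhausted.
    if generation >= max_generations:
        return True
    # Condition 2: fitness did not improve by at least min_gain.
    if len(fitness_history) >= 2:
        return fitness_history[-1] - fitness_history[-2] < min_gain
    return False
```

Either condition alone ends the search, which matches the "at least one of" wording.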

Generating the haplotype data may include inputting the heterozygous SNP matrix data into an evolutionary computation process that uses the evaluation function.

Generating the haplotype data by inputting the heterozygous SNP matrix data into the evolutionary computation process may include: a first step of generating current-generation solutions using sequence data of the family of the target organism; a second step of computing the fitness of the current-generation solutions using the evaluation function; a third step of performing at least one of a crossover operation and a mutation operation on a selected subset of the full set of current-generation solutions; a fourth step of evaluating whether a termination condition is satisfied; and a fifth step of, if the termination condition is not satisfied, regenerating the current-generation solutions and performing the second step again.

The evaluation function may reflect the read depth and the quality score value in a first penalty value that is applied when the solution being evaluated does not agree with a value in the SNP matrix.

The evaluation function may further apply a second penalty depending on whether the solution being evaluated agrees with the sequence data of the family of the target organism.

A haplotype phasing apparatus according to another aspect of the present invention may include an SNP matrix generating unit that generates heterozygous SNP matrix data from sequence data, and a haplotype data generating unit that generates haplotype data by inputting the heterozygous SNP matrix data into a combinatorial optimization process that uses an evaluation function. Here, a read depth and a quality score value included in the sequence data may be reflected in at least one of the generation of the SNP matrix data and the evaluation function.

The haplotype phasing apparatus may further include a diploid generating unit that constructs a diploid using the haplotype data generated by the haplotype data generating unit, a consensus sequence, and a genotype.

A computer-readable recording medium according to yet another aspect of the present invention may have recorded on it a computer program that performs the steps of generating heterozygous SNP matrix data from sequence data and generating haplotype data by inputting the heterozygous SNP matrix data into a combinatorial optimization process that uses an evaluation function, wherein a read depth and a quality score value included in the sequence data are reflected in at least one of the generation of the SNP matrix data and the evaluation function.

The computer program may further perform a diploid generation step of constructing a diploid using the generated haplotype data, a consensus sequence, and a genotype.

According to the present invention as described above, haplotype phasing is performed by making broad use of sequencing data, which reduces the constraints placed on the target of phasing and increases accuracy.

In addition, further use of the sequence data of the family of the target organism increases the accuracy of haplotype phasing.

FIG. 1 is a flowchart of a haplotype phasing method according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of a method of merging overlapping reads according to an embodiment of the present invention.
FIG. 3 is a flowchart of a method of applying an estimation of distribution (search-point distribution learning) algorithm in a haplotype phasing method according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method of applying evolutionary computation in a haplotype phasing method according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which the invention belongs; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form includes the plural unless the wording specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those mentioned.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless expressly and specifically defined.

Hereinafter, a haplotype phasing method according to an embodiment of the present invention will be described with reference to FIG. 1.

First, sequence data of the target organism are generated (S100). The target organism may be, but is not limited to, a human. The sequence data comprise data on the sequence itself of the four bases (A, T, C, G) that make up DNA, together with accompanying data. The accompanying data may be, for example, a quality score, a read depth, a consensus sequence, and a genotype. The consensus sequence may be interpreted as the standard sequence of the species of the target organism.

The sequence data may be full-length sequence data generated using a DNA sequencer. In this case, the haplotype phasing method according to this embodiment may further include computing the haplotype corresponding to the full-length sequence data of the target organism and then using it to construct a diploid.

Next, overlapping reads may be merged to produce longer fragments (S102). A read is a single contiguous piece of sequence produced by the DNA sequencing procedure.

Next, a heterozygous SNP matrix is generated (S104).

Because DNA sequencing involves fragmenting and amplifying the DNA, the reads produced by sequencing may overlap. In particular, when sequencing is performed with a mate-pair library or a paired-end library, as shown in FIG. 2, if a single read contains two or more heterozygous SNPs (200), or the two reads of a pair together contain two or more heterozygous SNPs (202), the paired-read data at the positions of those heterozygous SNPs can be used to increase the length of the fragment.

A heterozygous SNP is an SNP at which the two alleles differ. A homozygous SNP, in contrast, is one at which the two alleles are the same. Loci consisting of homozygous SNPs have no influence on determining the haplotype, so the present invention performs haplotype phasing using a heterozygous SNP matrix built only from the sequence data of the heterozygous SNPs of the target organism. According to the present invention, heterozygous SNPs can be readily identified by comparing the full-length sequence data with the standard sequence data of the target organism.

Table 1 is an example of a heterozygous SNP matrix according to this embodiment.

Table 1 shows a case with six heterozygous SNPs and seven fragments. Table 1 is provided only to aid understanding of this embodiment; in practice, haplotype phasing would be performed with far more heterozygous SNPs and fragments. Each entry of the heterozygous SNP matrix (hereinafter 'S') satisfies S(i,j) ∈ {0, 1, -}. Here, 0 and 1 denote the two alleles (for example, if '0' denotes one allele, '1' denotes the other), and '-' means that fragment i does not cover SNP j.
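A minimal sketch of how such a matrix might be held in memory is given below. The matrix values are hypothetical (Table 1 itself is not reproduced in this text), and the helper names `covered_sites` and `is_valid` are illustrative assumptions.

```python
MISSING = "-"  # fragment does not cover this SNP site

# Hypothetical example matrix: rows are fragments (i), columns are
# heterozygous SNP sites (j). These values are NOT those of Table 1.
S = [
    list("01--"),
    list("-10-"),
    list("10--"),
    list("--01"),
]

def covered_sites(fragment):
    """Return the indices j of the SNP sites that a fragment covers."""
    return [j for j, allele in enumerate(fragment) if allele != MISSING]

def is_valid(matrix):
    """Check that every entry is one of the three symbols 0, 1, '-'."""
    return all(a in ("0", "1", MISSING) for row in matrix for a in row)
```

For instance, `covered_sites(S[1])` lists the two sites spanned by the second fragment.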

When the SNP matrix data are generated, at least one of the read depth and the quality score values included in the sequence data may be used to generate the SNP matrix with unreliable fragments excluded. This reflects the fact that the higher the read depth of given sequence data, the higher its reliability, and the larger the quality score values of the bases in a given sequence, the higher its reliability.

For example, each fragment, which is a row of the SNP matrix, may be evaluated using the read depth and the quality score, and the SNP matrix data may be generated with fragments whose evaluation score is at or below a reference value excluded. This addresses the problem that fragments likely to contain errors lengthen the computation needed to find the optimal solution and may introduce errors into the optimal solution itself.
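The filtering step might look like the following sketch. The specification does not say how read depth and quality score are combined into one evaluation score, so the sketch simply applies a separate threshold to each; the `Fragment` fields, the scaling of quality to [0, 1], and both threshold values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    alleles: str         # e.g. "01--"; '-' marks a site that is not covered
    mean_quality: float  # assumed: mean base quality scaled to [0, 1]
    read_depth: int      # assumed: number of reads supporting the fragment

def filter_fragments(fragments, min_quality=0.8, min_depth=5):
    """Drop fragments at or below either (hypothetical) reference value."""
    return [
        f for f in fragments
        if f.mean_quality > min_quality and f.read_depth > min_depth
    ]
```

The surviving fragments then form the rows of the heterozygous SNP matrix.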

Because the target organism has two haplotypes, one for each homologous chromosome, fragments 0-6 should in principle divide cleanly into two sets, one per haplotype. Errors can occur during DNA sequencing, however, so the two haplotypes must be determined with such errors taken into account.

The present invention introduces a method of performing haplotype phasing with a combinatorial optimization process. That is, the phasing problem is recast as the problem of finding the optimal haplotype that satisfies the heterozygous SNP matrix. Accordingly, a combinatorial optimization process is applied to the heterozygous SNP matrix (S106) to provide an optimal haplotype solution (S108).

In addition, a step (S110) of constructing a diploid using the haplotype data generated by the haplotype data generating unit, the consensus sequence, and the genotype may be performed.

The diploid construction step (S110) is performed by combining the consensus sequence, the genotype, and the optimal haplotype solution. Each haploid sequence is generated from the consensus sequence by default, and only the heterozygous SNP positions are determined using the optimal haplotype solution.

For example, suppose a region of the consensus sequence is 'ATGCATGC' and the first T and the last C are heterozygous SNPs (assumed to be T/C and C/A SNPs, respectively). If the haplotype is found to be 10 (assuming that, for T/C and C/A, the first allele is coded 1 and the second 0), then 'ATGCATGA' and 'ACGCATGC' can be generated as the two haploid sequences of the target organism.
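The worked example above can be reproduced with a short sketch. The function name `build_diploid` and the dictionary layout for the SNP sites are illustrative assumptions; the sequences and the 1/0 coding are taken from the example.

```python
def build_diploid(consensus, snp_sites, haplotype):
    """Build two haploid sequences from a consensus and a phased haplotype.

    consensus -- consensus sequence string
    snp_sites -- {0-based position: (allele coded 1, allele coded 0)}
    haplotype -- string of '0'/'1', one character per SNP in position order
    """
    hap_a = list(consensus)
    hap_b = list(consensus)
    for bit, pos in zip(haplotype, sorted(snp_sites)):
        one_allele, zero_allele = snp_sites[pos]
        if bit == "1":
            hap_a[pos], hap_b[pos] = one_allele, zero_allele
        else:
            hap_a[pos], hap_b[pos] = zero_allele, one_allele
    return "".join(hap_a), "".join(hap_b)

# First T (index 1) is a T/C SNP, last C (index 7) is a C/A SNP.
sites = {1: ("T", "C"), 7: ("C", "A")}
print(build_diploid("ATGCATGC", sites, "10"))
# prints ('ATGCATGA', 'ACGCATGC')
```

All positions other than the heterozygous SNP sites are copied unchanged from the consensus sequence, as the diploid construction step describes.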

Hereinafter, the method of applying the combinatorial optimization process (S106) will be described in more detail with reference to FIGS. 3 and 4. Combinatorial optimization refers to optimization problems in which the feasible solutions belong to a discrete set, or can be reduced to one, and the goal is to find the best solution; the term covers a class of optimization problems, and the associated algorithms and problem-solving processes, widely used in applied mathematics and computer science. For related background, see, for example, Alexander Schrijver, A Course in Combinatorial Optimization, February 1, 2006.

First, the evaluation function that is common to the combinatorial optimization processes will be described. The combinatorial optimization process may refer to a step of inputting a haplotype candidate solution into the evaluation function to compute its value, the fitness, and computing the optimal haplotype solution from the set of fragments of the heterozygous SNP matrix on the basis of that fitness.

The evaluation function may reflect the read depth and the quality score value in a first penalty value that is applied when the haplotype candidate solution given as input does not agree with a value in the SNP matrix. Equation 1 below is an example of the evaluation function f(k) according to this embodiment.

Equation 1: (the formula for f(k) is not reproduced in this text)

In Equation 1, S_ij denotes the value of SNP j in fragment i of the heterozygous SNP matrix, and Ind_k(j) denotes the value of SNP j in haplotype solution k. Haplotype solution k is the haplotype candidate solution expressed in formula form.

M is the number of SNPs in the heterozygous SNP matrix, and N is the number of fragments in the heterozygous SNP matrix.

The function w may be defined as

$$
w\left(Ind_k(j),\, S_{ij}\right) =
\begin{cases}
0 & \text{if } Ind_k(j) = S_{ij} \ \text{or}\ S_{ij} = \text{'-'} \\
1 - \mathrm{quality\_score}(S_{ij}) & \text{otherwise}
\end{cases}
$$

That is, w(Ind_k(j), S_ij) returns 0 if the j-th SNP of haplotype solution k equals the value of SNP j in fragment i of the SNP matrix, or if that value is '-'; otherwise it returns 1 - quality_score(S_ij).

The quality_score(S_ij) function may use the read depth information corresponding to S_ij to return the average of the quality score values of the individual reads. For example, if the read depth of the bases corresponding to S_ij is 30x, it may return the average of the quality score values of the 30 bases corresponding to S_ij.
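The averaging described above can be sketched as follows. The text does not state how raw sequencer qualities are scaled, so this sketch assumes, purely for illustration, that Phred scores are first converted to a confidence in [0, 1] using the standard relation error = 10^(-Q/10). The helper name `phred_to_confidence` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def phred_to_confidence(q: float) -> float:
    """Assumed normalization: 1 minus the Phred error probability 10^(-Q/10)."""
    return 1.0 - 10 ** (-q / 10)


def quality_score(base_qualities: Sequence[float]) -> float:
    """Average quality over all reads covering one SNP-matrix entry.

    The number of values equals the read depth at that entry
    (for example, 30 values at 30x depth).
    """
    if not base_qualities:
        raise ValueError("at least one base quality is required")
    return mean(base_qualities)


if __name__ == "__main__":
    # 30x depth, every base called at Phred 30
    depth_30 = [phred_to_confidence(30)] * 30
    print(quality_score(depth_30))
```

With this normalization, an entry backed by 30 reads at Phred 30 yields a quality score close to 0.999, so a mismatch at that entry would cost about 0.001 under the penalty 1 - quality_score(S_ij).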

That is, unlike conventional haplotype phasing methods, the evaluation function of the present invention can make broad use of the sequencing results of the target organism.

An evaluation function that additionally reflects the sequence information of family members of the target organism may also be used. In this case, the evaluation function f'(k) is as follows.

Here, the second penalty term may be a function that returns 0 if haplotype candidate solution k (Ind_k) matches the sequence information of the target organism's family members, and 1 if it does not.
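One way to combine the two penalties is sketched below. The exact form of f'(k) is not reproduced in the text, so this sketch assumes that the family penalty is simply added to the base fitness with a hypothetical weight, and that "matching" means the candidate is identical to at least one haplotype observed in a family member. Both assumptions, and the names `family_penalty` and `evaluate_with_family`, are illustrative only.

```python
from typing import List, Sequence


def family_penalty(hap: Sequence[int],
                   family_haplotypes: Sequence[Sequence[int]]) -> int:
    """Return 0 if the candidate equals any family haplotype, otherwise 1."""
    matches = any(list(hap) == list(fam) for fam in family_haplotypes)
    return 0 if matches else 1


def evaluate_with_family(base_fitness: float,
                         hap: Sequence[int],
                         family_haplotypes: Sequence[Sequence[int]],
                         weight: float = 1.0) -> float:
    """Assumed form of f'(k): base fitness f(k) plus a weighted family penalty."""
    return base_fitness + weight * family_penalty(hap, family_haplotypes)


if __name__ == "__main__":
    family: List[List[int]] = [[0, 1, 1], [1, 0, 0]]
    print(evaluate_with_family(0.2, [0, 1, 1], family))  # matches a family haplotype
    print(evaluate_with_family(0.2, [0, 0, 1], family))  # does not match
```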

Next, the case in which the application of the combinatorial optimization process (S106) uses the search point distribution learning algorithm is described with reference to FIG. 3.

The search point distribution learning algorithm (estimation of distribution algorithm) is an optimization algorithm that learns from given data to build a model capable of representing the distribution of the data, generates new data from that model, selects the suitable candidates among the generated data, and gradually revises the model with the selected data while searching for the optimal solution (P. Larranaga & J. A. Lozano, Estimation of distribution algorithms: A new tool for evolutionary computation. Kluwer Academic Publishers, 2002). Unlike conventional evolutionary computation, it uses no crossover or mutation operations; it relies only on the selection operation and finds the optimal solution by capturing the distribution of the data. It is known to find practically usable (near-optimal) solutions for many NP-hard problems. Because the phasing problem has also been proven to be NP-hard, this algorithm can be regarded as a well-suited choice.

First, an initial solution is generated (S1600). The sequence information of the target organism's family can be used when generating the initial solution: the haplotypes of the family are examined to find the proportion of each haplotype, and the initial solution is generated based on those proportions. For example, if examining the family haplotypes at a specific position of the sequence shows that the frequency of the AG SNP is 0.95, then when the initial solution is generated, the haplotype at that position can be assumed to be the AG type with a probability of 95%.
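The frequency-based initialization of S1600 can be sketched as follows. This is an illustrative sketch: it assumes each heterozygous SNP position is coded as 0 or 1, where 1 stands for the phase observed in the family (for example, the AG type), and that the family frequencies are supplied as one probability per position. The names `initial_population` and `family_freq` are hypothetical.

```python
import random
from typing import List, Sequence


def initial_population(family_freq: Sequence[float],
                       size: int,
                       rng: random.Random) -> List[List[int]]:
    """Sample `size` candidate haplotypes.

    Position j is set to 1 with probability family_freq[j], so a family
    frequency of 0.95 produces the family-supported phase in about 95%
    of the initial candidates.
    """
    return [
        [1 if rng.random() < p else 0 for p in family_freq]
        for _ in range(size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    population = initial_population([0.95, 0.5, 0.0, 1.0], 1000, rng)
    share = sum(ind[0] for ind in population) / len(population)
    print(f"share of candidates with the family phase at position 0: {share:.3f}")
```

Positions for which no family information is available could be given a frequency of 0.5, which reduces to uniform random initialization.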

Next, the fitness of the current generation solutions is calculated using the evaluation function (S1602). Because the family sequence information can also be reflected in the evaluation function, it can be used continuously throughout the search.

Next, a subset of the entire solution set is selected (S1604). The associations among the heterozygous SNPs are then learned, reflecting the fitness of the selected solutions (S1606). Here, the heterozygous SNPs may be regarded as mutually independent (linkage equilibrium) or as mutually dependent (linkage disequilibrium). In the dependent case, it may be assumed that only two adjacent SNPs influence each other, or that an unspecified number of SNPs depend on one another. A learning model is built according to the chosen assumption, and the distribution is learned. The learning method may follow the methodology of P. Larranaga & J. A. Lozano, Estimation of distribution algorithms: A new tool for evolutionary computation, Kluwer Academic Publishers, 2002.
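Steps S1604 and S1606 can be sketched for the simplest case, in which the heterozygous SNPs are treated as mutually independent (linkage equilibrium). The sketch assumes truncation selection (keeping the best-scoring fraction) and assumes that a lower fitness value is better because the evaluation function accumulates penalties. Neither choice is fixed by the text, and the names `select_best` and `learn_marginals` are hypothetical.

```python
from typing import List, Sequence


def select_best(population: Sequence[List[int]],
                fitnesses: Sequence[float],
                fraction: float = 0.5) -> List[List[int]]:
    """S1604: keep the best-scoring fraction of the population (lower is better)."""
    ranked = sorted(zip(fitnesses, population), key=lambda pair: pair[0])
    keep = max(1, int(len(ranked) * fraction))
    return [individual for _, individual in ranked[:keep]]


def learn_marginals(selected: Sequence[List[int]]) -> List[float]:
    """S1606 under independence: per-SNP frequency of allele 1 among the selected."""
    count = len(selected)
    num_snps = len(selected[0])
    return [sum(ind[j] for ind in selected) / count for j in range(num_snps)]


if __name__ == "__main__":
    population = [[0, 0], [1, 1], [1, 0], [0, 1]]
    fitnesses = [3.0, 0.0, 1.0, 2.0]
    selected = select_best(population, fitnesses)
    print(selected)
    print(learn_marginals(selected))
```

Modeling dependence between two adjacent SNPs would replace the per-SNP frequencies with conditional frequencies of each SNP given its predecessor; that variant is not shown here.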

When learning is finished and the model has been generated, it is determined whether the termination condition is satisfied (S1608). The termination condition may be at least one of the following: the number of generation iterations reaching a preset count, and the fitness of the current generation solutions failing to increase over that of the previous generation by at least a preset value.
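The termination check of S1608 can be sketched as follows. The sketch assumes that fitness is a penalty, so improvement is measured as a decrease of the best value from one generation to the next. The parameter names `max_generations` and `min_improvement` are hypothetical.

```python
from typing import Sequence


def should_stop(generation: int,
                max_generations: int,
                best_history: Sequence[float],
                min_improvement: float) -> bool:
    """Return True when either termination condition holds.

    1. The generation count has reached the preset limit.
    2. The best fitness improved by less than the preset value
       compared with the previous generation.
    """
    if generation >= max_generations:
        return True
    if len(best_history) >= 2:
        improvement = best_history[-2] - best_history[-1]
        if improvement < min_improvement:
            return True
    return False


if __name__ == "__main__":
    print(should_stop(3, 10, [5.0, 4.0], 0.01))    # still improving
    print(should_stop(3, 10, [4.0, 3.995], 0.01))  # improvement below threshold
```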

If the termination condition is not satisfied, the current generation solutions are regenerated using the learning result (S1610), and step S1602 is performed again.
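The loop formed by steps S1600 to S1610 can be sketched end to end as follows. This is a minimal, self-contained illustration under several assumptions that the text does not fix: SNPs are modeled as independent, the best half of each generation is selected, fitness is a penalty (lower is better), learned probabilities are clamped to [0.05, 0.95] to keep some diversity, and quality scores lie in [0, 1]. All function and parameter names are hypothetical.

```python
import random


def evaluate(hap, matrix, quality):
    """Sum of mismatch penalties (1 - quality) over all covered entries."""
    return sum(
        1.0 - q
        for frag, qrow in zip(matrix, quality)
        for h, s, q in zip(hap, frag, qrow)
        if s != '-' and s != h
    )


def run_eda(matrix, quality, init_probs, pop_size=200, max_gen=20,
            min_improvement=1e-9, seed=0):
    rng = random.Random(seed)
    probs = list(init_probs)              # may come from family frequencies
    best_hap, best_fit = None, float('inf')
    prev_best = None
    for _ in range(max_gen):              # S1608: iteration limit
        # S1600 / S1610: sample the current generation from the model
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        # S1602: score every candidate
        scored = sorted(((evaluate(ind, matrix, quality), ind) for ind in pop),
                        key=lambda pair: pair[0])
        if scored[0][0] < best_fit:
            best_fit, best_hap = scored[0]
        # S1604: keep the best half
        selected = [ind for _, ind in scored[:pop_size // 2]]
        # S1606: relearn per-SNP probabilities from the selected candidates
        probs = [
            min(0.95, max(0.05, sum(ind[j] for ind in selected) / len(selected)))
            for j in range(len(probs))
        ]
        # S1608: stop when the best fitness no longer improves
        if prev_best is not None and prev_best - scored[0][0] < min_improvement:
            break
        prev_best = scored[0][0]
    return best_hap, best_fit


if __name__ == "__main__":
    matrix = [[0, 1, 1], [0, 1, '-'], ['-', 1, 1]]
    quality = [[0.9] * 3 for _ in matrix]
    print(run_eda(matrix, quality, init_probs=[0.5, 0.5, 0.5]))
```

In the toy input every fragment is consistent with the haplotype 0-1-1, so that candidate has a total penalty of zero and the search is expected to return it.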

FIG. 4 shows the case in which the application of the combinatorial optimization process (S106) uses an evolutionary operation; the evolutionary operation follows known methods such as those described in T. Bäck, Evolutionary Algorithms in Theory and Practice, Oxford University Press, 1996.
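The crossover and mutation operations used by an evolutionary operation can be sketched as follows. The text does not specify which operators are used, so one-point crossover and bit-flip mutation are shown here only as common, assumed choices for 0/1-coded haplotypes. The function names are hypothetical.

```python
import random
from typing import List, Tuple


def one_point_crossover(a: List[int], b: List[int],
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap the tails of two parents at a random cut point (length >= 2 required)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(hap: List[int], rate: float, rng: random.Random) -> List[int]:
    """Flip each 0/1 allele independently with probability `rate`."""
    return [1 - x if rng.random() < rate else x for x in hap]


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed only to make the demo repeatable
    child_a, child_b = one_point_crossover([0, 0, 0, 0], [1, 1, 1, 1], rng)
    print(child_a, child_b)
    print(mutate(child_a, 0.1, rng))
```

In contrast to the search point distribution learning process, new candidates here come from recombining and perturbing selected parents rather than from sampling a learned distribution.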

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

S102 Fragment generation
S104 Heterozygous SNP matrix generation
S106 Application of the combinatorial optimization process for finding the optimal haplotype solution from the heterozygous SNP matrix

Claims

1. A haplotype phasing method comprising:
generating, from base sequence data, heterozygous SNP matrix data, which is data on the heterozygous SNP types included in each fragment; and
a combinatorial optimization step of calculating an optimal haplotype solution from each fragment set of the heterozygous SNP matrix based on the fitness, which is the function value of an evaluation function,
wherein at least one of a read depth and a quality score value included in the base sequence data is reflected in at least one of the generation of the SNP matrix data and the evaluation function.

2. The method according to claim 1,
wherein generating the heterozygous SNP matrix data comprises
evaluating each fragment of the SNP matrix by the read depth and the quality score, and generating the SNP matrix data excluding fragments that fall below a reference value.

3. The method according to claim 1,
wherein the evaluation function
reflects the read depth and the quality score value in a first penalty value given when the haplotype candidate solution, which is input data, does not match a value of the SNP matrix.

4. The method of claim 3,
wherein the evaluation function
further reflects a second penalty according to whether the solution under evaluation matches the sequence data of the family of the target organism.

5. The method according to claim 1,
further comprising generating the base sequence data using a DNA sequencer that produces data on the full-length base sequence of the target organism.

6. The method of claim 5,
wherein generating the heterozygous SNP matrix data comprises
comparing the base sequence data with standard sequence data of the target organism to generate the heterozygous SNP matrix data.

7. The method of claim 5,
wherein generating the base sequence data using the DNA sequencer that produces data on the full-length base sequence of the target organism comprises generating the base sequence data using a mate-pair library or a paired-end library, and
generating the heterozygous SNP matrix data comprises generating the heterozygous SNP matrix data using fragments formed by joining overlapping regions among the reads of the base sequence data.

8. The method according to claim 1,
wherein generating the haplotype data comprises
inputting the heterozygous SNP matrix data to a search point distribution learning process that uses the evaluation function to generate the haplotype data.

9. The method of claim 8,
wherein generating the haplotype data by inputting the heterozygous SNP matrix data to the search point distribution learning process that uses the evaluation function comprises:
a first step of generating current generation solutions using sequence data of the family of the target organism;
a second step of calculating the fitness of the current generation solutions using the evaluation function;
a third step of selecting a subset of the entire set of current generation solutions;
a fourth step of learning a distribution of each heterozygous SNP by reflecting the fitness of each solution of the selected subset;
a fifth step of evaluating whether a termination condition is satisfied; and
a sixth step of regenerating the current generation solutions and performing the second step again when the termination condition is not satisfied.

10. The method of claim 9,
wherein the sixth step comprises
regenerating the current generation solutions and performing the second step again when the termination condition is not satisfied, the termination condition being at least one of the number of re-executions of the second step and the fitness of the current generation solutions not increasing over the fitness of the previous generation solutions by a preset value or more.

11. The method according to claim 1,
wherein generating the haplotype data comprises
inputting the heterozygous SNP matrix data to an evolutionary operation process that uses the evaluation function to generate the haplotype data.

12. The method of claim 11,
wherein generating the haplotype data by inputting the heterozygous SNP matrix data to the evolutionary operation process that uses the evaluation function comprises:
a first step of generating current generation solutions using sequence data of the family of the target organism;
a second step of calculating the fitness of the current generation solutions using the evaluation function;
a third step of performing at least one of a crossover operation and a mutation operation on a selected subset of the entire set of current generation solutions;
a fourth step of evaluating whether a termination condition is satisfied; and
a fifth step of regenerating the current generation solutions and performing the second step again when the termination condition is not satisfied.

13. The method according to claim 1,
further comprising constructing a diploid using the calculated optimal haplotype solution, a consensus sequence, and a genotype.

14. The method of claim 13,
wherein constructing the diploid comprises:
generating each haploid using the consensus sequence; and
determining the loci of the haploids that correspond to heterozygous SNPs using the optimal haplotype solution and the genotype.

15. A haplotype estimation apparatus comprising:
an SNP matrix generating unit that generates heterozygous SNP matrix data from base sequence data; and
a haplotype data generating unit that generates haplotype data by inputting the heterozygous SNP matrix data to a combinatorial optimization process that uses an evaluation function,
wherein at least one of the generation of the SNP matrix data and the evaluation function reflects a read depth and a quality score value included in the base sequence data.

16. The apparatus of claim 15,
further comprising a diploid generating unit that receives the haplotype data generated by the haplotype data generating unit and constructs a diploid using a consensus sequence and a genotype.

17. A computer-readable recording medium having recorded thereon a computer program that performs:
generating heterozygous SNP matrix data from base sequence data; and
generating haplotype data by inputting the heterozygous SNP matrix data to a combinatorial optimization process that uses an evaluation function,
wherein read depth and quality score values included in the base sequence data are reflected in at least one of the generation of the SNP matrix data and the evaluation function.

18. The recording medium of claim 17,
wherein the computer program
further performs a step of constructing a diploid using the generated haplotype data, a consensus sequence, and a genotype.