KR102452413B1

KR102452413B1 - Method for detecting chromosomal abnormality using distance information between nucleic acid fragments

Info

Publication number: KR102452413B1
Application number: KR1020200103240A
Authority: KR
Inventors: 기창석; 조은해; 이준남
Original assignee: 주식회사 지씨지놈
Priority date: 2019-08-19
Filing date: 2020-08-18
Publication date: 2022-10-11
Also published as: KR20210021923A

Abstract

The present invention relates to a method for detecting a chromosomal abnormality using distance information between nucleic acid fragments and, more specifically, to a method that extracts nucleic acids from a biological sample, obtains sequence information, and then calculates the distances between reference values of the nucleic acid fragments. Unlike conventional approaches, which determine the amount of a chromosome from the read count, the method according to the present invention analyzes the distances between aligned nucleic acid fragments. Whereas the accuracy of conventional methods falls as the read count decreases, the present method can improve detection accuracy even when the read count decreases, and it remains accurate even when the distances between nucleic acid fragments are analyzed only in certain regions rather than in all chromosomal regions, which makes it useful.

Description

Method for detecting chromosomal abnormality using distance information between nucleic acid fragments

The present invention relates to a method for detecting a chromosomal abnormality using distance information between nucleic acid fragments and, more specifically, to a method that extracts nucleic acids from a biological sample, obtains sequence information, and then calculates the distances between reference values of the nucleic acid fragments.

Chromosomal abnormalities are associated with genetic defects and tumor diseases. A chromosomal abnormality may refer to the deletion or duplication of a chromosome, the deletion or duplication of part of a chromosome, or a break, translocation, or inversion within a chromosome. Chromosomal abnormalities are a disorder of genetic balance and cause fetal death, serious physical and mental defects, and tumor diseases. For example, Down's syndrome is a common form of chromosome number abnormality caused by the presence of three copies of chromosome 21 (trisomy 21). Edwards syndrome (trisomy 18), Patau syndrome (trisomy 13), Turner syndrome (XO), and Klinefelter syndrome (XXY) are also chromosome number abnormalities. Chromosomal abnormalities are also found in tumor patients. For example, duplications of the 4q, 11q, and 22q regions and deletion of the 13q region have been identified in liver cancer patients (liver adenomas and adenocarcinomas), and duplications of the 2p, 2q, 6p, and 11q regions and deletions of the 6q, 8p, 9p, and chromosome 21 regions have been identified in pancreatic cancer patients. These regions are associated with tumor-related oncogene and tumor suppressor gene regions.

Chromosomal abnormalities can be detected by karyotyping and FISH (Fluorescent In Situ Hybridization). These detection methods are disadvantageous in terms of time, effort, and accuracy. DNA microarrays can also be used to detect chromosomal abnormalities. In particular, with a genomic DNA microarray system, probes are easy to prepare and chromosomal abnormalities can be detected in intronic regions as well as in extended regions of a chromosome, but it is difficult to produce large numbers of DNA fragments whose chromosomal location and function have been confirmed.

Recently, next-generation sequencing technology has been used to analyze chromosome number abnormalities (Park, H., Kim et al., Nat Genet 2010, 42, 400-405; Kidd, J. M. et al., Nature 2008, 453, 56-64). However, this technology requires high read coverage for the analysis of chromosome number abnormalities, and CNV measurements also require independent validation. Because the cost was therefore very high and the results were difficult to interpret, it was not suitable as a general genetic screening assay at that time.

Real-time qPCR is currently used as a state-of-the-art technique for quantitative genetic analysis because it offers a wide dynamic range (Weaver, S. et al., Methods 2010, 50, 271-276) and because a linear correlation between the threshold cycle and the initial target amount is observed reproducibly (Deepak, S. et al., Curr Genomics 2007, 8, 234-251). However, the sensitivity of qPCR analysis is not high enough to distinguish differences in copy number.

Meanwhile, existing prenatal tests for fetal chromosomal abnormalities include ultrasound examination, blood marker tests, amniocentesis, chorionic villus sampling, and percutaneous umbilical blood sampling (Mujezinovic F, et al. Obstet Gynecol. 2007, 110(3):687-94). Of these, ultrasound examination and blood marker tests are classified as screening tests, and amniotic fluid chromosome testing is classified as a confirmatory test. The noninvasive methods, ultrasound examination and blood marker tests, are safe because no sample is taken directly from the fetus, but their sensitivity is 80% or lower (ACOG Committee on Practice Bulletins. 2007). The invasive methods, amniocentesis, chorionic villus sampling, and percutaneous umbilical blood sampling, can confirm fetal chromosomal abnormalities but carry a risk of fetal loss caused by the invasive procedure.

In 1997, Lo et al. succeeded in analyzing Y-chromosome sequences of fetus-derived genetic material in maternal plasma and serum, which made it possible to use fetal genetic material present in the mother for prenatal testing (Lo YM, et al. Lancet. 1997, 350(9076):485-7). Fetal genetic material in maternal blood consists of portions of trophoblast cells that have undergone apoptosis during placental remodeling and entered the maternal blood through the material exchange mechanism; it actually originates from the placenta and is defined as cff DNA (cell-free fetal DNA).

cff DNA is found in maternal blood as early as day 18 after embryo transfer, and in most maternal blood by day 37. Because cff DNA consists of short strands of 300 bp or less and is present only in small amounts in maternal blood, massively parallel sequencing based on next-generation sequencing (NGS) is used to apply it to the detection of fetal chromosomal abnormalities. Noninvasive detection of fetal chromosomal abnormalities by massively parallel sequencing shows a detection sensitivity of 90-99% or higher depending on the chromosome, but false-positive and false-negative results amount to 1-10%, so a technique for correcting them is needed (Gil MM, et al. Ultrasound Obstet Gynecol. 2015, 45(3):249-66).

Accordingly, the present inventors made diligent efforts to solve the above problems and to develop a method for detecting chromosomal abnormalities with high sensitivity and accuracy. As a result, they confirmed that chromosomal abnormalities can be detected with high sensitivity and accuracy by grouping the nucleic acid fragments aligned to a chromosomal region, calculating the distances between the nucleic acid fragment reference values, and comparing them with those of a normal group, and thereby completed the present invention.

An object of the present invention is to provide a method for determining a chromosomal abnormality using distance information between nucleic acid fragments.

Another object of the present invention is to provide an apparatus for determining a chromosomal abnormality using distance information between nucleic acid fragments.

Still another object of the present invention is to provide a computer-readable storage medium comprising instructions configured to be executed by a processor that determines a chromosomal abnormality by the above method.

To achieve the above objects, the present invention provides a method for detecting a chromosomal abnormality by calculating the distances between reference values of nucleic acid fragments extracted from a biological sample.

The present invention also provides an apparatus for detecting a chromosomal abnormality, comprising: a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information; an alignment unit that aligns the decoded sequences to a reference genome database; and a chromosomal abnormality determination unit that measures, for selected nucleic acid fragments, the distances between the reference values of the aligned nucleic acid fragments to calculate FD values (Fragments Distance), calculates an FDI value (Fragments Distance Index) for the entire chromosomal region or for each specific genetic region based on the calculated FD values, and determines that a chromosomal abnormality is present when the FDI value is below or above the reference value range.

The present invention also provides a computer-readable storage medium comprising instructions configured to be executed by a processor that detects a chromosomal abnormality, wherein the chromosomal abnormality is detected through the steps of:

(A) extracting nucleic acids from a biological sample, obtaining nucleic acid fragments, and obtaining sequence information;

(B) aligning the nucleic acid fragments to a reference genome database based on the obtained sequence information (reads);

(C) measuring the distances between the reference values of selected nucleic acid fragments to calculate FD values (Fragments Distance); and

(D) calculating an FDI value (Fragments Distance Index) for the entire chromosomal region or for each specific genetic region based on the FD values calculated in step (C), and determining that a chromosomal abnormality is present when the FDI value is below or above a reference value or range.

The present invention also provides a method for detecting a chromosomal abnormality, comprising the steps of: (A) extracting nucleic acids from a biological sample and obtaining sequence information; (B) aligning the obtained sequence information (reads) to a reference genome database; (C) measuring, for the aligned sequence information (reads), the distances between aligned reads to calculate RD values (Read Distance); and (D) calculating an RDI value (Read Distance Index) for the entire chromosomal region or for each specific genetic region based on the RD values calculated in step (C), and determining that a chromosomal abnormality is present when the RDI value is below or above a reference value or range.
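As a rough illustration of step (C), the distance between aligned reads can be taken as the gap between consecutive read positions on the same chromosome. The sketch below assumes that reads have already been aligned and reduced to hypothetical (chromosome, start position) pairs; the function name `read_distances` and the use of the start coordinate as the measuring point are illustrative assumptions, not definitions taken from this specification.

```python
from collections import defaultdict


def read_distances(reads):
    """Return, per chromosome, the gaps between consecutive aligned reads.

    `reads` is an iterable of (chromosome, start_position) pairs.
    Using the start position as the measuring point is an assumption
    made for this sketch only.
    """
    positions_by_chrom = defaultdict(list)
    for chrom, start in reads:
        positions_by_chrom[chrom].append(start)

    distances = {}
    for chrom, positions in positions_by_chrom.items():
        positions.sort()
        # Distance between each read and the next read on the same chromosome.
        distances[chrom] = [b - a for a, b in zip(positions, positions[1:])]
    return distances


# Hypothetical aligned reads (toy coordinates, not real data).
reads = [
    ("chr21", 100), ("chr21", 450), ("chr21", 250),
    ("chr1", 1000), ("chr1", 1900),
]
rd = read_distances(reads)
```

In this toy input the three reads on `chr21` yield two distances and the two reads on `chr1` yield one; a real run would start from the coordinates reported by the aligner.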

The present invention also provides an apparatus for detecting a chromosomal abnormality, comprising: a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information; an alignment unit that aligns the decoded sequences to a reference genome database; and a chromosomal abnormality determination unit that measures, for selected sequence information (reads), the distances between aligned reads to calculate RD values (Read Distance), calculates an RDI value (Read Distance Index) for the entire chromosomal region or for each specific genetic region based on the calculated RD values, and determines that a chromosomal abnormality is present when the RDI value is below or above the reference value range.

The present invention also provides a computer-readable storage medium comprising instructions configured to be executed by a processor that detects a chromosomal abnormality, wherein the chromosomal abnormality is detected through the steps of: (A) extracting nucleic acids from a biological sample and obtaining sequence information; (B) aligning the obtained sequence information (reads) to a reference genome database; (C) measuring, for selected sequence information (reads), the distances between aligned reads to calculate RD values (Read Distance); and (D) calculating an RDI value (Read Distance Index) for the entire chromosomal region or for each specific genetic region based on the RD values calculated in step (C), and determining that a chromosomal abnormality is present when the RDI value is below or above a reference value or range.

Unlike conventional approaches, which determine the amount of a chromosome from the read count, the method for determining a chromosomal abnormality according to the present invention groups the aligned nucleic acid fragments and then uses the distances between the nucleic acid fragment reference values. Whereas the accuracy of conventional methods falls as the read count decreases, the present method can improve detection accuracy even when the read count decreases, and it remains accurate even when the distances between nucleic acid fragments are analyzed only in certain regions rather than in all chromosomal regions, which makes it useful.

FIG. 1 is an overall flowchart for determining a chromosomal abnormality based on FD values according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram showing how the FD value of the present invention is calculated from reads produced by single-end sequencing.
FIG. 3 is a conceptual diagram showing how the FD value of the present invention is calculated from reads produced by paired-end sequencing.
FIG. 4 is a conceptual diagram of a method of correcting the FD value using position information outside the reads in the present invention.
FIG. 5 is a graph measuring, in an embodiment of the present invention, the difference between the FD values calculated with and without position information outside the reads, based on read data produced by paired-end sequencing.
FIG. 6 is an overall flowchart for determining a chromosomal abnormality based on RD values according to an embodiment of the present invention.
FIG. 7 schematically illustrates the concept of the Read Distance calculated in the RD-based method according to an embodiment of the present invention. The reads used to calculate the Read Distance may be used regardless of their alignment direction (FIG. 7A) or with their alignment direction taken into account (positive direction: FIG. 7B; negative direction: FIG. 7C).
FIG. 8 plots the distribution of the read count and RepRD of the X chromosome in the RD-based method according to an embodiment of the present invention, confirming that the relationship between the two values is nonlinear rather than linear.
FIG. 9 plots the distribution of the read count and RepRD for each chromosome in the RD-based method according to an embodiment of the present invention, where (A) shows normal chromosomes, (B) trisomy of chromosome 21, (C) trisomy of chromosome 18, and (D) trisomy of chromosome 13.
FIG. 10 shows the relationship between the RDI value calculated by the RD-based method according to an embodiment of the present invention and the fetal fraction (FIG. 10A), the gestational age in weeks (FIG. 10B), and the G-score (Korean Patent No. 10-1686146; FIG. 10C).
FIG. 11 shows the results of ROC analysis for the normal group and for samples confirmed to have each chromosomal aneuploidy in the RD-based method according to an embodiment of the present invention.
FIG. 12 shows the accuracy as a function of the number of reads in the RD-based method according to an embodiment of the present invention; the X axis is the number of reads and the Y axis is the AUC.
FIG. 13 shows the relationship between the RD-based method according to an embodiment of the present invention, the number of reads, and chromosomal abnormalities.
FIG. 14 compares the results of the RD-based method according to an embodiment of the present invention with microarray analysis results.
FIG. 15 shows the distribution of RDI values for normal subjects and chromosome 21 aneuploidy samples in the RD-based method according to an embodiment of the present invention when RepRD is set to the reciprocal of the median.
FIG. 16 shows the distribution of RDI values for normal subjects and chromosome 21 aneuploidy samples in the RD-based method according to an embodiment of the present invention when RepRD is set to the mean.
FIG. 17 shows the distribution of RDI values for normal subjects and chromosome 21 aneuploidy samples in the RD-based method according to an embodiment of the present invention when RepRD is set to the reciprocal of the mean.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the art to which the present invention belongs. In general, the nomenclature used herein and the experimental methods described below are well known and commonly used in the art.

In the present invention, the inventors sought to confirm that chromosomal abnormalities can be detected with high sensitivity and accuracy by aligning the sequence information (read) data obtained from a sample to a reference genome, grouping the aligned nucleic acid fragments, calculating the distances between the nucleic acid fragment reference values, and comparing the representative value for the chromosome to be analyzed between a normal population and the test subject.

The method for detecting a chromosomal abnormality according to the present invention can be used not only for fetal chromosomal abnormalities such as aneuploidy but also for tumor detection, that is, for the diagnosis of tumors or the prediction of prognosis.

That is, in one embodiment of the present invention, DNA extracted from blood was sequenced and aligned to a reference chromosome; the nucleic acid fragments were grouped into an all-fragments group, a forward group, and a reverse group; the distance between nucleic acid fragment reference values (fragment distance, FD) was calculated for each group; a representative value of these distances (RepFD) was derived for each genetic region; a RepFD ratio was calculated using a normalization factor; and an FDI (Fragment Distance Index) value for each group was derived by comparison with the RepFD ratio in a normal reference population. A method was thus developed that determines that the test subject has a chromosomal abnormality when the FDI values of all groups are below or above the reference value (FIG. 1).
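The chain from FD values to an FDI value can be sketched as follows. This passage does not fix the exact formulas, so the sketch makes three labeled assumptions: the representative value RepFD is taken as the median, the normalization factor is a hypothetical number supplied by the caller, and the comparison with the normal reference population is expressed as a z-score. All function names are illustrative.

```python
from statistics import mean, median, stdev


def rep_fd(fd_values):
    """Representative FD of a region (assumption: the median is used)."""
    return median(fd_values)


def rep_fd_ratio(region_rep_fd, normalization_factor):
    """RepFD of the region divided by a normalization factor.

    How the normalization factor is obtained is not specified here;
    the caller supplies a hypothetical value.
    """
    return region_rep_fd / normalization_factor


def fdi(sample_ratio, reference_ratios):
    """Compare a sample's RepFD ratio with a normal reference population.

    Assumption: the comparison is a z-score (deviation from the reference
    mean in units of the reference standard deviation).
    """
    return (sample_ratio - mean(reference_ratios)) / stdev(reference_ratios)


# Toy numbers for illustration only, not measured data.
fd_values = [100, 120, 80, 110, 90]            # FD values in the region of interest
ratio = rep_fd_ratio(rep_fd(fd_values), 125.0)
reference_ratios = [1.00, 1.02, 0.98, 1.01, 0.99]
score = fdi(ratio, reference_ratios)
```

With these toy numbers the sample's ratio lies far below the reference population, so the score is strongly negative; whether such a value counts as abnormal depends on the reference range chosen for the assay.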

Accordingly, in one aspect, the present invention relates to a method for detecting a chromosomal abnormality by calculating the distances between reference values of nucleic acid fragments extracted from a biological sample.

In the present invention, any fragment of a nucleic acid extracted from a biological sample may be used as the nucleic acid fragment without limitation; it may preferably be a fragment of a cell-free nucleic acid or of an intracellular nucleic acid, but is not limited thereto.

In the present invention, the nucleic acid fragments may be obtained by direct sequencing, by next-generation sequencing, or by sequencing after non-specific whole genome amplification.

In the present invention, any conventionally known technique may be used to sequence the nucleic acid fragments directly.

In the present invention, sequencing through non-specific whole genome amplification refers to any method in which nucleic acids are amplified using random primers and then sequenced.

In the present invention, the method of calculating the distances between nucleic acid fragment reference values using next-generation sequencing and determining from them whether a chromosomal abnormality is present may be performed by a method comprising the following steps, but is not limited thereto:

(B) identifying the positions of the nucleic acid fragments in a reference genome database based on the obtained sequence information (reads);

(C) grouping the sequence information (reads) into all sequences, forward sequences, and reverse sequences;

(D) defining a reference value for each nucleic acid fragment using the grouped sequence information, measuring the distances between the reference values, and calculating FD values (Fragments Distance) for each group; and

(E) calculating an FDI value (Fragments Distance Index) for each group, for the entire chromosomal region or for each specific region, based on the FD values calculated for each group in step (D), and determining that a chromosomal abnormality is present when all of the FDI values are below or above the reference value range.
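The grouping in step (C) and the all-groups decision rule in step (E) can be sketched as follows. The sketch assumes each fragment is given as a hypothetical (reference position, strand) pair with strand `"+"` or `"-"`, and that per-group FDI values have already been computed elsewhere; the cutoffs of -3 and 3 in the usage lines are placeholders, not values taken from this specification.

```python
def group_by_strand(fragments):
    """Split fragments into the all, forward, and reverse groups.

    `fragments` is an iterable of (reference_position, strand) pairs,
    where strand is "+" (forward) or "-" (reverse).
    """
    groups = {"all": [], "forward": [], "reverse": []}
    for position, strand in fragments:
        groups["all"].append(position)
        groups["forward" if strand == "+" else "reverse"].append(position)
    return {name: sorted(positions) for name, positions in groups.items()}


def fd_values(sorted_positions):
    """Distances between consecutive fragment reference positions."""
    return [b - a for a, b in zip(sorted_positions, sorted_positions[1:])]


def is_abnormal(fdi_by_group, lower, upper):
    """True only when every group's FDI lies outside [lower, upper]."""
    return all(value < lower or value > upper for value in fdi_by_group.values())


# Toy fragments (hypothetical positions and strands).
fragments = [(100, "+"), (160, "-"), (220, "+"), (400, "-")]
groups = group_by_strand(fragments)
fd_by_group = {name: fd_values(positions) for name, positions in groups.items()}

# Hypothetical per-group FDI values and placeholder cutoffs.
flagged = is_abnormal({"all": -4.2, "forward": -3.8, "reverse": -5.0}, -3.0, 3.0)
not_flagged = is_abnormal({"all": -4.2, "forward": -1.0, "reverse": -5.0}, -3.0, 3.0)
```

Requiring every group to fall outside the range mirrors the wording of step (E); a single group inside the range, as in the second call, is enough to withhold the abnormal call.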

In the present invention, the term "chromosomal abnormality" refers to the various variations that occur in chromosomes, which can be broadly classified into numerical abnormalities, structural abnormalities, microdeletions, chromosomal instability, and the like.

A chromosome number abnormality is a case in which the number of chromosomes is abnormal, and may include all cases in which the total number of chromosomes deviates from 46 in 23 pairs, for example Down syndrome (one extra chromosome 21, for a total of 47 chromosomes), Turner syndrome (a single X chromosome, for a total of 45 chromosomes), and Klinefelter syndrome (chromosome complements such as XXYY, XXXY, and XXXXY).

A structural abnormality refers to any case in which the number of chromosomes is unchanged but the structure of a chromosome is altered, such as deletion, duplication, inversion, translocation, fusion, and microsatellite instability (MSI-H). Examples include partial deletion of chromosome 5 (cri-du-chat syndrome), partial deletion of chromosome 7 (Williams syndrome), and partial duplication of chromosome 12 (Wolf-Hirschhorn syndrome). Chromosomal structural abnormalities identified in tumor patients include translocation between chromosomes 9 and 22 (chronic myelogenous leukemia); duplication of the 4q, 11q, and 22q regions and deletion of the 13q region (liver cancer); duplication of the 2p, 2q, 6p, and 11q regions and deletion of the 6q, 8p, 9p, and chromosome 21 regions (pancreatic cancer); TMPRSS2-ERG gene fusion (prostate cancer); and microsatellite instability across the chromosomes (colorectal cancer). These regions are associated with tumor-related oncogene and tumor suppressor gene regions, but are not limited to those described above.

In the present invention, step (A) may be characterized by comprising the following steps:

(A-i) obtaining nucleic acids from blood, semen, vaginal cells, hair, saliva, urine, oral cells, amniotic fluid containing placental cells or fetal cells, tissue cells, or a mixture thereof;

(A-ii) removing proteins, fats, and other residues from the collected nucleic acids using a salting-out method, a column chromatography method, or a beads method, to obtain purified nucleic acids;

(A-iii) preparing a single-end sequencing or paired-end sequencing library from the purified nucleic acids, or from nucleic acids randomly fragmented by enzymatic digestion, pulverization, or a hydroshear method;

(A-iv) loading the prepared library onto a next-generation sequencer; and

(A-v) acquiring sequence information (reads) of the nucleic acids from the next-generation sequencer.

In the present invention, any sequencing method known in the art may be used for the next-generation sequencer. Sequencing of nucleic acids isolated by selection methods is typically carried out using next-generation sequencing (NGS). Next-generation sequencing includes any sequencing method that determines the nucleotide sequence of either individual nucleic acid molecules or clonally expanded proxies for individual nucleic acid molecules in a highly parallel fashion (e.g., 10⁵ or more molecules are sequenced simultaneously). In one embodiment, the relative abundance of a nucleic acid species in the library can be estimated by counting the relative number of occurrences of its cognate sequences in the data generated by the sequencing experiment. Next-generation sequencing methods are known in the art and are described, for example, in Metzker, M. (2010) Nature Biotechnology Reviews 11:31-46, which is incorporated herein by reference.
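The abundance estimate mentioned above amounts to counting how often each species appears among the reads and dividing by the total. The following is a minimal sketch, assuming each read has already been assigned a label (here, hypothetical chromosome names); the function name and input data are illustrative only.

```python
from collections import Counter

def relative_abundance(read_labels):
    """Fraction of reads assigned to each label (e.g., a chromosome)."""
    counts = Counter(read_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical per-read assignments after alignment.
labels = ["chr1", "chr1", "chr21", "chr1", "chr21", "chrX", "chr1", "chr1"]
abundance = relative_abundance(labels)
```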

In one embodiment, next-generation sequencing determines the nucleotide sequence of individual nucleic acid molecules (e.g., the HeliScope Gene Sequencing system from Helicos BioSciences and the PacBio RS system from Pacific Biosciences). In other embodiments, the sequencing method determines the nucleotide sequence of clonally expanded proxies for individual nucleic acid molecules (e.g., the Solexa sequencer from Illumina Inc., San Diego, California; 454 Life Sciences, Branford, Connecticut; and Ion Torrent), for example by massively parallel short-read sequencing (e.g., the Solexa sequencer from Illumina Inc., San Diego, California), which generates more bases of sequence per sequencing unit than other sequencing methods that generate fewer but longer reads. Other methods or machines for next-generation sequencing include, but are not limited to, those provided by 454 Life Sciences (Branford, Connecticut), Applied Biosystems (Foster City, California; SOLiD sequencer), Helicos BioSciences Corporation (Cambridge, Massachusetts), and emulsion and microfluidic sequencing technology using nanodroplets (e.g., GnuBio droplets).

Platforms for next-generation sequencing include, but are not limited to, the Roche/454 Genome Sequencer (GS) FLX System, the Illumina/Solexa Genome Analyzer (GA), the Life/APG Support Oligonucleotide Ligation Detection (SOLiD) system, the Polonator G.007 system, the Helicos BioSciences HeliScope Gene Sequencing system, the Pacific Biosciences PacBio RS system, and DNBseq from MGI.

NGS technologies can include, for example, one or more of the steps of template preparation, sequencing and imaging, and data analysis.

Template preparation. Methods for template preparation can include steps such as randomly breaking nucleic acids (e.g., genomic DNA or cDNA) into smaller sizes and generating sequencing templates (e.g., fragment templates or mate-pair templates). The spatially separated templates can be attached or immobilized to a solid surface or support, which allows a massive number of sequencing reactions to be performed simultaneously. The types of templates that can be used for NGS reactions include, for example, clonally amplified templates originating from single DNA molecules, and single DNA molecule templates.

Methods for preparing clonally amplified templates include, for example, emulsion PCR (emPCR) and solid-phase amplification.

emPCR can be used to prepare templates for NGS. Typically, a library of nucleic acid fragments is generated, and adapters containing universal priming sites are ligated to the ends of the fragments. The fragments are then denatured into single strands and captured by beads. Each bead captures a single nucleic acid molecule. After amplification and enrichment of the emPCR beads, a large amount of template can be immobilized in a polyacrylamide gel on a standard microscope slide (e.g., Polonator), chemically crosslinked to an amino-coated glass surface (e.g., Life/APG; Polonator), or deposited into individual PicoTiterPlate (PTP) wells (e.g., Roche/454), in which the NGS reaction can be performed.

Solid-phase amplification can also be used to produce templates for NGS. Typically, forward and reverse primers are covalently attached to a solid support. The surface density of the amplified fragments is defined by the ratio of the primers to the template on the support. Solid-phase amplification can produce millions of spatially separated template clusters (e.g., Illumina/Solexa). The ends of the template clusters can be hybridized to universal primers for NGS reactions.

Other methods for preparing clonally amplified templates include, for example, multiple displacement amplification (MDA) (Lasken R. S. Curr Opin Microbiol. 2007; 10(5):510-6). MDA is a non-PCR-based DNA amplification technique. The reaction involves annealing random hexamer primers to the template and synthesizing DNA with a high-fidelity enzyme, typically Φ29 polymerase, at a constant temperature. MDA can generate large-sized products with a lower error frequency.

Template amplification methods such as PCR can be coupled with NGS platforms to target or enrich specific regions of the genome (e.g., exons). Representative template enrichment methods include, for example, microdroplet PCR technology (Tewhey R. et al., Nature Biotech. 2009, 27:1025-1031), custom-designed oligonucleotide microarrays (e.g., Roche/NimbleGen oligonucleotide microarrays), solution-based hybridization methods (e.g., molecular inversion probes (MIPs)) (Porreca G. J. et al., Nature Methods, 2007, 4:931-936; Krishnakumar S. et al., Proc. Natl. Acad. Sci. USA, 2008, 105:9296-9310; Turner E. H. et al., Nature Methods, 2009, 6:315-316), and biotinylated RNA capture sequences (Gnirke A. et al., Nat. Biotechnol. 2009;27(2):182-9).

Single-molecule templates are another type of template that can be used for NGS reactions. Spatially separated single-molecule templates can be immobilized on solid supports by various methods. In one approach, individual primer molecules are covalently attached to the solid support. Adapters are added to the templates, and the templates are then hybridized to the immobilized primers. In another approach, single-molecule templates are covalently attached to the solid support by priming and extending single-stranded, single-molecule templates from immobilized primers. Universal primers are then hybridized to the templates. In yet another approach, single polymerase molecules are attached to the solid support, to which primed templates are bound.

Sequencing and imaging. Representative sequencing and imaging methods for NGS include, but are not limited to, cyclic reversible termination (CRT), sequencing by ligation (SBL), single-molecule addition (pyrosequencing), and real-time sequencing.

CRT uses reversible terminators in a cyclic method that minimally includes the steps of nucleotide incorporation, fluorescence imaging, and cleavage. Typically, a DNA polymerase incorporates into the primer a single fluorescently modified nucleotide that is complementary to the template base. DNA synthesis is terminated after the addition of the single nucleotide, and the unincorporated nucleotides are washed away. Imaging is performed to determine the identity of the incorporated labeled nucleotide. Then, in the cleavage step, the terminating/inhibiting group and the fluorescent dye are removed. Representative NGS platforms using the CRT method include, but are not limited to, the Illumina/Solexa Genome Analyzer (GA), which uses the clonally amplified template method coupled with the four-color CRT method detected by total internal reflection fluorescence (TIRF); and Helicos BioSciences/HeliScope, which uses the single-molecule template method coupled with the one-color CRT method detected by TIRF.

SBL uses DNA ligase and either one-base-encoded probes or two-base-encoded probes for sequencing.

Typically, a fluorescently labeled probe is hybridized to its complementary sequence adjacent to the primed template. DNA ligase is used to ligate the dye-labeled probe to the primer. After the non-ligated probes are washed away, fluorescence imaging is performed to determine the identity of the ligated probe. The fluorescent dye can be removed by using cleavable probes that regenerate a 5'-PO4 group for subsequent ligation cycles. Alternatively, a new primer can be hybridized to the template after the old primer is removed. Representative SBL platforms include, but are not limited to, Life/APG/SOLiD (Support Oligonucleotide Ligation Detection), which uses two-base-encoded probes.
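Two-base encoding can be illustrated with a small sketch. The text only states that two-base-encoded probes are used; the sketch assumes the commonly described color-space scheme in which each of four colors labels a pair of adjacent bases and equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The function names are hypothetical.

```python
# 2-bit base codes; the color of a dinucleotide is the XOR of its two codes
# (assumed color-space convention, not taken from the text above).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(seq):
    """Return the first base and the list of colors for adjacent base pairs."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)

first, colors = encode("ATGGCA")
restored = decode(first, colors)
```

Because every base is covered by two overlapping colors, a single miscalled color changes all downstream decoded bases, which is why decoding is usually done against a reference rather than read by read.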

The pyrosequencing method is based on detecting the activity of DNA polymerase with another chemiluminescent enzyme. Typically, the method sequences a single strand of DNA by synthesizing the complementary strand one base pair at a time and detecting which base was actually added at each step. The template DNA is immobile, and solutions of A, C, G, and T nucleotides are sequentially added to and removed from the reaction. Light is produced only when the nucleotide solution complements the first unpaired base of the template. The sequence of solutions that produce chemiluminescent signals allows the sequence of the template to be determined. Representative pyrosequencing platforms include, but are not limited to, Roche/454, which uses DNA templates prepared by emPCR with one to two million beads deposited into PTP wells.
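The flow-by-flow logic described above can be simulated in a few lines. This is a simplified sketch under stated assumptions: the flow order is fixed as A, C, G, T, the light signal is modeled as the integer number of bases incorporated in a flow (so a homopolymer run gives a proportionally larger signal), and strand orientation and signal noise are ignored. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def flowgram(template, flow_order="ACGT", max_flows=100):
    """Simulate pyrosequencing flows over a single-stranded template.

    Returns a list of (nucleotide, signal) pairs, where signal is the number
    of bases incorporated during that flow (0 means no light).
    """
    signals = []
    pos = 0
    for flow in range(max_flows):
        if pos >= len(template):
            break
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # The polymerase extends as long as the flowed nucleotide pairs
        # with the next unpaired template base.
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            incorporated += 1
            pos += 1
        signals.append((nucleotide, incorporated))
    return signals

def call_bases(signals):
    """Read the synthesized strand back from the flow signals."""
    return "".join(nucleotide * n for nucleotide, n in signals)

signals = flowgram("TTGCA")
synthesized = call_bases(signals)
```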

Real-time sequencing involves imaging the continuous incorporation of dye-labeled nucleotides during DNA synthesis. Representative real-time sequencing platforms include, but are not limited to, the Pacific Biosciences platform, which uses DNA polymerase molecules attached to the surface of individual zero-mode waveguide (ZMW) detectors to obtain sequence information while phosphate-linked nucleotides are incorporated into the growing primer strand; the Life/VisiGen platform, which uses an engineered DNA polymerase with an attached fluorescent dye to generate an enhanced signal after nucleotide incorporation by fluorescence resonance energy transfer (FRET); and the LI-COR Biosciences platform, which uses dye-quencher nucleotides in the sequencing reaction.

Other sequencing methods for NGS include, but are not limited to, nanopore sequencing, sequencing by hybridization, nano-transistor array-based sequencing, polony sequencing, scanning tunneling microscopy (STM)-based sequencing, and nanowire-molecule sensor-based sequencing.

Nanopore sequencing involves electrophoresis of nucleic acid molecules in solution through a nanoscale pore, which provides a highly confined space within which single nucleic acid polymers can be analyzed. Representative methods of nanopore sequencing are described, for example, in Branton D. et al., Nat Biotechnol. 2008; 26(10):1146-53.

Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray. Typically, a single pool of DNA is fluorescently labeled and hybridized to an array containing known sequences. Hybridization signals from a given spot on the array can identify the DNA sequence. The binding of one strand of DNA to its complementary strand in the DNA double helix is sensitive to even single-base mismatches when the hybrid region is short or when specialized mismatch detection proteins are present. Representative methods of sequencing by hybridization are described, for example, in Hanna G.J. et al., J. Clin. Microbiol. 2000; 38(7): 2715-21; and Edwards J.R. et al., Mut. Res. 2005; 573(1-2): 3-12.

Polony sequencing is based on polony amplification followed by sequencing via multiple single-base extensions (FISSEQ). Polony amplification is a method of amplifying DNA in situ on a polyacrylamide film. Representative polony sequencing methods are described, for example, in US Patent Application Publication No. 2007/0087362.

Nano-transistor array-based devices, such as the carbon nanotube field effect transistor (CNTFET), can also be used for NGS. For example, DNA molecules are stretched and driven over nanotubes by micro-fabricated electrodes. The DNA molecules sequentially come into contact with the carbon nanotube surface, and charge transfer between the DNA molecule and the nanotube produces a difference in current flow for each base. The DNA is sequenced by recording these differences. Representative nano-transistor array-based sequencing methods are described, for example, in US Patent Publication No. 2006/0246497.

Scanning tunneling microscopy (STM) can also be used for NGS. STM uses a piezoelectrically controlled probe that performs a raster scan of a specimen to form images of its surface. STM can be used to image the physical properties of single DNA molecules, for example by integrating a scanning tunneling microscope with an actuator-driven flexible gap to produce coherent electron tunneling imaging and spectroscopy. Representative sequencing methods using STM are described, for example, in US Patent Application Publication No. 2007/0194225.

A molecular-analysis device composed of a nanowire-molecule sensor can also be used for NGS. Such a device can detect the interactions between a nitrogenous material disposed on the nanowires and nucleic acid molecules such as DNA. A molecule guide is arranged to guide a molecule near the molecule sensor, allowing interaction and subsequent detection. Representative sequencing methods using nanowire-molecule sensors are described, for example, in US Patent Application Publication No. 2006/0275779.

Double-ended sequencing methods can be used for NGS. Double-ended sequencing uses blocked and unblocked primers to sequence both the sense and antisense strands of DNA. Typically, these methods include the steps of annealing an unblocked primer to a first strand of the nucleic acid; annealing a second, blocked primer to a second strand of the nucleic acid; extending the nucleic acid along the first strand with a polymerase; terminating the first sequencing primer; deblocking the second primer; and extending the nucleic acid along the second strand. Representative double-ended sequencing methods are described, for example, in US Patent No. 7,244,567.

Data analysis. After NGS reads have been generated, they are aligned to a known reference sequence or assembled de novo.

For example, identifying genetic alterations such as single-nucleotide polymorphisms and structural variants in a sample (e.g., a tumor sample) can be accomplished by aligning NGS reads to a reference sequence (e.g., a wild-type sequence). Sequence alignment methods for NGS are described, for example, in Trapnell C. and Salzberg S.L. Nature Biotech., 2009, 27:455-457.

Examples of de novo assembly are described, for example, in Warren R. et al., Bioinformatics, 2007, 23:500-501; Butler J. et al., Genome Res., 2008, 18:810-820; and Zerbino D.R. and Birney E., Genome Res., 2008, 18:821-829.

Sequence alignment or assembly can be performed using read data from one or more NGS platforms, for example by combining Roche/454 and Illumina/Solexa read data. In the present invention, the alignment step may be performed using, but is not limited to, the BWA algorithm and the hg19 reference sequence.

In the present invention, the step of identifying the positions of the nucleic acid fragments in step (B) is preferably performed through sequence alignment. Sequence alignment here includes any computational method or approach, implemented as a computer algorithm, that identifies where in the genome a read sequence (e.g., a short-read sequence from next-generation sequencing) most likely originated by assessing the similarity between the read sequence and a reference sequence. A variety of algorithms can be applied to the sequence alignment problem. Some algorithms are relatively slow but allow relatively high specificity. These include, for example, dynamic programming-based algorithms. Dynamic programming is a method for solving complex problems by breaking them down into simpler steps. Other approaches are relatively more efficient but are typically not as thorough. These include, for example, heuristic algorithms and probabilistic methods designed for large-scale database searches.

Typically, the alignment process has two steps: candidate screening and sequence alignment. Candidate screening reduces the search space for sequence alignment from the entire genome to a shorter list of possible alignment locations. Sequence alignment, as the term suggests, aligns a sequence with the sequences provided by the candidate screening step. This can be performed using a global alignment (e.g., Needleman-Wunsch alignment) or a local alignment (e.g., Smith-Waterman alignment).
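The difference between global and local dynamic-programming alignment can be shown with a score-only sketch. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration, and neither function performs traceback; production aligners such as BWA use far more elaborate schemes.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score: best-scoring pair of substrings."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are floored at zero so an alignment can restart anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

global_score = needleman_wunsch("ACGT", "AGT")
local_score = smith_waterman("TTACGTTT", "GGACGTCC")
```

Both functions fill one table cell per pair of positions, which is the "simpler steps" decomposition the text refers to; the cost grows with the product of the two sequence lengths, which is why candidate screening is applied first.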

Most fast alignment algorithms can be characterized as one of three types based on the indexing method: algorithms based on hash tables (e.g., BLAST, ELAND, SOAP), suffix trees (e.g., Bowtie, BWA), and merge sorting (e.g., Slider). Short read sequences are typically used for alignment. Examples of sequence alignment algorithms/programs for short-read sequences include, but are not limited to, BFAST (Homer N. et al., PLoS One. 2009;4(11):e7767), BLASTN (at blast.ncbi.nlm.nih.gov on the World Wide Web), BLAT (Kent W.J. Genome Res. 2002;12(4):656-64), Bowtie (Langmead B. et al., Genome Biol. 2009;10(3):R25), BWA (Li H. and Durbin R. Bioinformatics, 2009, 25:1754-60), BWA-SW (Li H. and Durbin R. Bioinformatics, 2010;26(5):589-95), CloudBurst (Schatz M.C. Bioinformatics. 2009;25(11):1363-9), Corona Lite (Applied Biosystems, Carlsbad, California, USA), CASHX (Fahlgren N. et al., RNA, 2009; 15, 992-1002), CUDA-EC (Shi H. et al., J Comput Biol. 2010;17(4):603-15), ELAND (at bioit.dbi.udel.edu/howto/eland on the World Wide Web), GNUMAP (Clement N.L. et al., Bioinformatics. 2010;26(1):38-45), GMAP (Wu T.D. and Watanabe C.K. Bioinformatics. 2005;21(9):1859-75), GSNAP (Wu T.D. and Nacu S., Bioinformatics. 2010;26(7):873-81), Geneious Assembler (Biomatters Ltd., Auckland, New Zealand), LAST, MAQ (Li H. et al., Genome Res. 2008;18(11):1851-8), Mega-BLAST (at ncbi.nlm.nih.gov/blast/megablast.shtml on the World Wide Web), MOM (Eaves H.L. and Gao Y. Bioinformatics. 2009;25(7):969-70), MOSAIK (at bioinformatics.bc.edu/marthlab/Mosaik on the World Wide Web), Novoalign (at novocraft.com/main/index.php on the World Wide Web), PALMapper (at fml.tuebingen.mpg.de/raetsch/suppl/palmapper on the World Wide Web), PASS (Campagna D. et al., Bioinformatics. 2009;25(7):967-8), PatMaN (Prufer K. et al., Bioinformatics. 2008; 24(13):1530-1), PerM (Chen Y. et al., Bioinformatics, 2009, 25 (19): 2514-2521), ProbeMatch (Kim Y.J. et al., Bioinformatics. 2009;25(11):1424-5), QPalma (de Bona F. et al., Bioinformatics, 2008, 24(16): i174), RazerS (Weese D. et al., Genome Research, 2009, 19:1646-1654), RMAP (Smith A.D. et al., Bioinformatics. 2009;25(21):2841-2), SeqMap (Jiang H. et al. Bioinformatics. 2008;24:2395-2396), Shrec (Salmela L., Bioinformatics. 2010;26(10):1284-90), SHRiMP (Rumble S.M. et al., PLoS Comput. Biol., 2009, 5(5):e1000386), SLIDER (Malhis N. et al., Bioinformatics, 2009, 25 (1): 6-13), SLIM Search (Muller T. et al., Bioinformatics. 2001;17 Suppl 1:S182-9), SOAP (Li R. et al., Bioinformatics. 2008;24(5):713-4), SOAP2 (Li R. et al., Bioinformatics. 2009;25(15):1966-7), SOCS (Ondov B.D. et al., Bioinformatics, 2008; 24(23):2776-7), SSAHA (Ning Z. et al., Genome Res. 2001;11(10):1725-9), SSAHA2 (Ning Z. et al., Genome Res. 2001;11(10):1725-9), Stampy (Lunter G. and Goodson M. Genome Res. 2010, epub ahead of print), Taipan (at taipan.sourceforge.net on the World Wide Web), UGENE (at ugene.unipro.ru on the World Wide Web), XpressAlign (at bcgsc.ca/platform/bioinfo/software/XpressAlign on the World Wide Web), and ZOOM (Bioinformatics Solutions Inc., Waterloo, Ontario, Canada).

A sequence alignment algorithm may be selected based on a number of factors including, for example, the sequencing technique, read length, number of reads, available computing resources, and sensitivity/scoring requirements. Different sequence alignment algorithms can achieve different levels of speed, alignment sensitivity, and alignment specificity. Alignment specificity refers to the percentage of aligned target sequence residues, as typically found in the submission, that are correctly aligned when compared with the predicted alignment. Alignment sensitivity refers to the percentage of aligned target sequence residues, as usually found in the predicted alignment, that are also correctly aligned in the submission.

Alignment algorithms such as ELAND or SOAP can be used to align short reads (e.g., from an Illumina/Solexa sequencer) to a reference genome when speed is the first factor to be considered. Alignment algorithms such as BLAST or Mega-BLAST can be used for similarity searches with short reads (e.g., from Roche FLX) when specificity is the most important factor, although these methods are relatively slower. Alignment algorithms such as MAQ or Novoalign take quality scores into account and can therefore be used for single-end or paired-end data when accuracy is essential (e.g., in high-throughput SNP searching). Alignment algorithms such as Bowtie or BWA use the Burrows-Wheeler Transform (BWT) and therefore require a relatively small memory footprint. Alignment algorithms such as BFAST, PerM, SHRiMP, SOCS or ZOOM map color-space reads and can therefore be used with ABI's SOLiD platform. In some applications, the results from two or more alignment algorithms may be combined.

In the present invention, the length of the sequence information (reads) in step (B) may be 5 to 5000 bp, and the number of reads used may be 5,000 to 5 million, but the invention is not limited thereto.

In the present invention, the grouping of the sequence information in step (C) may be performed based on the adapter sequences of the sequence information (reads). The FD value may be calculated separately for the sequence information sorted into nucleic acid fragments aligned in the forward direction and nucleic acid fragments aligned in the reverse direction, or it may be calculated for the entire group.

In the present invention, the method may further comprise, prior to performing step (C), a step of separately classifying the aligned nucleic acid fragments that satisfy a mapping quality score.

In the present invention, the mapping quality score may vary depending on the desired criterion, but is preferably 15 to 70, more preferably 50 to 70, and most preferably 60.
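
The classification by mapping quality described above can be sketched as a simple filter. This is a minimal illustration, not the applicant's implementation: the `AlignedFragment` record, its field names, and the reading of "satisfying" the score as "greater than or equal to the cutoff" are assumptions, and the example records are made up.

```python
from typing import NamedTuple


class AlignedFragment(NamedTuple):
    """Hypothetical record for one aligned fragment (field names are assumptions)."""
    chrom: str
    start: int   # leftmost aligned position
    end: int     # rightmost aligned position
    strand: str  # "+" for forward, "-" for reverse
    mapq: int    # mapping quality score reported by the aligner


def filter_by_mapq(fragments, min_mapq=60):
    """Keep only fragments whose mapping quality meets the cutoff (assumed: >=)."""
    return [f for f in fragments if f.mapq >= min_mapq]


# Made-up example data: the second fragment falls below the cutoff of 60.
fragments = [
    AlignedFragment("chr1", 4183, 4232, "+", 60),
    AlignedFragment("chr1", 4232, 4281, "-", 12),
    AlignedFragment("chr1", 4123, 4172, "+", 60),
]
kept = filter_by_mapq(fragments)
```

The default cutoff of 60 mirrors the most preferred value stated in the text; any value in the stated 15 to 70 range could be passed instead.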

In the present invention, the FD value of step (D) may be defined, for the n nucleic acid fragments obtained, as the distance between the reference value of the i-th nucleic acid fragment and the reference value of any one or more nucleic acid fragments selected from the (i+1)-th to n-th nucleic acid fragments.

In the present invention, for the n nucleic acid fragments obtained, the distance between the reference value of the first nucleic acid fragment and the reference value of any one or more nucleic acid fragments selected from the group consisting of the second to n-th nucleic acid fragments may be calculated, and one or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation and coefficient of variation of these distances, and/or one or more reciprocals thereof, as well as weighted calculation results and other statistics, may be used as the FD value, but the invention is not limited thereto.

In the present invention, the expression "one or more values and/or one or more reciprocals thereof" is to be interpreted as meaning that one of the numerical values listed above, or a combination of two or more of them, may be used.

In the present invention, the "reference value of a nucleic acid fragment" may be a value obtained by adding an arbitrary value to, or subtracting an arbitrary value from, the median value of the nucleic acid fragment.

For the n nucleic acid fragments obtained, the FD value can be defined as follows.

FD = Dist(Ri~Rj) (1<i<j<n)

Here, the Dist function calculates one or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation and coefficient of variation of the differences between the alignment position values of all nucleic acid fragments included between the two selected nucleic acid fragments Ri and Rj, and/or one or more reciprocals thereof, as well as weighted calculation results and other statistics.
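
One possible reading of the Dist function can be sketched as follows. This is a sketch under stated assumptions, not the patented procedure: it assumes that the "differences between the alignment position values" are the gaps between consecutive fragments (ordered by reference position) from Ri to Rj, and that a single summary statistic is applied to those gaps. The function name `dist`, its parameters, and the example positions are hypothetical.

```python
import statistics


def dist(positions, i, j, stat=statistics.mean, reciprocal=False):
    """Summarise gaps between consecutive reference positions from fragment i to j.

    positions: reference values (e.g. fragment centers) on one chromosome.
    i, j: 0-based indices into the position-sorted list, inclusive, with i < j.
    stat: any summary function over a list of gaps (sum, mean, median, ...).
    """
    window = sorted(positions)[i:j + 1]
    gaps = [b - a for a, b in zip(window, window[1:])]
    value = stat(gaps)
    return 1 / value if reciprocal else value


# Made-up reference values for four fragments; gaps are 60, 49 and 143.
centers = [4148, 4208, 4257, 4400]
total = dist(centers, 0, 3, stat=sum)                   # 252
average = dist(centers, 0, 3)                           # 84
middle = dist(centers, 0, 3, stat=statistics.median)    # 60
inverse = dist(centers, 0, 3, reciprocal=True)          # 1 / 84
```

Passing the statistic as a parameter reflects the text's open-ended list of admissible summaries; weighted variants would need an additional argument not shown here.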

That is, in the present invention, the FD value (Fragment Distance value) means the distance between aligned nucleic acid fragments. The number of ways of selecting nucleic acid fragments for the distance calculation can be defined as follows: when a total of N nucleic acid fragments are present, [expression not reproduced in the source text] combinations of distances between nucleic acid fragments are possible. For example, when i is 1, i+1 is 2, so a distance can be defined between the first nucleic acid fragment and any one or more nucleic acid fragments selected from the 2nd to n-th nucleic acid fragments.
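
As a point of reference, if each distance is taken between exactly two fragments R_i and R_j with i < j, the number of distinct pairs among N fragments follows from elementary counting. This is an assumption about how the fragments are paired, not a quotation of the expression used in the specification:

```latex
\[
  \#\{(i, j) : 1 \le i < j \le N\} \;=\; \binom{N}{2} \;=\; \frac{N(N-1)}{2}
\]
```

For instance, under this assumption N = 3 fragments would give 3 pairwise distances.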

In the present invention, the FD value may be obtained by calculating the distance between a specific position within the i-th nucleic acid fragment and a specific position within any one or more of the (i+1)-th to n-th nucleic acid fragments.

For example, if a nucleic acid fragment is 50 bp long and is aligned at position 4,183 of chromosome 1, the genomic position values that can be used for the distance calculation of this fragment are 4,183 to 4,232 of chromosome 1.

If an adjacent 50 bp nucleic acid fragment is aligned at position 4,232 of chromosome 1, the genomic position values that can be used for the distance calculation of this fragment are 4,232 to 4,281 of chromosome 1, and the FD value between the two nucleic acid fragments can range from 1 to 99.

If yet another adjacent 50 bp nucleic acid fragment is aligned at position 4,123 of chromosome 1, the genomic position values that can be used for the distance calculation of this fragment are 4,123 to 4,172 of chromosome 1. The FD value between this fragment and the second fragment ranges from 61 to 159, and the FD value between this fragment and the first example fragment ranges from 12 to 110. One or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation and coefficient of variation of one value from each of these two FD ranges, and/or one or more reciprocals thereof, as well as weighted calculation results and other statistics, may be used as the FD value; preferably, the FD value is the reciprocal of one value from the two FD ranges, but the invention is not limited thereto.
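
The ranges in the three-fragment example can be reproduced with a short calculation. The sketch below assumes that distances are counted inclusively (difference in positions plus one); this convention is not stated in the text but is the one that yields the quoted ranges. The function name `fd_range` is hypothetical.

```python
def fd_range(start_a, start_b, length=50):
    """Return (smallest, largest) FD between two fragments of equal length.

    Assumption: distances are counted inclusively (position difference + 1),
    which is the convention that reproduces the ranges quoted in the text.
    """
    lo_a, hi_a = start_a, start_a + length - 1
    lo_b, hi_b = start_b, start_b + length - 1
    if lo_a > lo_b:  # order the two fragments left to right
        (lo_a, hi_a), (lo_b, hi_b) = (lo_b, hi_b), (lo_a, hi_a)
    nearest = max(0, lo_b - hi_a)  # 0 when the fragments touch or overlap
    farthest = hi_b - lo_a
    return nearest + 1, farthest + 1


first, second, third = 4183, 4232, 4123  # alignment start positions on chromosome 1

print(fd_range(first, second))  # (1, 99)
print(fd_range(third, second))  # (61, 159)
print(fd_range(third, first))   # (12, 110)
```

Any single value from such a range, or its reciprocal, could then serve as the FD value in the sense described above.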

Preferably, in the present invention, the FD value may be based on a value obtained by adding an arbitrary value to, or subtracting an arbitrary value from, the median value of the nucleic acid fragment.

In the present invention, the median of the FD values means the value located at the center when the calculated FD values are arranged in order of magnitude. For example, when there are three values such as 1, 2 and 100, the median is 2 because 2 lies in the middle. If there is an even number of FD values, the median is the average of the two middle values. For example, if the FD values are 1, 10, 90 and 200, the median is 50, the average of 10 and 90.

In the present invention, the arbitrary value may be any value that allows the position of the nucleic acid fragment to be indicated, but is preferably 0 to 5 kbp or 0 to 300% of the nucleic acid fragment length, 0 to 3 kbp or 0 to 200% of the nucleic acid fragment length, or 0 to 1 kbp or 0 to 100% of the nucleic acid fragment length, and more preferably 0 to 500 bp or 0 to 50% of the nucleic acid fragment length, but is not limited thereto.

In the present invention, in the case of paired-end sequencing, the FD value may be derived based on the position values of the forward and reverse sequence information (reads).

For example, in a pair of 50 bp paired-end reads, if the forward read is aligned at position 4183 of chromosome 1 and the reverse read is aligned at position 4349, the two ends of this nucleic acid fragment are 4183 and 4349, and the reference values that can be used for the fragment distance are 4183 to 4349. If, in another paired-end read pair adjacent to this fragment, the forward read is aligned at position 4349 of chromosome 1 and the reverse read at position 4515, the position values of that nucleic acid fragment are 4349 to 4515. The distance between these two nucleic acid fragments can be 0 to 333, and is most preferably 166, the distance between the median values (centers) of the two fragments.
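
The preferred center-to-center distance in the paired-end example can be checked with a few lines. This is an illustrative sketch; the function names are hypothetical, and the fragment center is taken as the midpoint of the two positions given in the text.

```python
def fragment_center(forward_pos, reverse_pos):
    """Midpoint of a fragment delimited by its forward and reverse read positions."""
    return (forward_pos + reverse_pos) / 2


def center_distance(frag_a, frag_b):
    """Distance between the centers of two fragments given as (forward, reverse)."""
    return abs(fragment_center(*frag_b) - fragment_center(*frag_a))


frag_a = (4183, 4349)  # first read pair from the example
frag_b = (4349, 4515)  # adjacent read pair from the example

print(fragment_center(*frag_a))         # 4266.0
print(fragment_center(*frag_b))         # 4432.0
print(center_distance(frag_a, frag_b))  # 166.0
```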

In the present invention, when the sequence information is obtained by paired-end sequencing, the method may further comprise a step of excluding from the calculation any nucleic acid fragment whose reads have an alignment score below a reference value.

In the present invention, in the case of single-end sequencing, the FD value may be derived based on one type of position value, that of either the forward or the reverse sequence information (read).

In the present invention, in the case of single-end sequencing, an arbitrary value is added when the position value is derived from sequence information aligned in the forward direction, and an arbitrary value is subtracted when the position value is derived from sequence information aligned in the reverse direction. The arbitrary value may be any value that allows the FD value to indicate the position of the nucleic acid fragment clearly, but is preferably 0 to 5 kbp or 0 to 300% of the nucleic acid fragment length, 0 to 3 kbp or 0 to 200% of the nucleic acid fragment length, or 0 to 1 kbp or 0 to 100% of the nucleic acid fragment length, and more preferably 0 to 500 bp or 0 to 50% of the nucleic acid fragment length, but is not limited thereto.

The nucleic acid to be analyzed in the present invention is sequenced and can be expressed in units called reads. Depending on the sequencing method, reads can be divided into single-end sequencing reads (SE) and paired-end sequencing reads (PE). An SE read is obtained by sequencing a fixed length from either the 5′ or the 3′ end of a nucleic acid molecule, in a random direction, whereas PE reads are obtained by sequencing a fixed length from both the 5′ and the 3′ ends. Because of this difference, it is well known to those skilled in the art that sequencing in SE mode produces one read from one nucleic acid fragment, whereas PE mode produces a pair of two reads from one nucleic acid fragment.

The ideal way to calculate the exact distance between nucleic acid fragments would be to sequence each nucleic acid molecule from beginning to end, align the resulting reads, and use the median (center) of the aligned positions. In practice, however, this approach is constrained by the limitations and cost of sequencing technology. Sequencing is therefore performed in SE or PE mode. In PE mode, the start and end positions of the nucleic acid molecule are known, so the exact position (center) of the nucleic acid fragment can be determined by combining these values. In SE mode, only the information from one end of the nucleic acid fragment is available, so there is a limit to how exactly the position (center) can be calculated.

In addition, when the distance between nucleic acid molecules is calculated using the end information of all reads sequenced (aligned) in both the forward and the reverse direction, an inaccurate value may be obtained because of the sequencing direction.

For technical reasons inherent to the sequencing method, the 5′ end of a forward read has a smaller position value than the central position of the nucleic acid molecule, and the 3′ end of a reverse read has a larger value. Using this property, a value close to the central position of the nucleic acid molecule can be estimated by adding an arbitrary value (extended bp) to the position of a forward read and subtracting it from the position of a reverse read.

The arbitrary value (extended bp) may vary depending on the sample used. For cell-free nucleic acids, the average fragment length is known to be about 166 bp, so the value can be set to about 80 bp. If the experiment was carried out with fragmentation equipment (e.g., sonication), the extended bp can be set to about half of the target length chosen for the fragmentation process.
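
The strand-dependent correction for single-end reads can be sketched as follows. This is a minimal illustration assuming that each read contributes one end position and one strand; the function name `estimated_center`, the strand encoding ("+" and "-"), and the example positions are assumptions, while the default of 80 bp follows the cell-free nucleic acid case described in the text.

```python
def estimated_center(read_end_pos, strand, extended_bp=80):
    """Estimate the fragment center from a single read end.

    Forward reads lie to the left of the fragment center, so the offset is
    added; reverse reads lie to the right, so the offset is subtracted.
    """
    if strand == "+":
        return read_end_pos + extended_bp
    if strand == "-":
        return read_end_pos - extended_bp
    raise ValueError(f"unknown strand: {strand!r}")


# Made-up positions: a forward read end at 4183 and a reverse read end at 4349.
forward_estimate = estimated_center(4183, "+")  # 4263
reverse_estimate = estimated_center(4349, "-")  # 4269
```

With a sonicated library, `extended_bp` would instead be set to roughly half of the target fragment length, as the text suggests.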

In the present invention, the step of determining a chromosomal abnormality in step (E) may be performed by a method comprising:

(E-i) determining a representative value (RepFD) of the FD values for the entire chromosome region or for each specific region;

(E-ii) deriving a normalization factor (Normalized Factor) by calculating one or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation and coefficient of variation of the RepFD values of specific regions in the sample other than the entire chromosome region or specific region to be analyzed, and/or one or more reciprocals thereof;

(E-iii) calculating a representative value ratio (RepFD ratio) based on Equation 1 below;

Equation 1: RepFD ratio = RepFD(target genomic region) / Normalized Factor

and

(E-iv) calculating an FDI (Fragments Distance Index) by comparing the RepFD ratio values of a normal reference group and of the sample.
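
Steps (E-i) to (E-iii) can be illustrated with a small sketch. It makes two choices among the many the text permits, both of which are assumptions: RepFD is taken as the median of the FD values in a region, and the Normalized Factor as the mean of the RepFD values of the normalizing regions. Function names and the example numbers are hypothetical.

```python
import statistics


def rep_fd(fd_values):
    """Step (E-i): representative FD value of one region (assumed: median)."""
    return statistics.median(fd_values)


def rep_fd_ratio(target_fd, normalizer_fd_by_region):
    """Steps (E-ii) and (E-iii): RepFD of the target divided by the Normalized Factor.

    normalizer_fd_by_region maps a region name to its list of FD values.
    The Normalized Factor is assumed here to be the mean of the regions' RepFD values.
    """
    normalized_factor = statistics.mean(
        [rep_fd(values) for values in normalizer_fd_by_region.values()]
    )
    return rep_fd(target_fd) / normalized_factor


# Made-up FD values: the target region has larger gaps than the normalizers.
target_fd = [100, 120, 140]
normalizers = {"chr1": [90, 100, 110], "chr2": [95, 100, 105]}
ratio = rep_fd_ratio(target_fd, normalizers)  # 120 / 100
```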

In the present invention, the representative value (RepFD) of step (E-i) may be one or more values selected from the group consisting of the sum, difference, product, mean, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation and coefficient of variation of the FD values, and/or one or more reciprocals thereof, and is preferably the median or mean of the FD values or a reciprocal thereof, but is not limited thereto.

In the present invention, the entire chromosome region or specific genetic region may be any set of human nucleic acid sequences, but is preferably a whole chromosome or a specific region of a chromosome. For example, for detecting numerical abnormalities the specific region may be an autosome presumed to be euploid, and for detecting structural abnormalities it may be any genetic region other than regions of low uniqueness (centromeres, telomeres), but is not limited thereto.

The specific region in the sample other than the entire chromosome region or specific genetic region to be analyzed in step (E-ii) may be selected by a method comprising:

a) randomly selecting regions other than the entire chromosome region or specific genetic region to be analyzed;

b) determining a representative value of the RepFD values of the genetic regions selected in step a) as a pre-normalization factor (Pre Normalized Factor, PNF);

c) calculating a representative value ratio (RepFD ratio) based on Equation 2 below:

Equation 2: RepFD ratio = RepFD(target genomic region) / PNF

d) calculating the coefficient of variation (SD / mean) of the RepFD ratio values of a normal reference group; and

e) repeating steps a) to d) and determining the genetic regions that give the smallest coefficient of variation as the specific region in the sample other than the entire chromosome region or specific genetic region.

In the present invention, the repetition in step e) may be performed 100 times or more, preferably between 10,000 and 1,000,000 times, and most preferably 100,000 times, but is not limited thereto.
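
Steps a) to e) amount to a random search for the normalizing regions that make the RepFD ratio most stable across the normal reference group. The sketch below is one way to express that search and rests on several assumptions: RepFD values per region are already computed for each reference sample, the PNF is the mean of the selected regions' RepFD values, a fixed number `k` of regions is drawn per trial, and the sample standard deviation is used. All names and data are hypothetical.

```python
import random
import statistics


def select_normalizing_regions(ref_rep_fd, target, candidates, k=2, trials=1000, seed=0):
    """Return (regions, cv) with the smallest coefficient of variation.

    ref_rep_fd: list of dicts, one per normal reference sample, region -> RepFD.
    target: name of the region under analysis.
    candidates: region names that may be used for normalization.
    """
    rng = random.Random(seed)
    best_regions, best_cv = None, float("inf")
    for _ in range(trials):
        regions = rng.sample(candidates, k)                        # step a)
        ratios = []
        for sample in ref_rep_fd:
            pnf = statistics.mean([sample[r] for r in regions])    # step b)
            ratios.append(sample[target] / pnf)                    # step c)
        cv = statistics.stdev(ratios) / statistics.mean(ratios)    # step d)
        if cv < best_cv:                                           # step e)
            best_regions, best_cv = sorted(regions), cv
    return best_regions, best_cv


# Made-up reference data: chr1 and chr2 track the target exactly, chr3 is noisy.
reference = [
    {"chr21": 100, "chr1": 100, "chr2": 100, "chr3": 80},
    {"chr21": 110, "chr1": 110, "chr2": 110, "chr3": 130},
    {"chr21": 90, "chr1": 90, "chr2": 90, "chr3": 70},
]
best_regions, best_cv = select_normalizing_regions(
    reference, target="chr21", candidates=["chr1", "chr2", "chr3"]
)
```

The default of 1,000 trials is only for the toy example; the text states a most preferred count of 100,000 repetitions.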

In the present invention, step (E-iv) may comprise comparing the RepFD ratio values of the normal reference group with the RepFD ratio value of the sample.

In the present invention, any method capable of confirming that the RepFD ratio values of the normal reference group and the RepFD ratio of the sample differ in a statistically significant way may be used for the comparison. Preferably, the method is selected from a Z-score based on the mean and standard deviation, a log ratio based on the median, and a likelihood ratio calculated by a classification algorithm, and most preferably it is the Z-score calculation based on the mean and standard deviation, but the invention is not limited thereto.

In the present invention, the Fragments Distance Index is calculated by comparing the RepFD ratio values of the normal reference group with that of the sample to be analyzed. A standard score such as the Z-score may be used for the comparison, and the threshold may be any positive or negative integer or range without limit, preferably 3, but is not limited thereto.
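
The most preferred comparison, a Z-score against the normal reference group, can be sketched as follows. The sketch assumes the sample standard deviation and a symmetric decision rule (|FDI| greater than the cutoff), neither of which is fixed by the text; the function names and the reference values are hypothetical.

```python
import statistics


def fdi_z_score(sample_ratio, reference_ratios):
    """FDI as a Z-score of the sample's RepFD ratio against the reference group."""
    mu = statistics.mean(reference_ratios)
    sd = statistics.stdev(reference_ratios)  # assumed: sample standard deviation
    return (sample_ratio - mu) / sd


def is_abnormal(fdi, cutoff=3.0):
    """Flag the region when the FDI lies outside +/- cutoff (assumed symmetric)."""
    return abs(fdi) > cutoff


# Made-up RepFD ratios for five normal reference samples.
reference_ratios = [0.98, 1.00, 1.02, 1.00, 1.00]

low_sample = fdi_z_score(0.93, reference_ratios)     # far below the reference mean
normal_sample = fdi_z_score(1.01, reference_ratios)  # within the reference spread
```

A median-based log ratio or a classifier likelihood ratio could replace `fdi_z_score` without changing the thresholding step.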

In another aspect, the present invention relates to an apparatus for detecting a chromosomal abnormality, comprising: a decoding unit that extracts nucleic acids from a biological sample and decodes their sequence information;

an alignment unit that aligns the decoded sequences to a reference genome database; and

a chromosomal abnormality determination unit that calculates FD values (Fragments Distance) by measuring the distances between aligned nucleic acid fragments for the selected nucleic acid fragments, calculates an FDI value (Fragments Distance Index) for the entire chromosome region or for each specific genetic region based on the calculated FD values, and determines that a chromosomal abnormality is present when the FDI value is below or above a reference value or range.

In still another aspect, the present invention relates to a computer-readable storage medium comprising instructions configured to be executed by a processor for detecting a chromosomal abnormality, the instructions detecting the chromosomal abnormality through:

(C) calculating FD values (Fragments Distance) by measuring the distances between selected nucleic acid fragments; and

(D) calculating an FDI value (Fragments Distance Index) for the entire chromosome region or for each specific genetic region based on the FD values calculated in step (C), and determining that a chromosomal abnormality is present when the FDI value is below or above a reference value range.

Specifically, the computer-readable storage medium according to the present invention may comprise instructions configured to be executed by a processor for detecting a chromosomal abnormality, the instructions detecting the chromosomal abnormality through the following steps, but is not limited thereto:

(C) grouping the nucleic acid fragments aligned on the basis of the sequence information (reads) into all sequences, forward sequences and reverse sequences;

(D) calculating the FD values (Fragments Distance) of each group by measuring the distances between the reference values of the aligned nucleic acid fragments for each group of nucleic acid fragments; and

(E) calculating an FDI value (Fragments Distance Index) for each group, for the entire chromosome region or for each specific region, based on the FD values of each group calculated in step (D), and determining that a chromosomal abnormality is present when all of the FDI values are below or above a reference value range.

In another example of the present invention, it was confirmed that the distance between reads can be calculated by adding 50% of the average length of the nucleic acid fragments under analysis to, or subtracting it from, the median value of the aligned nucleic acid fragments to obtain the position values of both ends of the reads (FIG. 6).

Accordingly, in another aspect, the present invention relates to a method for detecting a chromosomal abnormality, comprising:

(A) extracting nucleic acids from a biological sample and obtaining sequence information;

(B) aligning the obtained sequence information (reads) to a reference genome database;

(C) calculating an RD value (Read Distance) by measuring the distance between the aligned reads; and

(D) calculating an RDI value (Read Distance Index) for the entire chromosome region or for each specific region from the RD value calculated in step (C), and determining that a chromosomal abnormality is present when the RDI value falls below or above the reference range.

In the present invention, step (A) may be performed by a method comprising:

(A-v) obtaining sequence information (reads) of the nucleic acids on a next-generation sequencer.

In the present invention, the reads used in step (B) may be 5 to 5,000 bp in length, and the number of reads used may be 5,000 to 5,000,000, but is not limited thereto.

In the present invention, step (C) may additionally comprise grouping the aligned reads according to their aligned direction.

Where reads are grouped, the grouping may be performed on the basis of the adapter sequences of the aligned reads. Reads aligned in the forward direction and reads aligned in the reverse direction may be separated, and the RD value may be calculated for the selected sequence information.

The present invention may further comprise, before step (C), separately selecting the aligned reads that satisfy a mapping quality score threshold.

In the present invention, the RD value of step (C) may be calculated, for the n reads obtained, as the distance between the i-th read and one or more reads selected from the (i+1)-th to n-th reads, where the position of each read is obtained by adding or subtracting 50% of the average nucleic acid length to or from one of its two end positions.

In the present invention, for the n reads obtained, the distance between the first read and one or more reads selected from the group consisting of the second to n-th reads may be calculated, and the RD value may be the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, or coefficient of variation of these distances, a reciprocal or combination thereof, a weighted calculation result, or another statistic, without being limited thereto.

In the present invention, the median of RD is the value located at the center when the calculated RD values are sorted by magnitude. For example, for an odd number of values such as 1, 2, and 100, the median is 2 because 2 lies at the center. For an even number of RD values, the median is the average of the two central values. For example, for RD values of 1, 10, 90, and 200, the median is 50, the average of 10 and 90.

In the present invention, the RD value may be calculated as the distance between the 5' or 3' end of the i-th read and the 5' or 3' end of any one or more of the (i+1)-th to n-th reads.

For example, in a pair of 50 bp paired-end reads, if the forward read aligns at position 4183 of chromosome 1 and the reverse read aligns at position 4349, the two ends of this nucleic acid fragment are 4183 and 4349, and any value from 4183 to 4349 can serve as the reference value for the fragment distance. If, in an adjacent paired-end read pair, the forward read aligns at position 4349 of chromosome 1 and the reverse read at position 4515, the position value of that fragment is 4349 to 4515. The distance between these two fragments may be 0 to 333, and is most preferably 166, the distance between the center positions of the two fragments (4266 and 4432). In this example the average fragment length is 166. Subtracting 50% of the average length (83) from each center position gives a position value of 4183 for the first fragment and 4349 for the second, so the distance between the reads is 166 (4349 - 4183). Adding 50% of the average length to each center position instead gives 4349 for the first fragment and 4515 for the second, and the distance between the reads is again 166 (4515 - 4349).
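The arithmetic of this example can be checked with a short sketch. The function name `shifted_position` and the variable names are illustrative assumptions and do not appear in the specification; only the coordinates and the 166 bp average length come from the example.

```python
def shifted_position(center, mean_fragment_length, add=False):
    """Shift a fragment center by 50% of the mean fragment length."""
    half = mean_fragment_length / 2
    return center + half if add else center - half


# Fragment ends taken from the worked example (chromosome 1).
frag1 = (4183, 4349)
frag2 = (4349, 4515)
mean_len = 166

center1 = sum(frag1) / 2  # center of the first fragment
center2 = sum(frag2) / 2  # center of the second fragment

# Subtracting 50% of the mean length from each center.
minus = (shifted_position(center1, mean_len), shifted_position(center2, mean_len))
# Adding 50% of the mean length to each center.
plus = (
    shifted_position(center1, mean_len, add=True),
    shifted_position(center2, mean_len, add=True),
)

rd_minus = minus[1] - minus[0]
rd_plus = plus[1] - plus[0]
print(minus, rd_minus)
print(plus, rd_plus)
```

Both variants shift the two centers by the same amount, so the distance between the reads is unchanged.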

In the present invention, the determination of a chromosomal abnormality in step (D) may comprise:

(D-i) determining a representative value of the RD values (RepRD) for each entire chromosome region or each specific genomic region;

(D-ii) deriving a Normalized Factor by calculating one or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation of the RepRD values of regions in the sample other than the entire chromosome region or specific genomic region to be analyzed, and/or one or more reciprocals thereof;

(D-iii) calculating the representative value ratio (RepRD ratio) by Equation 10 below;

Equation 10: RepRD ratio = RepRD_{Target genomic region} / Normalized Factor

(D-iv) calculating the RDI (Read Distance Index) by comparing the RepRD ratio of the sample with that of a normal reference group.

In the present invention, the representative value (RepRD) of step (D-i) may be one or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation of the RD values, and/or one or more reciprocals thereof, or another statistic. It is preferably the median or mean of the RD values, or a reciprocal thereof, but is not limited thereto.

In the present invention, the median of RepRD is the value located at the center when the calculated RepRD values are sorted by magnitude. For example, for an odd number of values such as 1, 2, and 100, the median is 2 because 2 lies at the center. For an even number of RepRD values, the median is the average of the two central values. For example, for RepRD values of 1, 10, 90, and 200, the median is 50, the average of 10 and 90.

In the present invention, the entire chromosome region or specific genomic region may be any set of human nucleic acid sequences, without limitation, and is preferably a whole chromosome or a specific region of a chromosome. For example, for detecting numerical abnormalities, the specific region may be an autosome presumed to be euploid; for detecting structural abnormalities, it may be any genomic region except regions of low uniqueness (centromeres, telomeres), but is not limited thereto.

The region in the sample other than the entire chromosome region or specific genomic region to be analyzed in step (D-ii) may be selected by a method comprising:

a) randomly selecting a region other than the entire chromosome region or specific genomic region to be analyzed;

b) determining the representative value of the RepRD values of the region selected in step a) as a Pre Normalized Factor (PNF);

c) calculating the representative value ratio (RepRD ratio) by Equation 11 below:

Equation 11: RepRD ratio = RepRD_{Target genomic region} / PNF

d) calculating the coefficient of variation (SD / mean) of the RepRD ratio values of a normal reference group; and

e) repeating steps a) to d) and selecting the region with the smallest coefficient of variation as the region other than the entire chromosome region or specific genomic region.
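Steps a) to e) can be sketched as a random search that keeps the candidate set with the smallest coefficient of variation. This is a minimal illustration under stated assumptions: the function name, the dictionary layout of the reference data, the use of the median as the representative value in step b), the sample standard deviation, and the fixed number of trials are all hypothetical choices, not details given in the specification.

```python
import random
import statistics


def select_normalization_set(ref_samples, target, candidates, set_size, n_trials, seed=0):
    """Pick the candidate region set whose RepRD ratio has the lowest CV.

    ref_samples: list of dicts (one per normal reference sample) mapping
    region name -> RepRD. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    best_set, best_cv = None, float("inf")
    for _ in range(n_trials):
        regions = rng.sample(candidates, set_size)  # step a: random selection
        ratios = []
        for sample in ref_samples:
            pnf = statistics.median(sample[r] for r in regions)  # step b: PNF
            ratios.append(sample[target] / pnf)  # step c: Equation 11
        cv = statistics.stdev(ratios) / statistics.mean(ratios)  # step d: SD / mean
        if cv < best_cv:  # step e: keep the smallest CV
            best_set, best_cv = sorted(regions), cv
    return best_set, best_cv


# Toy reference group: region "A" tracks the target exactly, region "B" does not.
reference = [
    {"target": 20.0, "A": 10.0, "B": 10.0},
    {"target": 24.0, "A": 12.0, "B": 30.0},
    {"target": 28.0, "A": 14.0, "B": 5.0},
]
best, cv = select_normalization_set(reference, "target", ["A", "B"], set_size=1, n_trials=50)
print(best, cv)
```

A region whose RepRD moves together with the target across normal samples gives a nearly constant ratio, which is why the smallest coefficient of variation is used as the selection criterion.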

In the present invention, step (D-iv) may comprise comparing the RepRD ratio of the normal reference group with the RepRD ratio of the sample.

In the present invention, any method capable of confirming a statistically significant difference between the two values may be used to compare the RepRD ratio of the normal reference group with that of the sample. The method is preferably a Z-score based on the mean and standard deviation, a log ratio based on the median, or a likelihood ratio produced by a classification algorithm, and most preferably a Z-score based on the mean and standard deviation, but is not limited thereto.

In the present invention, the Reads Distance Index is calculated by comparing the RepRD ratio of the sample to be analyzed with that of the normal reference group. A standard score such as the Z-score may be used for the comparison, and the threshold may be any positive or negative number or range, preferably -3 or 3, but is not limited thereto.

In another aspect, the present invention relates to a chromosomal abnormality detection apparatus comprising: a decoding unit that extracts nucleic acids from a biological sample and decodes the sequence information; and

a chromosomal abnormality determination unit that calculates an RD value (Read Distance) by measuring the distance between the aligned reads of the selected sequence information (reads), calculates an RDI value (Read Distance Index) for each genomic region from the calculated RD value, and determines that a chromosomal abnormality is present when the RDI value is below or above a reference value or range.

The present invention also relates to a computer-readable storage medium comprising instructions configured to be executed by a processor that detects a chromosomal abnormality through the following steps:

(C) calculating an RD value (Read Distance) by measuring the distance between the aligned reads of the selected sequence information (reads); and

(D) calculating an RDI value (Read Distance Index) for each genomic region from the RD value calculated in step (C), and determining that a chromosomal abnormality is present when the RDI value falls below or above the reference range.

Examples

Hereinafter, the present invention is described in more detail through examples. These examples serve only to illustrate the present invention, and it will be apparent to those of ordinary skill in the art that the scope of the present invention is not to be construed as limited by them.

Example 1. DNA extraction from blood and next-generation sequencing

10 mL of blood was collected from each sample and stored in an EDTA tube. Within 2 hours of collection, the plasma fraction was separated by a first centrifugation at 1,200 g and 4 °C for 15 minutes, and the plasma was then centrifuged a second time at 16,000 g and 4 °C to isolate the plasma supernatant free of precipitate. Cell-free DNA was extracted from the isolated plasma with the Tiangenmicro DNA kit (Tiangen). Paired-end (PE) data were produced by preparing libraries with the MGIEasy Cell-free DNA Library Prep set kit (MGI) and sequencing on a DNBseq G400 instrument (MGI) (50 cycles x 2). Single-end (SE) data were produced by preparing libraries with the Truseq Nano DNA HT library prep kit (Illumina) and sequencing on a Nextseq500 instrument (Illumina).

The PE data yielded sequence information for about 10 million nucleic acid fragments, and the SE data yielded sequence information for about 1.3 million nucleic acid fragments.

Example 2. Quality control of sequence data and calculation of FD values

Before the FD values were calculated, the sequence information was preprocessed as follows. The fastq files generated by the next-generation sequencing (NGS) instrument were aligned to the Hg19 reference genome with the BWA-mem algorithm. Because alignment of library sequences can introduce errors, two correction steps were applied: duplicate library sequences were removed first, and then aligned library sequences whose Mapping Quality Score was below 60 were removed.
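The two filtering steps can be sketched on SAM-format text lines. This sketch assumes that duplicates have already been marked with the SAM duplicate flag (0x400) by an upstream tool, which the example does not name; the function names and the toy records are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate
UNMAPPED_FLAG = 0x4     # SAM FLAG bit: segment unmapped


def passes_qc(sam_line, min_mapq=60):
    """Return True if an alignment record is mapped, not a duplicate, and has MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])  # SAM columns 2 (FLAG) and 5 (MAPQ)
    if flag & UNMAPPED_FLAG or flag & DUPLICATE_FLAG:
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=60):
    """Drop header lines and records that fail the QC criteria."""
    return [ln for ln in lines if not ln.startswith("@") and passes_qc(ln, min_mapq)]


records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t4183\t60\t50M\t*\t0\t0\t*\t*",     # kept
    "r2\t1024\tchr1\t4183\t60\t50M\t*\t0\t0\t*\t*",  # marked duplicate
    "r3\t16\tchr1\t4349\t30\t50M\t*\t0\t0\t*\t*",    # MAPQ below 60
]
kept = filter_sam(records)
print(kept)
```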

The selected reads were grouped into forward and reverse reads according to their aligned direction, and the distance to the nearest read was calculated as the FD value using Equation 3 below; the concept is illustrated in FIGS. 2 and 3. The function D in Equation 3 calculates the difference between two genomic positions. In Equation 3, a and b are position values of nucleic acid fragments. For PE sequencing, the position value may be any value between the minimum and maximum aligned positions of the two reads; for SE sequencing, it may be the aligned position of the read or that position extended by a specific value.

Equation 3: Fragment Distance (FD) = D(a, b) | a ∈ F_i, b ∈ F_i
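A minimal sketch of the grouping and distance calculation follows. It assumes that "nearest read" means the next read downstream on the same chromosome after sorting by position, and that D is a plain coordinate difference; the function names and the `(position, strand)` input layout are illustrative.

```python
def fragment_distances(positions):
    """Distances between consecutive positions after sorting (assumed reading of D)."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def fd_by_group(reads):
    """Compute FD values for all, forward ('+'), and reverse ('-') reads.

    reads: iterable of (position, strand) tuples on a single chromosome.
    """
    groups = {"all": [], "forward": [], "reverse": []}
    for pos, strand in reads:
        groups["all"].append(pos)
        groups["forward" if strand == "+" else "reverse"].append(pos)
    return {name: fragment_distances(p) for name, p in groups.items()}


toy_reads = [(100, "+"), (150, "-"), (260, "+"), (300, "-"), (310, "+")]
fd = fd_by_group(toy_reads)
print(fd)
```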

Example 3. Effect of extension on FD values

For PE data, the start and end positions of each nucleic acid fragment are known, so the distance between fragments can be calculated from their center positions. Reads from the PE data were randomly assigned to a For group or a Rev group. FD was calculated from the 5' position of the For read for reads in the For group and from the 3' position of the Rev read for reads in the Rev group. An extension was then applied by adding 80 bp to For reads and subtracting 80 bp from Rev reads.

Comparing the FD values obtained without and with extension showed that, as illustrated in FIG. 5, the FD values calculated after extension were similar to the centered FD values of the PE data, whereas the FD values calculated without extension differed by +166 or -166.
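The effect of the extension can be illustrated on a single hypothetical 166 bp fragment spanning positions 4183 to 4349. The function name and the coordinates are assumptions chosen for illustration; only the 80 bp extension and the direction of the shift come from the example.

```python
def extended_position(position, strand, extension=80):
    """Shift a forward read downstream and a reverse read upstream by the extension."""
    return position + extension if strand == "+" else position - extension


# Hypothetical fragment: forward read anchored at 4183, reverse read anchored at 4349.
start, end = 4183, 4349
center = (start + end) / 2

raw_gap = end - start                # offset between the two anchors without extension
fwd = extended_position(start, "+")  # forward anchor after extension
rev = extended_position(end, "-")    # reverse anchor after extension
print(raw_gap, fwd, rev, center)
```

After extension both anchors land within a few bases of the fragment center, whereas the unextended anchors are a full fragment length apart.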

Example 4. Calculation of FDI values

4-1. Calculation of FDI values for detecting chromosomal numerical abnormalities

FDI values were calculated from SE sequencing data with the extension set to 80 bp. For each chromosome to be tested for aneuploidy, the median of the RepFD values of the corresponding selected chromosome set was defined as the Normalized Factor and calculated by Equation 4 below.

Chromosome tested for aneuploidy | Chromosome set
13 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 19, 20, 22
18 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 19, 20, 22
21 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 19, 20, 22

Equation 4: Normalized Factor = Median of RepFD_{selected chromosome set}

(selected chromosome set: the chromosome set listed in the table above)

Using the FD values and the Normalized Factor from Equations 3 and 4, the RepFD ratio was calculated by Equation 5.

Equation 5: RepFD ratio = RepFD_{Target chromosome} / Normalized Factor

The mean and standard deviation of the RepFD ratio were calculated in a reference group of 2,000 normal subjects, and the Fragments Distance Index (FDI) of the sample to be analyzed was calculated by Equation 6.

Equation 6: FDI = (MEAN(RepFD Ratio_{reference}) - RepFD Ratio_{sample}) / SD(RepFD Ratio_{reference})

The calculation of Equation 6 was performed separately using all nucleic acid fragments (Equation 7), fragments aligned in the forward direction (Equation 8), and fragments aligned in the reverse direction (Equation 9).

Equation 7: FDI^{all} = (mean(RepFD^{all} Ratio_{reference}) - RepFD^{all} Ratio_{sample}) / SD(RepFD^{all} Ratio_{reference})

Equation 8: FDI^{For} = (mean(RepFD^{For} Ratio_{reference}) - RepFD^{For} Ratio_{sample}) / SD(RepFD^{For} Ratio_{reference})

Equation 9: FDI^{Rev} = (mean(RepFD^{Rev} Ratio_{reference}) - RepFD^{Rev} Ratio_{sample}) / SD(RepFD^{Rev} Ratio_{reference})

4-2. Performance of FDI values for detecting chromosomal numerical abnormalities

Analysis of 2,000 samples from the normal reference group and 88 clinical samples, including 6 trisomy samples, showed 100% sensitivity and 100% specificity.

A threshold of 3 was used for positive calls for each of FDI21^{all}, FDI21^{For}, and FDI21^{Rev}.

For each sample, three FDI values were calculated, and the sample was called positive only when all three were 3 or greater. Of the 88 samples analyzed, three (G19NIPT261-3, G19NIPT261-10, G19NIPT261-13) were positive for one FDI value but negative for the other two, and were therefore finally called negative.

Sample | Result | FDI21.All | FDI21.For | FDI21.Rev
G19NIPT264-22 | T21 | 13.5189 | 18.9706 | 22.7449
G19NIPT261-27 | T21 | 11.3797 | 15.7714 | 16.3134
G19NIPT262-11 | T21 | 9.8365 | 13.4254 | 12.6415
G19NIPT263-21 | T21 | 9.6024 | 13.3521 | 13.8184
G19NIPT264-29 | T21 | 9.3665 | 12.3367 | 14.4694
G19NIPT261-3 | T21 | 6.2652 | 7.9757 | 8.6841
G19NIPT263-12 | N | 2.8936 | 3.5516 | 1.5855
G19NIPT263-10 | N | 1.9594 | 2.6681 | 3.1888
G19NIPT263-13 | N | 1.7782 | 2.0879 | 3.8235
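The index calculation and the three-way calling rule can be sketched as follows. The sketch reads Equations 6 to 9 as (reference mean minus sample value) divided by the reference standard deviation, and assumes the sample standard deviation, since the example does not say which estimator was used. The function names are illustrative; the FDI triples in the usage lines are rows of the table above.

```python
import statistics


def fdi(sample_ratio, reference_ratios):
    """FDI read as (mean of reference - sample) / SD of reference."""
    mean_ref = statistics.mean(reference_ratios)
    sd_ref = statistics.stdev(reference_ratios)  # sample SD: an assumption
    return (mean_ref - sample_ratio) / sd_ref


def call_positive(fdi_all, fdi_for, fdi_rev, threshold=3.0):
    """Positive only when all three FDI values reach the threshold."""
    return all(v >= threshold for v in (fdi_all, fdi_for, fdi_rev))


# Rows from the table above.
t21_call = call_positive(13.5189, 18.9706, 22.7449)  # G19NIPT264-22
n_call_a = call_positive(2.8936, 3.5516, 1.5855)     # G19NIPT263-12
n_call_b = call_positive(1.7782, 2.0879, 3.8235)     # G19NIPT263-13
print(t21_call, n_call_a, n_call_b)

# Toy reference ratios (hypothetical values) to exercise fdi().
toy_fdi = fdi(0.7, [1.0, 1.1, 0.9])
print(toy_fdi)
```

Requiring all three groups to agree suppresses calls driven by a single elevated value, as in the last three rows of the table.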

Example 5. DNA extraction from blood and next-generation sequencing for RD-based analysis

10 mL of blood was collected from each of 400 normal subjects, 175 Trisomy 21 cases, 67 Trisomy 18 cases, and 26 Trisomy 13 cases and stored in EDTA tubes. Within 2 hours of collection, the plasma fraction was separated by a first centrifugation at 1,200 g and 4 °C for 15 minutes, and the plasma was then centrifuged a second time at 16,000 g and 4 °C for 10 minutes to isolate the plasma supernatant free of precipitate. Cell-free DNA was extracted from the isolated plasma with the Tiangenmicro DNA kit (Tiangen), libraries were prepared with the Truseq Nano DNA HT library prep kit (Illumina), and sequencing was performed on a Nextseq500 instrument (Illumina) in 75 single-end mode.

As a result, approximately 13 million reads were produced per sample.

Example 6. Quality control of sequence data and calculation of RD values

Before the RD values were calculated, the sequence information was preprocessed as follows. The Bcl files (containing the base sequence information) generated by the next-generation sequencing (NGS) instrument were converted to fastq format, and the fastq files were aligned to the Hg19 reference genome with the BWA-mem algorithm. Because alignment of library sequences can introduce errors, two correction steps were applied: duplicate library sequences were removed first, and then aligned library sequences whose Mapping Quality Score was below 60 were removed.

The selected reads were grouped into forward and reverse reads according to their aligned direction, and the distance to the nearest read was calculated as the RD value using Equation 12 below; the concept is illustrated in FIG. 7. The function D in Equation 12 calculates the difference between two genomic positions.

Equation 12: Read Distance (RD) = D(a, b) | a ∈ R_i, b ∈ F_i

Example 7. Calculation of RDI values

7-1. Calculation of RDI values for detecting chromosomal numerical abnormalities

The median RD value of each chromosome was defined as RepRD. For each chromosome to be tested for aneuploidy, the median of the RepRD values of the corresponding selected chromosome set was defined as the Normalized Factor and calculated by Equation 13 below.

Chromosome tested for aneuploidy | Chromosome set
13 | 4, 6
18 | 5, 8
21 | 2, 4, 14, 20

Equation 13: Normalized Factor = Median of RepRD_{selected chromosome set}

Using the RD values and the Normalized Factor from Equations 12 and 13, the RepRD ratio was calculated by Equation 14.

Equation 14: RepRD ratio = RepRD_{Target chromosome} / Normalized Factor

The mean and standard deviation of the RepRD ratio were calculated in a reference group of 400 normal subjects, and the Reads Distance Index (RDI) of the sample to be analyzed was calculated by Equation 15.

Equation 15: RDI = (RepRD Ratio_{sample} - MEAN(RepRD Ratio_{reference})) / SD(RepRD Ratio_{reference})
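Equations 13 to 15 can be chained in a short sketch. The chromosome sets are those listed in the table of section 7-1; the function names, the dictionary layout, the toy RepRD values, and the use of the sample standard deviation are assumptions, and Equation 15 is read here as a Z-score.

```python
import statistics

# Chromosome sets from the table in section 7-1.
NORMALIZATION_SETS = {13: [4, 6], 18: [5, 8], 21: [2, 4, 14, 20]}


def rep_rd_ratio(rep_rd, target):
    """Equations 13 and 14: RepRD of the target divided by the median RepRD of its set.

    rep_rd: dict mapping chromosome number -> RepRD (median RD) for one sample.
    """
    factor = statistics.median(rep_rd[c] for c in NORMALIZATION_SETS[target])
    return rep_rd[target] / factor


def rdi(sample_ratio, reference_ratios):
    """Equation 15 read as a Z-score against the normal reference group."""
    mean_ref = statistics.mean(reference_ratios)
    sd_ref = statistics.stdev(reference_ratios)  # sample SD: an assumption
    return (sample_ratio - mean_ref) / sd_ref


# Toy values (hypothetical) for one sample and a three-member reference group.
sample_rep_rd = {2: 100, 4: 110, 14: 90, 20: 100, 21: 95}
ratio21 = rep_rd_ratio(sample_rep_rd, 21)
index21 = rdi(ratio21, [1.0, 1.02, 0.98])
print(ratio21, index21)
```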

7-2. RDI Value Calculation for Chromosomal Structural Abnormalities

The chromosomes were divided into uniform 50 kb bins, and the median RD value in each bin was defined as RepRD. The median of the autosomal RD values was used as the Normalized Factor. The reference group comprised data from 437 normal women, and the mean and standard deviation of the RepRD ratio were calculated for each bin. The RDI value was calculated according to Equation 15.
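The per-bin RepRD can be sketched as below. The sketch assumes that a read is assigned to a bin by its alignment position and that distances are taken only between adjacent reads inside the same bin; neither detail is stated in the text. `rep_rd_per_bin` and the example positions are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List

BIN_SIZE = 50_000  # 50 kb bins, as in the text


def rep_rd_per_bin(positions: List[int]) -> Dict[int, float]:
    """Median distance between adjacent reads within each 50 kb bin."""
    by_bin: Dict[int, List[int]] = defaultdict(list)
    for pos in sorted(positions):
        by_bin[pos // BIN_SIZE].append(pos)

    rep_rd: Dict[int, float] = {}
    for bin_index, pts in by_bin.items():
        gaps = [b - a for a, b in zip(pts, pts[1:])]
        if gaps:  # a bin needs at least two reads to yield a distance
            rep_rd[bin_index] = median(gaps)
    return rep_rd


# Hypothetical read positions on one chromosome
positions = [1_000, 3_000, 6_000, 60_000, 80_000, 95_000]
bins = rep_rd_per_bin(positions)
print(bins)
```

Each bin's RepRD can then be divided by the Normalized Factor and converted to an RDI with Equation 15, using the per-bin mean and standard deviation of the reference group.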

7-3. Performance According to the RD Representative Value (RepRD) Calculation Method (Reciprocal of the Median)

The RD values of the reads aligned to each genetic region (each chromosome) were calculated, and the reciprocal of their median was defined as the RD representative value (RepRD). Here, the median is the value located at the center when the calculated RD values are sorted by magnitude. For example, given the three values 1, 2, and 100, the median is 2 because it is the central value.

When there is an even number of RD values, the median is the average of the two central values. For example, for the RD values 1, 10, 90, and 200, the median is 50, the average of the central values 10 and 90. The analysis used 49 samples confirmed as Trisomy 21 and 3,448 samples confirmed as normal, with the reciprocal of the median RD value as RepRD. RDI values were calculated by the Z-score method using the mean and standard deviation of the RepRD values of the 3,448 normal samples. As a result, the presence of a chromosomal numerical abnormality was detected with an accuracy of about 0.999 (Table 4, FIG. 15).

Chromosome | Accuracy (95% CI) | Sensitivity | Specificity | AUC
T21 | 0.9994 (0.9979, 0.9999) | 0.9795 | 0.9997 | 1.0000

7-4. Performance According to the RD Representative Value (RepRD) Calculation Method (Mean)

The RD values of the reads aligned to each genetic region (each chromosome) were calculated, and their mean was defined as the RD representative value (RepRD). Here, the mean is the arithmetic mean of the calculated RD values; for example, for the RD values 10, 50, and 90, the RD representative value is (10+50+90)/3 = 50. Using 1,999 normal samples and 163 T21 samples, RDI values were calculated by the Z-score method using the mean and standard deviation of RepRD in the normal group. The chromosomes used for the Normalized Factor were 2, 7, 9, 12, and 14. As a result, the presence of a chromosomal numerical abnormality was detected with an accuracy of about 0.9995, and with the threshold set to 4.0 the sensitivity was 0.999 and the specificity was 1.000 (Table 5, FIG. 16).

Chromosome | Accuracy (95% CI) | Sensitivity | Specificity | AUC
T21 | 0.9995 (0.9974, 1.0000) | 0.9999 | 1.0000 | 1.0000

7-5. Performance According to the RD Representative Value (RepRD) Calculation Method (Reciprocal of the Mean)

The RD values of the reads aligned to each genetic region (each chromosome) were calculated, and the reciprocal of their mean was defined as the RD representative value (RepRD). Here, the mean is the arithmetic mean of the calculated RD values; for example, for the RD values 10, 50, and 90, the mean is (10+50+90)/3 = 50, and its reciprocal, 1/50 = 0.02, was used as the RD representative value. Using 1,999 normal samples and 163 T21 samples, RDI values were calculated by the Z-score method using the mean and standard deviation of RepRD in the normal group. The chromosomes used for the Normalized Factor were 2, 7, 8, 9, 12, and 14. As a result, the presence of a chromosomal numerical abnormality was detected with an accuracy of about 0.9995, and with the threshold set to 4.3 the sensitivity was 0.993 and the specificity was 1.000 (Table 6, FIG. 17).

Chromosome | Accuracy (95% CI) | Sensitivity | Specificity | AUC
T21 | 0.9995 (0.9974, 1.0000) | 0.9939 | 1.000 | 1.000
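The representative-value variants compared in Examples 7-1 and 7-3 to 7-5 (median, reciprocal of the median, mean, reciprocal of the mean) can be written as one small selector. This is only a sketch using Python's standard `statistics` module; the names `REP_RD_METHODS` and `rep_rd` are hypothetical, and the inputs are the worked examples from the text.

```python
from statistics import mean, median

# Each entry maps a method name to a function of the list of RD values.
REP_RD_METHODS = {
    "median": median,
    "inverse_median": lambda values: 1.0 / median(values),
    "mean": mean,
    "inverse_mean": lambda values: 1.0 / mean(values),
}


def rep_rd(rd_values, method="median"):
    """Return the RD representative value (RepRD) for the chosen method."""
    return REP_RD_METHODS[method](rd_values)


# Worked examples from the text
print(rep_rd([1, 2, 100], "median"))        # odd count: central value
print(rep_rd([1, 10, 90, 200], "median"))   # even count: mean of 10 and 90
print(rep_rd([10, 50, 90], "mean"))
print(rep_rd([10, 50, 90], "inverse_mean"))
```

Note that the reciprocal variants reverse the direction of the signal: a shorter read distance gives a larger RepRD, which is in line with the positive thresholds (4.0 and 4.3) reported in Examples 7-4 and 7-5.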

Example 8. Performance of RDI Values for Detecting Chromosomal Numerical Abnormalities

8-1. Distribution of Read Count and Read Distance

In an analysis based on the distance between aligned reads, the more reads are produced, the shorter the distance between reads is expected to be. To confirm this, the distribution of the number of reads and the RepRD value was analyzed for each chromosome.

As a result, RepRD was found to decrease overall as the number of reads increased. In particular, the relationship between the number of reads and RepRD was non-linear rather than linear, which indicates that the read distance reflects not only the number of reads but also their aligned positions (FIG. 8).

Compared with normal samples, the RepRD values of the chromosomes identified as aneuploid were distributed lower in the Trisomy 13, 18, and 21 samples, respectively (FIG. 9).

8-2. Performance of the Reads Distance Index and Its Relationship with Fetal Fraction, Clinical Information, and the Existing G-score

In fetal aneuploidy testing using maternal blood, the fetal fraction and gestational age strongly affect test accuracy. The fetal fraction tends to increase with gestational age, and test accuracy increases with the fetal fraction. Analysis of the distribution of the RDI_chr21 values of Trisomy 21 samples against maternal gestational age and fetal fraction showed that the RDI_chr21 value decreased as the fetal fraction increased. For gestational age, the RDI_chr21 value tended to decrease in samples at 15 weeks or later. A comparison with the G-score (Korean Patent No. 10-1686146), an existing value based on read counts, showed a similar tendency (FIG. 10).

8-3. Analysis Results for Clinical Samples Determined to Be Positive

Using RDI values, analytical performance was verified on the normal group and on samples confirmed to have each chromosomal aneuploidy. With the RDI cutoff set to -3, comparison of normal and aneuploid samples gave accuracies of 0.991 for Trisomy 13, 0.989 for Trisomy 18, and 0.998 for Trisomy 21 (Table 7). The AUC values were 0.999, 0.984, and 1.000 for Trisomy 13, 18, and 21, respectively (FIG. 11).

Chromosome | Accuracy (95% CI) | Sensitivity | Specificity | AUC
T13 | 0.991 (0.976-0.997) | 0.846 | 1.000 | 0.999
T18 | 0.989 (0.975-0.997) | 0.925 | 1.000 | 0.984
T21 | 0.998 (0.99-1) | 1.000 | 0.998 | 1.000
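How accuracy, sensitivity, and specificity follow from a fixed cutoff can be sketched as below. The RDI values and labels are invented for illustration and do not reproduce the table. The sketch assumes that a sample is called positive when its RDI falls below the cutoff, which matches the negative cutoff (-3) and the lower RepRD seen in trisomy samples; the function names are hypothetical.

```python
def classify(rdi_values, cutoff=-3.0):
    """Call a sample positive when its RDI falls below the cutoff (assumption)."""
    return [rdi < cutoff for rdi in rdi_values]


def performance(predicted, truth):
    """Accuracy, sensitivity, and specificity from predicted and true labels."""
    pairs = list(zip(predicted, truth))
    tp = sum(p and t for p, t in pairs)
    tn = sum((not p) and (not t) for p, t in pairs)
    fp = sum(p and (not t) for p, t in pairs)
    fn = sum((not p) and t for p, t in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical RDI values; True marks a confirmed trisomy sample
rdi_values = [-7.1, -4.2, -2.5, -0.3, 0.8, 1.1, -3.4, 0.2]
truth = [True, True, True, False, False, False, False, False]
result = performance(classify(rdi_values), truth)
print(result)
```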

8-4. Performance According to the RDI Calculation Method

In contrast to the Z-score method, which uses the mean and standard deviation of the RepRD ratio in the normal reference group, a log-ratio analysis using the median was also evaluated. The log ratio was calculated by Equation 16.

Equation 16: RDI = log10(RepRD Ratio_sample / Median(RepRD Ratio_reference))
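Equation 16 translates directly into code. The sketch below uses invented ratios and a hypothetical function name; only the formula itself comes from the text.

```python
import math
from statistics import median


def rdi_log_ratio(sample_ratio, reference_ratios):
    """Equation 16: log10 of the sample ratio over the reference median."""
    return math.log10(sample_ratio / median(reference_ratios))


# Hypothetical RepRD ratios of normal reference samples
reference = [0.98, 1.00, 1.02, 1.00, 0.99, 1.01]
rdi_normal = rdi_log_ratio(1.0, reference)
rdi_low = rdi_log_ratio(0.9, reference)
print(rdi_normal, rdi_low)
```

Because the median and the logarithm are less affected by outliers in the reference group than the mean and standard deviation, the resulting cutoff sits on a much smaller numeric scale than the Z-score cutoff.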

The same samples as in Example 8-3 were used, and analytical performance was compared after setting the RDI cutoff to -0.0045. Performance differed slightly by trisomy type: accuracy was 0.976 for Trisomy 21, 0.994 for Trisomy 18, and 0.991 for Trisomy 13 (Table 8).

Chromosome | Accuracy (95% CI) | Sensitivity | Specificity | AUC
T13 | 0.991 (0.976-0.997) | 0.846 | 1.000 | 0.999
T18 | 0.994 (0.981-0.999) | 0.955 | 1.000 | 0.984
T21 | 0.976 (0.959-0.987) | 1.000 | 0.965 | 1.000

8-5. Down-sampling Performance

In non-invasive testing for fetal aneuploidy using next-generation sequencing, the amount of data produced (the number of reads) is known to be an important factor in accuracy. In this example, the analytical performance of the RDI method was evaluated as a function of the number of reads. Performance was measured by the AUC of a ROC analysis, and the number of reads was varied by in-silico random read selection, randomly selecting between one million and ten million reads. Analysis of Trisomy 21 samples confirmed that analytical performance decreased as the number of reads decreased (FIG. 12).

Example 9. Performance of RDI Values for Detecting Chromosomal Structural Abnormalities

9-1. Distribution of Read Count and Read Distance

To examine chromosomal structural abnormalities using RDI, the chromosomes must be divided into bins of an appropriate size; in this example, they were divided into 50 kb bins. The read distance is shorter when there are more reads and longer when there are fewer reads. Examination of the relationship between the number of reads and the read distance in each bin showed that the read distance in a region with a confirmed deletion, a chromosomal structural abnormality, was longer than in regions without structural abnormality (FIG. 13).

9-2. Comparison with Microarray Results

The results of a microarray test, which detects chromosomal structural abnormalities, were compared with the RDI analysis. The sample analyzed had a confirmed deletion of 3,897,640 bp at the end of chromosome 1, and the RDI analysis detected a structural abnormality (deletion) of 3,700,000 bp in a similar region (FIG. 14).

Specific parts of the present invention have been described in detail above. It will be apparent to those of ordinary skill in the art that these specific descriptions are merely preferred embodiments and do not limit the scope of the present invention. Accordingly, the substantial scope of the present invention is defined by the appended claims and their equivalents.

Claims

delete

A method of detecting a chromosomal abnormality, comprising:
(A) extracting nucleic acids from a biological sample and obtaining sequence information from the nucleic acid fragments;
(B) confirming the position of each nucleic acid fragment in a standard chromosome sequence database (reference genome database) based on the obtained sequence information (reads);
(C) grouping the sequence information (reads) into full sequence, forward sequence, and reverse sequence;
(D) using the grouped sequence information, defining a reference value for each nucleic acid fragment, measuring the distance between the reference values, and calculating an FD value (Fragments Distance) for each group; and
(E) calculating an FDI value (Fragments Distance Index) for the entire chromosome region or for each specific region based on the FD value for each group calculated in step (D), and determining that there is a chromosomal abnormality when the FDI value is below or above a reference value range.

5. The method of claim 4, wherein step (A) is performed by a method comprising the following steps:
(A-i) obtaining nucleic acids from blood, semen, vaginal cells, hair, saliva, urine, amniotic fluid including oral cells, placental cells or fetal cells, tissue cells, and mixtures thereof;
(A-ii) removing proteins, fats, and other residues from the collected nucleic acids using a salting-out method, a column chromatography method, or a beads method to obtain purified nucleic acids;
(A-iii) preparing a single-end sequencing or paired-end sequencing library from the purified nucleic acids or from nucleic acids randomly fragmented by enzymatic digestion, pulverization, or a hydroshear method;
(A-iv) reacting the prepared library with a next-generation sequencer; and
(A-v) acquiring sequence information (reads) of the nucleic acids from the next-generation sequencer.

6. The method of claim 4, wherein the FD value of step (D) is calculated, for the n nucleic acid fragments obtained, as the distance between the reference value of the i-th nucleic acid fragment and the reference value of any one or more nucleic acid fragments selected from the (i+1)-th to the n-th nucleic acid fragments.

The method according to claim 6, wherein the reference value of the nucleic acid fragment is a value obtained by adding an arbitrary value to, or subtracting an arbitrary value from, the median value of the nucleic acid fragment.

The method according to claim 7, wherein, in the case of paired-end sequencing, the reference value of the nucleic acid fragment is derived based on the position values of the forward and reverse sequence information (reads).

The method according to claim 8, further comprising excluding from the calculation process a nucleic acid fragment having an alignment score of sequence information (reads) less than a reference value.

The method according to claim 6, wherein, in the case of single-end sequencing, the reference value of the nucleic acid fragment is derived based on one position value of either the forward or the reverse sequence information (read).

The method of claim 10, wherein an arbitrary value is added when the position value is derived from sequence information aligned in the forward direction, and an arbitrary value is subtracted when the position value is derived from sequence information aligned in the reverse direction.

The method of claim 7, wherein the arbitrary value is 30 to 70% of the average length of the nucleic acid to be analyzed.

The method of claim 7, wherein the arbitrary value is 0 to 5 kbp or 0 to 300% of the length of the nucleic acid fragment.

The method according to claim 4, wherein step (E) is performed by a method comprising the following steps:
(E-i) determining a representative value (RepFD) of the FD values for the entire chromosome region or for each specific region;
(E-ii) deriving a normalized factor by calculating one or more values selected from the group consisting of the sum, difference, product, mean, log of product, log of sum, median, quantile, minimum, maximum, variance, standard deviation, absolute deviation, coefficient of variation, reciprocal values thereof, and combinations thereof;
(E-iii) calculating a representative value ratio (RepFD ratio) based on Equation 1 below:
Equation 1: RepFD ratio = RepFD_{Target genomic region} / Normalized Factor
(E-iv) calculating the FDI (Fragments Distance Index) by comparing the RepFD ratio of the sample with that of a normal reference group.

15. The method of claim 14, wherein the representative value (RepFD) of step (E-i) is one or more values selected from the group consisting of the sum, difference, product, mean, log of product, log of sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation, and/or one or more reciprocal values thereof.

The method of claim 15, wherein the representative value (RepFD) of step (E-i) is the median, the mean, or a reciprocal thereof, of the FD values.

The method according to claim 14, wherein the specific region in the sample other than the entire chromosome region to be analyzed or the specific genetic region to be analyzed in step (E-ii) is selected by a method comprising the following steps:
a) randomly selecting an entire region of a chromosome to be analyzed or a region other than a specific genetic region;
b) determining the representative value (RepFD) of the genetic region selected in step a) as a pre-normalized factor (PNF);
c) calculating a representative value ratio (RepFD ratio) based on Equation 2:
Equation 2: RepFD ratio = RepFD_{Target genomic region} / PNF
d) calculating the coefficient of variation (CV = SD / Mean) of the RepFD ratio values of a normal reference group; and
e) determining the genetic region having the smallest value among the coefficients of variation obtained by repeating steps a) to d) as the entire chromosome region or a specific region in the sample other than the specific genetic region.

The method of claim 14, wherein in step (E-iv), the RepFD ratio value of a normal reference group is compared with the RepFD ratio value of the sample.

A chromosomal abnormality detection device comprising:
a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information;
an alignment unit that aligns the decoded sequence to a standard chromosomal sequence database; and
a chromosomal abnormality determination unit that calculates an FD value (Fragments Distance) by measuring the distance between aligned nucleic acid fragments among the selected nucleic acid fragments, calculates an FDI value (Fragments Distance Index) for the entire chromosome region or for each specific genetic region based on the calculated FD value, and determines that there is a chromosomal abnormality when the FDI value is below or above a reference value or range.

A computer-readable storage medium comprising instructions configured to be executed by a processor for detecting a chromosomal abnormality, wherein the chromosomal abnormality is detected through the steps of:
(A) extracting nucleic acids from a biological sample and obtaining sequence information from the nucleic acid fragments;
(B) aligning the nucleic acid fragments to a standard chromosome sequence database (reference genome database) based on the obtained sequence information (reads);
(C) measuring the distance between the selected nucleic acid fragments and calculating an FD value (Fragments Distance); and
(D) calculating an FDI value (Fragments Distance Index) for the entire chromosome region or for each specific genetic region based on the FD value calculated in step (C), and determining that there is a chromosomal abnormality.

The method according to claim 7, wherein when the arbitrary value is 50% of the average length of the nucleic acid to be analyzed, the calculated FD value is an RD value (Read Distance).

A chromosomal abnormality detection method, comprising:
(A) extracting nucleic acids from a biological sample to obtain sequence information;
(B) aligning the obtained sequence information (reads) to a standard chromosome sequence database (reference genome database);
(C) calculating an RD value (Read Distance) by measuring the distance between the aligned reads; and
(D) calculating an RDI value (Read Distance Index) for the entire chromosome region or for each specific region based on the RD value calculated in step (C), and determining that there is a chromosomal abnormality.

The method of claim 22, wherein step (A) is performed by a method comprising the following steps:
(A-i) obtaining nucleic acids from blood, semen, vaginal cells, hair, saliva, urine, amniotic fluid including oral cells, placental cells or fetal cells, tissue cells, and mixtures thereof;
(A-ii) removing proteins, fats, and other residues from the collected nucleic acids using a salting-out method, a column chromatography method, or a beads method to obtain purified nucleic acids;
(A-iii) preparing a single-end sequencing or paired-end sequencing library from the purified nucleic acids or from nucleic acids randomly fragmented by enzymatic digestion, pulverization, or a hydroshear method;
(A-iv) reacting the prepared library with a next-generation sequencer; and
(A-v) acquiring sequence information (reads) of the nucleic acids from the next-generation sequencer.

The method according to claim 22, further comprising, before step (C), grouping the aligned reads according to their aligned direction.

25. The method of claim 22, wherein the RD value of step (C) is calculated, for the n reads obtained, as the distance between a value obtained by adding or subtracting 50% of the average nucleic acid length to or from one of the two end positions of the i-th read and the corresponding value of any one or more reads selected from the (i+1)-th to the n-th reads.

26. The method of claim 25, wherein the RD value is calculated as the distance between the 5' or 3' end within the i-th read and the 5' or 3' end of any one or more of the (i+1)-th to the n-th reads.

The method of claim 22, wherein step (D) is performed by a method comprising the following steps:
(D-i) determining a representative value (RepRD) of the RD values for the entire chromosome region or for each specific region;
(D-ii) deriving a normalized factor by calculating one or more values selected from the group consisting of the sum, difference, product, mean, log of product, log of sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation, and/or one or more reciprocal values thereof;
(D-iii) calculating a representative value ratio (RepRD ratio) based on Equation 10 below:
Equation 10: RepRD ratio = RepRD_{Target genomic region} / Normalized Factor
(D-iv) calculating the RDI (Read Distance Index) by comparing the RepRD ratio of the sample with that of a normal reference group.

28. The method of claim 27, wherein the representative value (RepRD) of step (D-i) is one or more values selected from the group consisting of the sum, difference, product, mean, log of product, log of sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation, and/or one or more reciprocal values thereof.

The method according to claim 28, wherein the representative value (RepRD) of step (D-i) is the median, the mean, or a reciprocal thereof, of the RD values.

The method according to claim 27, wherein the specific region in the sample other than the entire chromosome region to be analyzed or the specific genetic region to be analyzed in step (D-ii) is selected by a method comprising the following steps:
a) randomly selecting an entire region of a chromosome to be analyzed or a region other than a specific genetic region;
b) determining a representative value of the RepRD value of the genetic region selected in step a) as a pre-normalized factor (PNF);
c) calculating a representative value ratio (RepRD ratio) based on Equation 11 below:
Equation 11: RepRD ratio = RepRD_{Target genomic region} / PNF
d) calculating the coefficient of variation (CV = SD / Mean) of the RepRD ratio values of a normal reference group; and
e) determining the genetic region having the smallest value among the coefficients of variation obtained by repeating steps a) to d) as the entire chromosome region or a specific region in the sample other than the specific genetic region.

The method of claim 27, wherein in step (D-iv), the RepRD ratio value of a normal reference group is compared with the RepRD ratio value of the sample.

A chromosomal abnormality detection device comprising:
a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information;
an alignment unit that aligns the decoded sequence to a standard chromosomal sequence database; and
a chromosomal abnormality determination unit that calculates an RD value (Read Distance) by measuring the distance between aligned reads among the selected sequence information (reads), calculates an RDI value (Read Distance Index) for the entire chromosome region or for each specific genetic region based on the calculated RD value, and determines that there is a chromosomal abnormality when the RDI value is below or above a reference value or range.

A computer-readable storage medium comprising instructions configured to be executed by a processor for detecting a chromosomal abnormality, wherein the chromosomal abnormality is detected through the steps of:
(A) extracting nucleic acids from a biological sample to obtain sequence information;
(B) aligning the obtained sequence information (reads) to a standard chromosome sequence database (reference genome database);
(C) calculating an RD value (Read Distance) by measuring the distance between aligned reads among the selected sequence information (reads); and
(D) calculating an RDI value (Read Distance Index) for the entire chromosome region or for each specific genetic region based on the RD value calculated in step (C), and determining that there is a chromosomal abnormality.