KR20200137875A - Non-invasive prenatal testing method and devices based on double Z-score

Info

Publication number: KR20200137875A
Application number: KR1020190064938A
Authority: KR
Inventors: 이민섭; 신상철; 이성훈; 윤선영
Original assignee: 이원다이애그노믹스(주)
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2020-12-09
Also published as: KR20210130680A; KR20220032540A; KR102519739B1

Abstract

The present invention relates to a non-invasive prenatal testing method based on a two-step Z-score, and to an apparatus therefor. More particularly, it relates to a method and apparatus that improve the accuracy of non-invasive prenatal testing by applying the Z-score, a standard statistical measure, twice so that the resulting score is greatly amplified. The two-step Z-score, amplified through two stages of data processing, separates the read count of a specific chromosome from the read counts of the normal sample group as far as possible. This reduces the likelihood of false-positive and false-negative calls of chromosomal abnormality and can thereby increase the accuracy of non-invasive prenatal testing.

Description

Non-invasive prenatal testing method and devices based on double Z-score

The present invention relates to a non-invasive prenatal testing method and apparatus based on a two-step Z-score. More specifically, it relates to a method and apparatus for improving the accuracy of non-invasive prenatal testing by applying the Z-score, one of the standard statistical methods, twice and thereby greatly amplifying it.

In general, 'prenatal diagnosis' refers to the process of determining and diagnosing whether a fetus has a disease before it is born.

According to recent domestic statistics, infants with congenital malformations account for about 3% of all newborns, and about 20% of congenital malformations are reported to be caused by chromosomal abnormalities. In particular, infants with the widely known Down syndrome account for about 26% of infants with congenital malformations. Interest in prenatal diagnosis is growing steadily because of this rise in the birth rate of infants with malformations and the development of various prenatal diagnostic instruments. Prenatal diagnosis is particularly warranted for pregnant women aged 35 or older, pregnant women who have previously delivered a child with a chromosomal abnormality, cases in which one parent has a structural chromosomal abnormality, cases with a family history of genetic disease, cases at risk of neural tube defects, and cases in which fetal malformation is suspected on maternal serum screening or ultrasound.

Prenatal diagnostic methods can be broadly divided into invasive and non-invasive methods. Examples of invasive methods include chorionic villus sampling (CVS), performed between weeks 10 and 12 of pregnancy; amniocentesis, performed between weeks 15 and 20 of pregnancy, in which fetal chromosomes are analyzed by measuring the concentration of AFP in the amniotic fluid by immunoassay; and cordocentesis, performed between weeks 18 and 20 of pregnancy, in which fetal blood is drawn directly from the umbilical cord under ultrasound guidance. However, these invasive methods have the drawback that the procedure can disturb the fetus and cause miscarriage, disease, or malformation.

Non-invasive diagnostic methods are therefore being developed to overcome these problems. For example, preimplantation genetic diagnosis is a technique that uses the molecular genetic or cytogenetic techniques employed in in vitro fertilization to select embryos free of genetic defects before implantation in the uterus. QF-PCR (quantitative fluorescent PCR), used for rapid diagnosis of chromosomal aneuploidy, is a rapid screening test in which short tandem repeat (STR) markers specific to each chromosome are fluorescently labeled, amplified by multiplex PCR, and then quantified by measuring the amount of fluorescently labeled amplified DNA on an automated DNA sequencer. The chromosomal microarray (CMA) method, in which mapped DNA sequences are arrayed on a glass slide and examined to detect copy number changes, is also known.

Meanwhile, as advances in sequencing technology have made it possible to decode genomic information on a large scale, genome analysis methods based on next-generation sequencing (NGS) are also being applied to prenatal diagnosis. In particular, it is known that maternal blood contains fetal genomic material at a level of about 10% of the total genomic material, and prenatal diagnostic methods that exploit this by isolating fetal cells from maternal blood and analyzing their chromosomes are known. In this regard, Korean Patent Application No. 2010-7003969 discloses a method for diagnosing fetal chromosomal aneuploidy using massively parallel genomic sequencing. U.S. Patent No. 8195415 likewise discloses a method of quantitative analysis in which the sequencing results of DNA obtained from maternal blood are mapped to a specific length for each chromosome.

However, the previously reported methods suffer from the problem that, because of statistical limitations, they do not discriminate fetal malformations accurately. Because diagnostic errors for fetal malformations (false positives, FP, and false negatives, FN) can have serious consequences, it is very important to develop more sensitive and accurate analysis algorithms for non-invasive prenatal testing.

Accordingly, the inventors of the present invention conducted intensive research to overcome the limitations of the prior art and improve the accuracy of non-invasive prenatal testing. As a result, they found that chromosomal abnormalities can be discriminated very accurately by calculating a Z-score twice for each chromosome, and thereby completed the present invention.

Accordingly, an object of the present invention is to provide a non-invasive prenatal testing method comprising the steps of: (a) extracting cell-free DNA (cfDNA) from the mother's blood and performing next-generation sequencing (NGS);

(b) mapping the sequenced reads to a reference genome sequence;

(c) dividing each chromosome of the reference genome into preset sections (bins) and calculating, from the mapped reads, the number of reads mapped to each section;

(d) calculating, for each chromosome, the median of the per-section read counts;

(e) calculating a first-step Z-score for each chromosome according to Equation 1 below;

[Equation 1]

(f) calculating a second-step Z-score for each chromosome according to Equation 2 below; and

[Equation 2]

(In the above equation, the representative chromosome group consists of two or more chromosomes selected from the group consisting of chromosomes 1 to 12 and chromosomes 14 to 16.)

(g) determining whether a chromosomal abnormality is present by checking the second-step Z-score calculated for each chromosome.
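As a minimal illustration of step (d), the per-chromosome median of the per-section read counts can be sketched in Python as follows. The data layout (a dictionary mapping a chromosome name to a list of per-section read counts), the function name `median_bin_counts`, and the example numbers are assumptions made for illustration only; the sketch stops at the median and does not attempt the Z-score equations.

```python
from statistics import median


def median_bin_counts(bin_counts):
    """Return the median per-section read count for each chromosome.

    bin_counts: dict mapping chromosome name -> list of per-section read counts
    (hypothetical layout chosen for this sketch).
    """
    return {chrom: median(counts) for chrom, counts in bin_counts.items()}


# Made-up counts, for illustration only.
example = {"chr1": [100, 104, 98, 101], "chr21": [52, 50, 55]}
medians = median_bin_counts(example)
```

The median is less sensitive than the mean to a few sections with unusually high or low coverage, which is one plausible reason to summarize a chromosome this way.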

Another object of the present invention is to provide a non-invasive prenatal testing apparatus comprising: (a) a sequencing unit that extracts cell-free DNA (cfDNA) from the mother's blood and performs next-generation sequencing (NGS);

(b) a mapping unit that maps the sequenced reads to a reference genome sequence;

(c) a read count calculation unit that divides each chromosome of the reference genome into preset sections (bins) and calculates, from the mapped reads, the number of reads mapped to each section;

(d) a median calculation unit that calculates, for each chromosome, the median of the per-section read counts;

(e) a first-step Z-score calculation unit that calculates a first-step Z-score for each chromosome according to Equation 1 below;

[Equation 1]

(f) a second-step Z-score calculation unit that calculates a second-step Z-score for each chromosome according to Equation 2 below; and

[Equation 2]

(g) a determination unit that determines whether a chromosomal abnormality is present by checking the second-step Z-score calculated for each chromosome.

To achieve the above object, the present invention provides a non-invasive prenatal testing method comprising the steps of: (a) extracting cell-free DNA (cfDNA) from the mother's blood and performing next-generation sequencing (NGS);

[Equation 1]

[Equation 2]

(g) determining whether a chromosomal abnormality is present by checking the second-step Z-score calculated for each chromosome.

To achieve the other object of the present invention, the present invention provides a non-invasive prenatal testing apparatus comprising: (a) a sequencing unit that extracts cell-free DNA (cfDNA) from the mother's blood and performs next-generation sequencing (NGS);

[Equation 1]

[Equation 2]

(g) a determination unit that determines whether a chromosomal abnormality is present by checking the second-step Z-score calculated for each chromosome.

Hereinafter, the present invention is described in detail.

The present invention comprises the steps of: (a) extracting cell-free DNA (cfDNA) from the mother's blood and performing next-generation sequencing (NGS);

[Equation 1]

[Equation 2]

(a) extracting cell-free DNA (cfDNA) from the mother's blood and performing next-generation sequencing (NGS);

In the present invention, the mother may be a woman at about 1 to about 45 weeks of gestation (e.g., 1 to 4, 4 to 8, 8 to 12, 12 to 16, 16 to 20, 20 to 24, 24 to 28, 28 to 32, 32 to 36, 36 to 40, or 40 to 44 weeks of gestation), and often at about 5 to about 28 weeks of gestation (e.g., 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, or 27 weeks of gestation).

Fetal DNA found in maternal blood can be analyzed using, for example, whole blood, serum, or plasma. Methods for preparing serum or plasma from maternal blood are known. For example, the blood of a pregnant woman can be placed in a tube containing EDTA to prevent clotting, or in a specialized commercial product such as Vacutainer SST (Becton Dickinson, Franklin Lakes, New Jersey, USA), after which plasma can be obtained from the whole blood by centrifugation. Serum can be obtained after blood clotting, with or without centrifugation. When centrifugation is used, it is typically (though not exclusively) performed at an appropriate speed, for example 1,500 × g to 3,000 × g. The plasma or serum may be subjected to an additional centrifugation step before being transferred to a fresh tube for DNA extraction.

Numerous methods for extracting DNA from blood are known. General DNA preparation methods (for example, those described in Sambrook and Russell, Molecular Cloning: A Laboratory Manual, 3d ed., 2001) can be followed. Various commercially available reagents or kits can also be used to obtain DNA from a blood sample from a pregnant woman, such as the QIAamp Circulating Nucleic Acid Kit, QiaAmp DNA Mini Kit, or QiaAmp DNA Blood Mini Kit (Qiagen, Hilden, Germany), the GenomicPrep™ Blood DNA Isolation Kit (Promega, Madison, Wisconsin, USA), and the GFX™ Genomic Blood DNA Purification Kit (Amersham, Piscataway, New Jersey, USA). More than one of these methods may be used in combination.

In particular, because cell-free DNA (cfDNA) is present in serum only in trace amounts, even a small number of maternal cells mixed into the serum can affect sensitivity. Completely removing maternal cells from the serum is therefore important for this test.

As used herein, the term "cell-free DNA" may refer to nucleic acid isolated from a source that is substantially free of cells, and is also referred to as "cell-free" nucleic acid, "circulating cell-free nucleic acid" (e.g., CCF fragments), and/or "cell-free circulating" nucleic acid. Extracellular nucleic acid can be present in blood and can be obtained from that blood.

The cell-free DNA extracted from the mother's blood can be sequenced by next-generation sequencing. The term "next-generation sequencing (NGS)" may be used interchangeably with "massively parallel sequencing" or "second-generation sequencing". Next-generation sequencing refers to techniques that sequence millions of nucleic acid fragments simultaneously. It can be performed in a parallel manner using, for example, the 454 platform (Roche), GS FLX Titanium, Illumina MiSeq, Illumina HiSeq, Illumina Genome Analyzer, the Solexa platform, the SOLiD System (Applied Biosystems), Ion Proton (Life Technologies), Complete Genomics, the Helicos Biosciences Heliscope, Pacific Biosciences single-molecule real-time (SMRT™) technology, or a combination thereof.

In some embodiments, the nucleic acid may be fragmented or cleaved before, during, or after the method described in the present invention. The fragmented or cleaved nucleic acid may have a nominal, average, or mean length of about 5 to about 10,000 base pairs, about 100 to about 1,000 base pairs, about 100 to about 500 base pairs, or about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, or 9000 base pairs. Fragments can be generated by any suitable method known in the art, and the average, mean, or nominal length of the nucleic acid fragments can be controlled by selecting an appropriate fragment-generating procedure. Nucleic acid fragments may contain overlapping nucleotide sequences, and such overlapping sequences can facilitate reconstruction of the nucleotide sequence of the unfragmented counterpart nucleic acid or a segment thereof.

In some embodiments, the nucleic acid is fragmented or cleaved by a suitable method. Non-limiting examples include physical methods (e.g., shearing, sonication, French press, heating, UV irradiation), enzymatic methods (e.g., enzymatic cleavage agents such as a suitable nuclease, a suitable restriction enzyme, or a suitable methylation-sensitive restriction enzyme), chemical methods (e.g., alkylation, DMS, piperidine, acid hydrolysis, base hydrolysis, heating, or a combination thereof), and combinations thereof.

Step (a) of the present invention may further include preparing a nucleic acid library for next-generation sequencing. The nucleic acid library can be prepared according to the next-generation sequencing method used, that is, according to the instructions of the manufacturer that provides the sequencing platform.

The sequence information obtained from the nucleic acid fragments may be referred to as reads.

(b) mapping the sequenced reads to a reference genome sequence;

In step (b) of the present invention, the sequenced reads can be mapped, and the number of reads mapped to a specified nucleic acid region (e.g., a chromosome, or a portion or segment thereof) may be referred to as a count. Any suitable mapping method (e.g., a process, algorithm, program, software, module, or a combination thereof) may be used in step (b). Some aspects of the mapping process are described below.

Mapping of sequence reads (i.e., sequence information from fragments whose physical genomic location is unknown) can be performed in a number of ways, and often includes aligning the obtained sequence reads with matching sequences in the reference genome. In such an alignment, sequence reads are generally aligned to the reference genome sequence, and the aligned reads may be designated as "mapped", "mapped sequence reads", or "mapped reads". In some embodiments, a mapped sequence read may be referred to as a "hit" or a "count".

As used herein, the terms "aligned", "alignment", and "aligning" refer to two or more nucleic acid sequences that can be identified as a match (e.g., 100% identity) or a partial match. Alignment can be performed manually or by computer (e.g., software, a program, a module, or an algorithm). A non-limiting example of such a computer program is the Efficient Local Alignment of Nucleotide Data (ELAND) program distributed as part of the Illumina genomic analysis pipeline. The alignment of a sequence read can be a 100% sequence match. In some cases, the alignment is less than a 100% sequence match (i.e., an imperfect match, partial match, or partial alignment). In some embodiments, the alignment may be about a 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76%, or 75% match. In some embodiments, the alignment may include mismatches.
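To make the notion of a percentage match concrete, the following Python sketch computes the percent identity between a read and the reference segment it is aligned to. It is a simplified illustration only: it assumes a gapless alignment of two equal-length strings, and the function name `percent_identity` and the example sequences are hypothetical. Real aligners also handle insertions, deletions, and clipping.

```python
def percent_identity(read, reference):
    """Percentage of positions at which two aligned, equal-length sequences match."""
    if len(read) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    if not read:
        raise ValueError("sequences must be non-empty")
    # Compare case-insensitively, position by position.
    matches = sum(a == b for a, b in zip(read.upper(), reference.upper()))
    return 100.0 * matches / len(read)


# One mismatch in ten positions.
identity = percent_identity("ACGTACGTAC", "ACGTACGTTC")
```

A pipeline could then keep only reads whose identity meets a chosen threshold, for example `identity >= 75.0` for the lowest match level listed above.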

Various computational methods can be used to map each sequence read to a portion. Non-limiting examples of computer algorithms that can be used to align sequences include BWA, BLAST, BLITZ, FASTA, BOWTIE 1, BOWTIE 2, ELAND, MAQ, PROBEMATCH, SOAP, and SEQMAP, and variations or combinations thereof; BWA is most preferred.

In some embodiments, sequence reads can be aligned with sequences in a reference genome. In some embodiments, sequence reads can be found in, and/or aligned with sequences in, nucleic acid databases known in the art, including, for example, the UCSC database (hg19, hg38), GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory), and DDBJ (DNA Databank of Japan). Most preferably, they are aligned with sequences in the UCSC database (hg19, hg38).

(c) dividing each chromosome of the reference genome into preset sections (bins) and calculating, from the mapped reads, the number of reads mapped to each section;

In some embodiments, the mapped reads are grouped together according to various parameters and assigned to specific portions (e.g., portions of the reference genome). Often, individual mapped reads can be used to identify a section present in the sample (e.g., the presence, absence, or amount of a portion). In some embodiments, the amount of a section is indicative of the amount of a larger sequence (e.g., a chromosome) in the sample.

The term "section" may also be referred to herein as a "genomic section", "bin", "region", "portion", "portion of a reference genome", "portion of a chromosome", or "genomic portion". In some embodiments, a section may be an entire chromosome, a segment of a chromosome, a segment of the reference genome, a segment spanning multiple chromosomes, multiple chromosome segments, and/or combinations thereof. In some embodiments, a section may be predefined based on specific parameters (e.g., indicators). In some embodiments, sections may be defined arbitrarily or non-arbitrarily based on a division of the genome (e.g., divided by size, GC content, contiguous regions, contiguous regions of arbitrarily defined size, and the like). In some embodiments, sections may be selected from discrete genomic bins, genomic bins having contiguous sequences of a predetermined length, variable-size bins, point-based views of a smoothed coverage map, and/or combinations thereof.

In some embodiments, a section may be delineated based on one or more parameters, including, for example, the length of the sequence or one or more particular features. Sections can be selected, filtered, and/or removed from consideration using any suitable criteria known in the art or described herein. In some embodiments, a section is based on a particular length of genomic sequence. In some embodiments, sections may have approximately the same length or may have different lengths. In some embodiments, sections are of equal length. In some embodiments, sections of different lengths may be adjusted or weighted.

Preferably, in the present invention, a section may be about 10 kilobases (kb) to about 500 kb, about 10 kb to about 300 kb, about 20 kb to about 200 kb, or about 20 kb to about 100 kb; more preferably about 30 kb to about 80 kb; and most preferably 40 kb to 60 kb.
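As a rough illustration of fixed-length sections, the sketch below splits a chromosome into consecutive, non-overlapping sections. The 50 kb default is simply a value chosen from within the most preferred 40 kb to 60 kb range; the function name `make_bins`, the half-open 0-based coordinate convention, and the example chromosome length are assumptions made for this sketch.

```python
def make_bins(chrom_length, bin_size=50_000):
    """Split [0, chrom_length) into consecutive half-open (start, end) sections.

    The last section is shorter when chrom_length is not a multiple of bin_size.
    """
    if chrom_length <= 0 or bin_size <= 0:
        raise ValueError("chrom_length and bin_size must be positive")
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]


# A toy 120 kb chromosome yields two full sections and one 20 kb remainder.
bins = make_bins(120_000)
```

Whether a short trailing section is kept, merged, or dropped is a design choice this sketch leaves open.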

In the present invention, a section is not limited to a contiguous run of sequence; it may consist of contiguous and/or non-contiguous sequences. Nor is a section limited to a single chromosome. In some embodiments, a section may include all or part of one chromosome, or all or part of two or more chromosomes. In some embodiments, a section may span one, two, or more entire chromosomes. In addition, a section may span joined or disjoined regions of multiple chromosomes.

In the present invention, sections can also be processed according to one or more features, parameters, criteria, and/or methods described herein or known in the art (e.g., by normalization, filtering, selection, or a combination thereof). Sections can be processed by any suitable method and according to any suitable parameter. Non-limiting examples of features and/or parameters that can be used to filter and/or select sections include counts, coverage, mappability, variability, level of uncertainty, guanine-cytosine (GC) content, CCF fragment length and/or read length (e.g., fragment length ratio (FLR), fetal ratio statistic (FRS)), DNaseI sensitivity, methylation state, acetylation, histone distribution, chromatin structure, and combinations thereof.

Each chromosome of the reference genome is divided into the partitions described above, and the number of mapped reads in each partition is calculated from the mapped reads.

In some embodiments, sequence reads that are mapped or partitioned based on a selected feature or variable may be quantified to determine the number of reads mapped to one or more partitions (e.g., partitions of a reference genome). In some embodiments, the number of reads mapped to a partition may be referred to as a count (e.g., a count value). In some embodiments, counts for two or more partitions (e.g., a set of partitions) may be mathematically manipulated (e.g., averaged, added, normalized, or the like, or a combination thereof). In some embodiments, a count may be determined from some or all of the reads mapped to (i.e., associated with) a partition. In some embodiments, the number of mapped reads may be determined from a predefined subset of the mapped reads. A predefined subset of mapped sequence reads may be defined or selected using any suitable feature or variable. In some embodiments, a predefined subset of mapped reads may include from 1 to n reads, where n is the total number of reads generated from a test subject or reference subject sample.
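The per-partition counting described above can be sketched as follows. This is a minimal illustration only: it assumes fixed-width partitions, assigns each read to the partition containing its start coordinate, and uses a hypothetical `(chrom, start)` read representation; none of these choices is prescribed by the text.

```python
from collections import Counter

BIN_SIZE = 50_000  # 50 kb, the bin width used in the examples of this document


def count_reads_per_bin(reads, bin_size=BIN_SIZE):
    """Return {(chrom, bin_index): count} for reads given as (chrom, start) pairs.

    `start` is a 0-based mapped position; a read is assigned to the bin that
    contains its start coordinate (an assumption of this sketch).
    """
    counts = Counter()
    for chrom, start in reads:
        counts[(chrom, start // bin_size)] += 1
    return counts


# Hypothetical reads: three on chr21 (two in bin 0, one in bin 1), one on chr1.
reads = [("chr21", 10), ("chr21", 49_999), ("chr21", 50_000), ("chr1", 120_000)]
counts = count_reads_per_bin(reads)
```

Non-contiguous or multi-chromosome partitions, which the text also allows, would need an explicit interval-to-partition lookup instead of integer division.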

In particular, in the present invention the number of mapped reads per partition may be manipulated or transformed (e.g., normalized, combined, added, filtered, selected, averaged, derived as a mean, or processed by a combination thereof). Preferably, in the present invention the number of mapped reads per partition is normalized.

The normalization may be performed by any suitable method described herein or known in the art. In some embodiments, normalization includes adjusting values measured on different scales to a notionally common scale. In some embodiments, normalization includes a sophisticated mathematical adjustment that brings the probability distributions of the adjusted values into alignment. In some embodiments, normalization includes aligning a distribution to a standard distribution. In some embodiments, normalization includes mathematical adjustments that allow corresponding normalized values from different data sets to be compared in a way that eliminates the effects of certain gross influences (e.g., errors and anomalies). In some embodiments, normalization includes scaling. Normalization often includes dividing one or more data sets by a predetermined variable or formula.

In the present invention, the normalization method may be selected from the group consisting of bin-wise normalization, normalization by GC content, linear least squares regression, nonlinear least squares regression, LOESS, GC LOESS, LOWESS, PERUN, repeat masking (RM), GC-normalization and repeat masking (GCRM), conditional quantile normalization (cQn), and combinations thereof, and is preferably normalization by GC content, the LOESS method, or a combination thereof.

Most preferably, the relationship between the GC content of each partition of the reference genome and the calculated number of mapped reads in that partition is fitted by LOESS regression to estimate the GC-content bias, and the read counts are normalized by correcting for that bias. To explain this step in detail: the factor with the greatest influence on the diagnosis of chromosome number abnormalities by next-generation sequencing is the GC-content bias in sequence reads that arises during the sequencing process (Benjamini, Y., & Speed, T. P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40(10)).

That is, this bias is known to arise in the PCR amplification step of massively parallel genome sequencing, and refers to the outcome in which regions of the genome with high GC content are sequenced more efficiently or, conversely, regions with low GC content are sequenced more efficiently.

Because of this GC-content bias, chromosomal abnormality results that differ from reality may be detected. For example, if a diagnosis is made from data in which regions of particularly high GC content were sequenced more efficiently, the proportion of chromosomes with high GC content will appear higher than it actually is, yielding an observation that does not reflect reality. Therefore, when diagnosing chromosomal aneuploidy by next-generation sequencing, it is desirable to suppress this GC-content bias strongly.

In one embodiment of the present invention, GC-content bias was corrected by fitting, with LOESS regression, the relationship between the number of reads in each chromosome partition obtained by next-generation sequencing and the GC content of that partition, calculating the degree of bias due to GC content, and then using that estimate to remove the bias.

Meanwhile, the LOESS method is a regression modeling method known in the art that combines multiple regression models in a k-nearest-neighbor-based meta-model. LOESS is often referred to as locally weighted polynomial regression. In some embodiments, GC LOESS applies a LOESS model to the relationship between fragment counts (e.g., sequence reads, counts) and GC composition for portions of a reference genome. Plotting a smooth curve through a set of data points using LOESS is often called a LOESS curve, particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the y-axis scatterplot criterion variable. For each point in a data set, the LOESS method fits a low-degree polynomial to a subset of the data, with explanatory variable values near the point whose response is being estimated. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points farther away. The value of the regression function for a point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. The LOESS fit is often considered complete after regression function values have been computed for each data point. Many of the details of this method, such as the degree of the polynomial model and the weights, are flexible.
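To make the local-fitting idea concrete, the sketch below implements a simplified LOESS (degree-1 local polynomial, tricube weights, nearest-neighbor bandwidth) and uses it to divide out a GC trend from per-partition counts. It is an illustration under stated assumptions, not the procedure of HMMcopy or of any particular library: the function names, the `frac` parameter, and the rescaling to the overall median are choices made here for the example.

```python
from statistics import median


def loess_fit(x, y, frac=0.5):
    """Simplified LOESS: local linear fit with tricube weights at every x[i]."""
    n = len(x)
    k = min(n, max(2, round(frac * n)))  # neighbours that define each bandwidth
    fitted = []
    for x0 in x:
        # The k-th smallest distance to x0 is the local bandwidth h.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        # Tricube weights: near points count more, points at or beyond h count zero.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Evaluate the local weighted least-squares line at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(counts, gc, frac=0.5):
    """Divide each count by the GC trend and rescale to the overall median."""
    trend = loess_fit(gc, counts, frac)
    centre = median(counts)
    return [c / t * centre if t > 0 else 0.0 for c, t in zip(counts, trend)]


# Hypothetical partitions whose counts rise linearly with GC content.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
counts = [100 + 200 * g for g in gc]
corrected = gc_correct(counts, gc)
```

With a purely GC-driven trend, as in this toy input, the corrected counts collapse to a single value; real data would retain the residual variation that is not explained by GC content.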

(d) calculating, for each chromosome, the median of the per-partition read counts;

Step (d) of the present invention calculates, for each chromosome, the median of the per-partition mapped read counts calculated in step (c), preferably of the normalized read counts.

In the present invention, the median follows its conventional definition.
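Step (d) can be sketched in a few lines. The input layout, an iterable of `(chrom, value)` pairs with one entry per partition, is an assumption made for this example.

```python
from collections import defaultdict
from statistics import median


def chromosome_medians(bin_values):
    """Map each chromosome to the median of its per-partition read counts.

    `bin_values` is an iterable of (chrom, value) pairs, one per partition.
    """
    per_chrom = defaultdict(list)
    for chrom, value in bin_values:
        per_chrom[chrom].append(value)
    return {chrom: median(values) for chrom, values in per_chrom.items()}


# Hypothetical normalized counts; the 5.0 on chr1 is a single outlier partition.
bins = [("chr1", 1.0), ("chr1", 1.2), ("chr1", 5.0), ("chr21", 0.9), ("chr21", 1.1)]
medians = chromosome_medians(bins)
```

Unlike a mean, the median of chr1 in this toy input is unaffected by how extreme the outlier partition is, which is one common reason to summarize partitions this way.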

(e) calculating a first-stage Z-score for each chromosome according to Equation 1 below;

[Equation 1]

(f) calculating a second-stage Z-score for each chromosome according to Equation 2 below; and

[Equation 2]

The present invention may be characterized in that the read counts calculated for each chromosome are processed in two stages, in steps (e) and (f), to calculate a first-stage Z-score and a second-stage Z-score.

In the present invention, chromosome i means the chromosome being tested for a chromosomal abnormality, and chromosomes 1 to 22 mean the human autosomes.

In step (f) of the present invention, the representative chromosome group used to calculate the second-stage Z-score means a set of two or more chromosomes selected from the group consisting of chromosomes 1 to 12 and chromosomes 14 to 16, preferably a set of five or more chromosomes selected from that group, more preferably a set of nine or more chromosomes selected from that group, and most preferably the set consisting of chromosomes 1 to 12 and chromosomes 14 to 16.

Thus, in the second-stage Z-score calculation, chromosomes 13, 18 and 21, which are at risk of trisomy and could therefore affect the result, are excluded, and chromosomes 19, 20 and 22, which have particularly high or low GC content, are also excluded from the data processing, reducing the bias they could introduce and giving more accurate results.
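The two-stage calculation can be sketched as below. Because the formula images for Equations 1 and 2 are not reproduced in this text, the sketch rests on an assumption drawn from the surrounding description: stage 1 standardizes each chromosome's median against the mean and standard deviation over autosomes 1 to 22, and stage 2 standardizes the stage-1 scores against the representative group. The use of the population standard deviation and all names and toy values are likewise choices made here.

```python
from statistics import mean, pstdev

AUTOSOMES = list(range(1, 23))
# Representative group named in the text: chromosomes 1 to 12 and 14 to 16.
REPRESENTATIVE = list(range(1, 13)) + [14, 15, 16]


def zscores(values, reference):
    """Z-score every key of `values` against the mean/SD of the `reference` keys."""
    ref = [values[c] for c in reference]
    mu, sd = mean(ref), pstdev(ref)  # population SD: an assumption of this sketch
    return {c: (v - mu) / sd for c, v in values.items()}


def two_stage_zscores(medians, representative=REPRESENTATIVE):
    """medians: {chromosome number: median normalized read count}."""
    z1 = zscores(medians, AUTOSOMES)   # stage 1: against all autosomes
    z2 = zscores(z1, representative)   # stage 2: against the representative group
    return z1, z2


# Synthetic medians (not data from this document): small scatter around 1.0,
# with chromosome 21 raised to mimic a trisomy signal.
scatter = [0.000, 0.002, -0.001, 0.001, -0.002]
medians = {c: 1.0 + scatter[(c - 1) % 5] for c in AUTOSOMES}
medians[21] = 1.02
z1, z2 = two_stage_zscores(medians)
```

In this toy input the elevated chromosome inflates the stage-1 standard deviation because it is part of the stage-1 reference; excluding it from the stage-2 reference is what lets its stage-2 score grow.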

According to one embodiment of the present invention, the trisomy risk for chromosome 21, which could not be determined from the first-stage Z-score, was confirmed to be determinable from the second-stage Z-score. This result means that the second-stage Z-score can markedly reduce false-negative results in prenatal testing and increase the accuracy of the test.

The method of the present invention may further include, after step (g), determining the risk of a fetal chromosomal abnormality using a cut-off value preset for each chromosome. That is, when cell-free DNA obtained from a given pregnant woman is analyzed according to the method of the present invention and the second-stage Z-score of a particular chromosome is above or below the cut-off, the fetus of that woman may be determined to be at high risk of an abnormality of that chromosome.

In particular, the cut-off value for the particular chromosome may be greater than the maximum second-stage Z-score in a group of pregnant women carrying normal fetuses and less than the minimum second-stage Z-score in a group of pregnant women carrying fetuses with an abnormality of that chromosome. Preferably, the cut-off value may be set to the mean of the maximum second-stage Z-score in the normal group and the minimum second-stage Z-score in the abnormal group. It will be readily understood, however, that a person skilled in the art can easily set such a cut-off value for each chromosome within a given experimental setting.
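The preferred cut-off rule above, the midpoint between the highest normal score and the lowest abnormal score, can be sketched as follows. The function names and the scores are hypothetical, and the sketch handles only the case where abnormal samples score higher than normal ones (a gain such as trisomy); the text also contemplates scores below a cut-off.

```python
def midpoint_cutoff(normal_z2, abnormal_z2):
    """Cut-off halfway between the highest normal and the lowest abnormal score.

    Raises ValueError when the two groups overlap, because no single
    threshold can then separate them.
    """
    highest_normal = max(normal_z2)
    lowest_abnormal = min(abnormal_z2)
    if highest_normal >= lowest_abnormal:
        raise ValueError("groups overlap; no separating cut-off exists")
    return (highest_normal + lowest_abnormal) / 2


def is_high_risk(z2, cutoff):
    """Flag a sample whose second-stage Z-score exceeds the cut-off."""
    return z2 > cutoff


# Made-up second-stage Z-scores for one chromosome.
cutoff = midpoint_cutoff([-1.2, 0.4, 2.1], [6.0, 9.5])
```

Placing the threshold at the midpoint leaves equal margin on both sides of the gap; a laboratory could shift it toward either group to trade false positives against false negatives.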

Meanwhile, the chromosomal abnormality that can be determined by the method of the present invention may be aneuploidy of a fetal chromosome.

In the present invention, the term "aneuploidy" does not refer to a specific number of chromosomes, but to a situation in which the chromosome content of a given cell or cells of an organism is abnormal. In some embodiments, the term "aneuploidy" herein refers to an imbalance of genetic material caused by the loss or gain of a whole chromosome or part of a chromosome. "Aneuploidy" may refer to one or more deletions and/or insertions of a segment of a chromosome. In some embodiments, the term "euploid" refers to a normal complement of chromosomes.

As one example of aneuploidy in the present invention, "monosomy" refers to the lack of one chromosome of the normal complement. Partial monosomy can occur in unbalanced translocations or deletions, in which only a segment of the chromosome is present in a single copy.

As another example of aneuploidy in the present invention, "trisomy" refers to the presence of three copies, instead of two, of a particular chromosome. The extra chromosome 21 found in human Down syndrome is a representative example. Trisomy 18 and trisomy 13 are two other human autosomal trisomies.

The chromosomal abnormality may be caused by a variety of mechanisms. These mechanisms include, but are not limited to, (i) nondisjunction occurring as a result of a weakened mitotic checkpoint, (ii) inactive mitotic checkpoints causing nondisjunction at multiple chromosomes, (iii) merotelic attachment occurring when one kinetochore is attached to both mitotic spindle poles, (iv) a multipolar spindle forming when more than two spindle poles form, (v) a monopolar spindle forming when only a single spindle pole forms, and (vi) a tetraploid intermediate occurring as an end result of the monopolar spindle mechanism.

As further examples of chromosomal aneuploidy, "partial monosomy" and "partial trisomy" refer to an imbalance of genetic material caused by the loss or gain of part of a chromosome. Partial monosomy or partial trisomy can result from an unbalanced translocation, in which an individual carries a derivative chromosome formed by the breakage and fusion of two different chromosomes. In this situation, the individual would have three copies of part of one chromosome (two normal copies and the segment present on the derivative chromosome) and only one copy of part of the other chromosome involved in the derivative chromosome.

The type of chromosomal aneuploidy that can be detected by the method of the present invention is not particularly limited; non-limiting examples and associated diseases are shown below.

Most preferably, in the present invention the chromosomal aneuploidy may be selected from the group consisting of trisomy 13, trisomy 18, trisomy 21, and combinations thereof.

The present invention also includes (a) a sequencing unit for extracting cell-free DNA (cfDNA) from maternal blood and performing Next Generation Sequencing;

[Equation 1]

[Equation 2]

Accordingly, by using the second-stage Z-score amplified through two stages of data processing, the present invention separates the read count of a particular chromosome from the read counts of the normal sample group as far as possible, thereby reducing the possibility of false-positive and false-negative determinations of chromosomal abnormalities, and can be used to increase the accuracy of non-invasive prenatal testing.

FIG. 1 shows the output of HMMcopy, a tool that divides all mapped reads into 50 kb bins and normalizes them according to GC content.
FIG. 2 is a graph comparing read counts before normalization with read counts normalized according to GC content (upper panel: before normalization; lower panel: after normalization).
FIG. 3 shows the median of each chromosome used to calculate the first-stage Z-score.
FIG. 4 shows the values obtained while calculating the first-stage Z-score from the median of each chromosome and then calculating the second-stage Z-score. In a case of trisomy 21, the value for chromosome 21 increases markedly with each stage of data processing.
FIG. 5, like FIG. 4, shows the values obtained while calculating the first-stage Z-score from the median of each chromosome and then calculating the second-stage Z-score. For a normal sample, the difference between the first-stage and second-stage Z-scores is not large.
FIG. 6 shows a result in which chromosomal aneuploidy was detected by both the first-stage and second-stage Z-scores in a sample with Down syndrome (trisomy 21). The second-stage Z-score is markedly higher than the first-stage Z-score.
FIG. 7 shows a result in which, in a sample with Down syndrome (trisomy 21), no chromosomal abnormality was detected by the first-stage Z-score but one was detected by the second-stage Z-score. This sample could have been missed as a false negative had only the first-stage Z-score been used; using the second-stage Z-score reduces the possibility of a false negative.
FIG. 8 shows a result in which, in a sample with Down syndrome (trisomy 21), no chromosomal abnormality was detected by the first-stage Z-score but one was detected by the second-stage Z-score.
FIG. 9 shows how detection of a chromosomal abnormality differs with the choice of representative chromosome group in Equation 2, which is used to calculate the second-stage Z-score in the present invention (a: chromosomes 1 to 12 and 14 to 16 selected as the representative group; b: chromosomes 1 to 22 selected as the representative group).

Hereinafter, preferred examples are presented to aid understanding of the present invention. However, the following examples are provided only to make the present invention easier to understand, and the scope of the present invention is not limited by them.

<Example 1> Preprocessing of next-generation sequencing (NGS) data

Blood was collected from a pregnant woman and plasma was separated by centrifugation; at least 30 ng of cfDNA (cell-free DNA) was then extracted from the separated plasma to prepare a library. Bead size selection was performed using Illumina adapters, the libraries were pooled, and the nucleotide sequences were read by next-generation sequencing (NGS).

The sequences (reads) produced by next-generation sequencing (NGS) were mapped to the human reference genome sequence, and PCR duplicates were removed, completing the preprocessing.
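The text does not name the tool or the rule used to remove PCR duplicates. As a hedged illustration of one common convention, the sketch below treats alignments that share chromosome, start position and strand as duplicates and keeps the first one seen; the record layout and function name are hypothetical.

```python
def remove_pcr_duplicates(alignments):
    """Keep the first alignment seen for each (chrom, start, strand) key.

    `alignments` is an iterable of dicts with keys "chrom", "start" and
    "strand"; this record layout is hypothetical.
    """
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


# Hypothetical alignments: the second record duplicates the first.
alignments = [
    {"chrom": "chr21", "start": 1000, "strand": "+"},
    {"chrom": "chr21", "start": 1000, "strand": "+"},
    {"chrom": "chr21", "start": 1000, "strand": "-"},
    {"chrom": "chr1", "start": 1000, "strand": "+"},
]
unique = remove_pcr_duplicates(alignments)
```

Production pipelines normally delegate this step to a dedicated duplicate-marking tool that also considers mate positions and base qualities; the sketch only conveys why duplicates would otherwise inflate per-partition counts.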

<Example 2> Normalization of the preprocessed data according to GC content

The preprocessed reads were divided into 50 kb bins and normalized according to GC content. HMMcopy, a tool that uses a LOESS model to correct and normalize the data according to GC-content bias, was used. The whole-genome sample was divided into 50 kb bins, and the read count of each 50 kb bin was normalized according to its GC-content bias.

FIG. 1 shows, for one sample divided into 50 kb bins, the reads column (the read count), the gc column (the GC content of each bin), and the cor.gc column (the read value normalized according to GC-content bias).

FIG. 2 compares the data before and after normalization for GC-content bias. Before normalization, the preprocessed read counts vary widely across bins and the log values lie between 1 and 1.5; after normalization for GC-content bias, the read counts vary less across bins and the log values lie between 0.7 and 1.3.

<Example 3> Calculation of the median of the reads partitioned into 50 kb bins

Using the read counts normalized according to GC content in each 50 kb bin (the cor.gc column in FIG. 1), the median was calculated for each chromosome.

In FIG. 3, each black dot is the read count normalized according to GC content in a 50 kb bin (the cor.gc column), and the median of the cor.gc column for each chromosome is shown as a solid yellow line.

<Example 4> Calculation of the first-stage and second-stage Z-scores

A Z-score is generally calculated using a mean and a standard deviation. In the present invention, the first-stage Z-score was calculated by substituting into Equation 1 the mean and standard deviation of the medians of the normalized read counts of the chromosomes.

[Equation 1]

Then, the mean and standard deviation were calculated once more, this time from the first-stage Z-scores of the chromosomes, and were finally substituted into Equation 2 to calculate the second-stage Z-score.

[Equation 2]

(In the above equation, the representative chromosome group consists of chromosomes 1 to 12 and chromosomes 14 to 16.)
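The formula images for Equations 1 and 2 are not reproduced in this text. Based only on the verbal description in this example, they can plausibly be read as follows; this is a reconstruction under that assumption, not the published formulas, and the symbols m, A, R, Z^(1) and Z^(2) are introduced here for illustration (the `\operatorname` command assumes the amsmath package).

```latex
% m_j : median normalized read count of chromosome j (step (d))
% A   : the autosomes, chromosomes 1 to 22
% R   : the representative chromosome group {1,...,12,14,15,16}
\[
  Z^{(1)}_i
  = \frac{m_i - \operatorname{mean}_{j \in A}(m_j)}
         {\operatorname{sd}_{j \in A}(m_j)}
\]
\[
  Z^{(2)}_i
  = \frac{Z^{(1)}_i - \operatorname{mean}_{j \in R}\bigl(Z^{(1)}_j\bigr)}
         {\operatorname{sd}_{j \in R}\bigl(Z^{(1)}_j\bigr)}
\]
% Substituting the first expression into the second, the stage-1 mean and SD cancel:
\[
  Z^{(2)}_i
  = \frac{m_i - \operatorname{mean}_{j \in R}(m_j)}
         {\operatorname{sd}_{j \in R}(m_j)}
\]
```

Under this reading, the second-stage value measures how far the median of chromosome i lies from the representative group's medians, in units of that group's spread, which is why the composition of the representative group matters for the result.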

<Example 5> Determination of chromosomal abnormalities using the second-stage Z-score

Using the second-stage Z-score, each sample was classified as belonging to the normal group or the abnormal group according to the cut-off value for determining an abnormality of each chromosome.

For each chromosome, the second-stage Z-scores of a normal group (pregnant women carrying normal fetuses) and an abnormal group (pregnant women carrying fetuses with a chromosomal abnormality) were collected, and a value that separates the two groups was selected as the cut-off. The cut-off was set to a value smaller than the minimum second-stage Z-score in the abnormal group and larger than the maximum second-stage Z-score in the normal group.

As shown in FIG. 6, in the sample obtained from this pregnant woman, aneuploidy (trisomy) of chromosome 21 was determined not only by the second-stage Z-score but also by the first-stage Z-score.

However, as shown in FIG. 7, in a sample obtained from another pregnant woman, the chromosomal abnormality was determined by the second-stage Z-score but not by the first-stage Z-score. This result means that the second-stage Z-score can prevent false-negative determinations of chromosomal abnormalities in prenatal testing.

The usefulness of the second-stage Z-score in the present invention can also be confirmed from the results in FIG. 8.

FIG. 8a shows the distributions of first-stage Z-scores for the normal group and the abnormal group (pregnant women carrying a fetus with trisomy 21). With the first-stage Z-score, no cut-off can be set that accurately separates the normal group from the abnormal group. That is, the pregnant woman with the highest first-stage Z-score in the normal group cannot be distinguished from the pregnant woman with the lowest first-stage Z-score in the abnormal group, so accurate diagnosis of the chromosomal abnormality would be difficult.

FIG. 8b shows the distributions of second-stage Z-scores for the same normal and abnormal groups as in FIG. 8a. Here the pregnant woman with the highest first-stage Z-score in the normal group and the pregnant woman with the lowest first-stage Z-score in the abnormal group are clearly distinguished, so accurate diagnosis of the chromosomal abnormality is possible by setting a cut-off.

<Example 6> Verification of the usefulness of Equation 2 for calculating the second-stage Z-score

Using the same chromosome analysis results as in Example 5 for the pregnant woman carrying a fetus with trisomy of chromosome 21, the second-stage Z-score was calculated in two ways: with the representative chromosome group in Equation 2 set to chromosomes 1 to 12 and 14 to 16, and with all of chromosomes 1 to 22 selected. The two results were then compared.

As a result, as shown in FIG. 9, when chromosomes 1 to 12 and 14 to 16 were selected as the representative chromosome group in Equation 2 for calculating the second-stage Z-score, the abnormality of chromosome 21 was clearly identified (FIG. 9a), whereas when all of chromosomes 1 to 22 were selected, the abnormality of chromosome 21 was not identified (FIG. 9b).
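The effect of the reference group can be illustrated with a small sketch. Equation 2 is not reproduced in this text, so the code assumes one plausible form: the first-stage Z-score of the target chromosome is standardized against the mean and sample standard deviation of the first-stage Z-scores of the chosen reference chromosomes. The function name `second_stage_z` and all input values are hypothetical and do not reproduce FIG. 9.

```python
from statistics import mean, stdev

# Chromosomes 1 to 12 and 14 to 16, as named in the text.
REPRESENTATIVE = list(range(1, 13)) + [14, 15, 16]
ALL_AUTOSOMES = list(range(1, 23))


def second_stage_z(stage1, target, reference_group):
    """ASSUMED form of Equation 2 (the equation itself is not shown in the text):
    standardize the target's first-stage Z-score against the reference group."""
    reference = [stage1[c] for c in reference_group]
    return (stage1[target] - mean(reference)) / stdev(reference)


# Made-up first-stage Z-scores: small values everywhere, elevated on chromosome 21.
stage1 = {c: 0.1 * ((c % 5) - 2) for c in range(1, 23)}
stage1[21] = 3.0

z_rep = second_stage_z(stage1, 21, REPRESENTATIVE)
z_all = second_stage_z(stage1, 21, ALL_AUTOSOMES)
print(f"representative group: {z_rep:.2f}")
print(f"all chromosomes 1-22: {z_all:.2f}")
```

Under this assumed form, including the target chromosome in its own reference group inflates the reference spread and shrinks the resulting score, whereas a reference group that excludes it leaves the score large. The toy numbers only illustrate this direction of effect.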

By using the second-stage Z-score amplified through two stages of data processing, the present invention separates the read count of a specific chromosome from the read counts of the normal sample group as far as possible. This reduces the possibility of false-positive and false-negative calls of chromosomal abnormalities and can be used to increase the accuracy of non-invasive prenatal testing, so the invention has excellent industrial applicability.

Claims

1. A non-invasive prenatal test method comprising the steps of:
(a) extracting cell-free DNA (cfDNA) from the mother's blood and performing next-generation sequencing;
(b) mapping the sequenced reads to a reference genome sequence;
(c) dividing each chromosome of the reference genome into preset bins and calculating the number of mapped reads for each bin;
(d) calculating, for each chromosome, a median of the number of reads per bin;
(e) calculating a first-stage Z-score of each chromosome according to Equation 1 below;
[Equation 1]

(f) calculating a second-stage Z-score of each chromosome according to Equation 2 below; and
[Equation 2]

(In the above equation, the representative chromosome group is two or more chromosomes selected from the group consisting of chromosomes 1 to 12 and chromosomes 14 to 16.)
(g) determining whether a chromosome is abnormal by checking the Z-score value calculated for each chromosome.

2. The method of claim 1, wherein the bin in step (c) is 10 to 500 kb.

3. The method of claim 1, wherein the number of reads calculated in step (c) is a standardized number of reads.

4. The method of claim 3, wherein the standardized number of reads has a reduced guanine-cytosine (GC) bias compared to the raw number of reads.

5. The method of claim 3, wherein the standardization is performed by a method selected from the group consisting of bin-wise normalization, standardization by GC content, linear least squares regression, nonlinear least squares regression, LOESS, GC LOESS, LOWESS, PERUN, repeat masking (RM), GC normalization and repeat masking (GCRM), conditional quantile normalization (cQn), and combinations thereof.

6. The method of claim 1, wherein the representative chromosome group of Equation 2 consists of chromosomes 1 to 12 and 14 to 16.

7. The method of claim 1, further comprising, after step (g), the step of determining a risk of a chromosomal abnormality of the fetus using a preset cut-off value for each chromosome.

8. The method of claim 7, wherein the cut-off value of a specific chromosome is greater than the maximum second-stage Z-score observed in a group of mothers carrying a normal fetus and less than the minimum second-stage Z-score observed in a group of mothers carrying a fetus with an abnormality in that chromosome.

9. The method of any one of claims 1 to 8, wherein the chromosomal abnormality is an aneuploidy of fetal chromosomes.

10. The method of claim 9, wherein the aneuploidy of the chromosome is selected from the group consisting of a trisomy of chromosome 13, a trisomy of chromosome 18, a trisomy of chromosome 21, and combinations thereof.

11. A non-invasive prenatal testing apparatus comprising:
(a) a sequencing unit that extracts cell-free DNA (cfDNA) from the mother's blood and performs next-generation sequencing;
(b) a mapping unit that maps the sequenced reads to a reference genome sequence;
(c) a read count calculation unit that divides each chromosome of the reference genome into preset bins and calculates the number of mapped reads for each bin;
(d) a median calculation unit that calculates a median of the number of reads for each chromosome;
(e) a first-stage Z-score calculation unit that calculates a first-stage Z-score of each chromosome according to Equation 1 below;
[Equation 1]

(f) a second-stage Z-score calculation unit that calculates a second-stage Z-score of each chromosome according to Equation 2 below; and
[Equation 2]

(In the above equation, the representative chromosome group is two or more chromosomes selected from the group consisting of chromosomes 1 to 12 and chromosomes 14 to 16.)
(g) a determination unit that determines whether a chromosome is abnormal by checking the second-stage Z-score value calculated for each chromosome.