JP2014530629A5


Info

Publication number: JP2014530629A5
Application number: JP2014537440A
Authority: JP
Filing date: 2011-10-28
Publication date: 2016-04-21

Description

However, because diseases of this type are caused by submicroscopic chromosomal changes, they cannot be detected by conventional clinical methods such as chromosome karyotype analysis, whose resolution is 10 Mb or more (Malcolm S. Microdeletion and microduplication syndromes. Prenat Diagn. 1996 Dec; 16(13): 1213-9). Current diagnostic methods for microdeletion/microduplication syndromes mainly include high-resolution chromosome karyotype analysis, FISH (fluorescence in situ hybridization), array CGH (comparative genomic hybridization), MLPA (Multiplex Ligation-dependent Probe Amplification) and PCR-based methods, all of which can be used to detect chromosomal microdeletions and microduplications.

The present invention relates to a method for detecting copy number variation (CNV) of chromosomal DNA fragments in cells. The method comprises the following steps.
a) Genomic DNA molecules obtained from a test sample and from a normal sample are each randomly fragmented to obtain DNA fragments, and the DNA fragments are sequenced to obtain sequencing reads;
b) The reads obtained in step a are aligned to the genomic reference sequence of the species of the sample so that each read is located on the reference sequence, and only reads that map to a unique position on the reference sequence are selected for analysis;
c) Sites satisfying the following condition are searched for on the reference sequence: compared with the alignment result of the normal sample, the copy number variation ratio differs between the two sides of the site. The specific procedure is as follows.
i) For each site b on the reference sequence, the windows on its left and right sides are forced to contain w normal-sample reads each, that is, N(x_L, b) = N(b, x_R) = w, where N(x_L, x_R) is the number of aligned normal-sample reads falling within the window (x_L, x_R);
ii) Among these positions, sites that satisfy

[formula not reproduced in the source text]

are selected, and sites for which D_i(x_L, x_R) = 0, b - w < i < b + w, are removed. A two-sided significance test based on the normal distribution is applied to the test statistic D(x_L, x_R) to obtain p(|D(x_L, x_R)|) for each site, where D(x_L, x_R) = log(R(x_L, x)) - log(R(x, x_R)),

[formula not reproduced in the source text]

and where the numbers of normal-sample reads and test-sample reads uniquely aligned to the reference sequence are a_N and a_T, respectively, and the numbers of uniquely aligned reads falling within the window (x_L, x_R) are N(x_L, x_R) and T(x_L, x_R), respectively;
iii) p_bkp is set, and the above steps are repeated until all sites satisfying p(|D(x_L, x_R)|) > p_bkp are obtained; the resulting candidate site set is Bc = {b₁, b₂, ..., b_N};
Here, p_bkp may be set freely; for example, based on control sample data, p_bkp may be set to the smallest p(|D(x_L, x_R)|) at which the number of initial candidate sites is 10, 100, 1000 or 10000. Alternatively, p_bkp may be chosen as follows: a normal sample is treated as the test sample, steps a) to c) ii) are executed, all p(|D(x_L, x_R)|) values are filtered by false discovery rate control (FDR control), and the last p(|D(x_L, x_R)|) among the filtered sites to pass the FDR threshold is taken as p_bkp. False discovery rate control is performed as follows.
The data set under test is sorted by significance (P value), starting from the lowest, to obtain the ranks (r).
Proceeding from top to bottom, the sites are tested up to the last site k that satisfies

[formula not reproduced in the source text]

where P_k is the P value at the k-th position, r_k is the rank of the k-th position, N is the total number of sites, and α is the significance level, for example 0.01. Site k and all sites before it are retained, and the subsequent false-positive sites are removed.
d) For each site k in the candidate site set Bc = {b₁, b₂, ..., b_N} obtained on the reference sequence in step c, there are windows (b_{k-1}, b_k - 1) and (b_k, b_{k+1}) on its two sides. Sites for which the difference in copy number variation ratio between the two windows is relatively small are removed: in each round, the site k for which

[formula not reproduced in the source text]

is largest is deleted, and the p value of the merged interval (b_{k-1}, b_{k+1}) is updated. With p_merge set, this step is repeated until all sites satisfy

[formula not reproduced in the source text]

The remaining sites are those that meet the requirements for identifying CNVs, that is, the sites at which chromosomal copy number variation occurs.
Here, p_merge may be set freely; for example, p_merge may be set to the largest p(|D(x_L, x_R)|) at which the number of remaining sites is 1/2, 1/10, 1/100 or 1/1000 of the original number. Alternatively, p_merge may be chosen as follows: a normal sample is treated as the test sample and steps a) to d) are executed so that the number of candidate sites after merging is 1/2, 1/10, 1/100 or 1/1000 of the initial number of sites; the largest p(|D(x_L, x_R)|) is then chosen as p_merge.
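The false discovery rate control used above to choose p_bkp can be illustrated with a short sketch. The inequality itself is not reproduced in this text, so the code assumes the common Benjamini-Hochberg form P_k ≤ (r_k / N) × α, which is consistent with the symbols P_k, r_k, N and α defined above but is not confirmed by the source. The function name `fdr_filter` is hypothetical.

```python
def fdr_filter(p_values, alpha=0.01):
    """Keep sites up to the last rank k with P_k <= (r_k / N) * alpha.

    ASSUMPTION: the Benjamini-Hochberg inequality stands in for the
    formula that is missing from the text.
    Returns (indices of retained sites, P value of the last retained site).
    """
    n = len(p_values)
    # Sort site indices by ascending P value; the rank r starts at 1.
    order = sorted(range(n), key=lambda i: p_values[i])
    last_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * alpha:
            last_k = rank  # remember the last rank that passes
    kept = order[:last_k]  # site k and all sites before it
    cutoff = p_values[order[last_k - 1]] if last_k else None
    return kept, cutoff


kept, cutoff = fdr_filter([0.001, 0.2, 0.004, 0.5, 0.012], alpha=0.05)
print(kept, cutoff)
```

Under this assumption, the returned cutoff plays the role of the "last p value to pass the FDR threshold" when a normal sample is run as the test sample.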

Advantages of the present invention
Compared with the methods commonly used at present to detect chromosomal microdeletions/microduplications (such as high-resolution chromosome karyotype analysis, FISH, array CGH and PCR), the present invention has the following main advantages.
1) High resolution. The precision of chromosomal CNV analysis reaches 100 kb, so chromosomal microdeletions/microduplications can be detected effectively.
2) Applicability to a wider range of data and better use of memory. The algorithm was re-implemented and the data processing improved: the original SegSeq software was suitable only for low-depth sequencing data of 1-4×, whereas the improved SegSeq can analyze data at sequencing depths from 1× to 30×.
3) Whole-genome coverage. Based on second-generation sequencing technology, the present invention performs chromosomal CNV analysis across the whole genome and can discover new chromosomal abnormalities without relying on known probes or designing probes.
4) High throughput. Based on high-throughput sequencing technology, the present invention performs chromosomal CNV analysis at high throughput; by adding a different index sequence to each sample, a large number of samples can be analyzed in a single batch.
5) Low cost. As sequencing technology continues to develop and sequencing costs continue to fall, the cost of the chromosomal CNV analysis of the present invention will also keep decreasing.

In the present invention, searching for breakpoints for CNV analysis of a test sample means using the improved SegSeq algorithm, with a normal sample as the negative control, to find candidate sites on the reference genome sequence at which the difference between the two sides of the site in the copy number variation ratio of the test sample relative to the normal sample meets a given requirement; these sites are the breakpoints. The breakpoint search comprises two steps: (1) initialization, whose purpose is to select candidate points; and (2) repeated merging of adjacent segments, whose purpose is to reduce the false-positive rate.

The underlying principle and mathematical model are as follows. On the premise that the reads obtained by sequencing derive from random fragments of genomic DNA, the number of reads falling in a region after alignment should follow a Poisson distribution. Let A be the total alignable length of the genome (A = 2.2 × 10⁹), let a_N and a_T be the numbers of reads from the normal sample and the test sample that can be aligned to the reference sequence, and let N(x_L, x_R) and T(x_L, x_R) be the numbers of reads falling within the window (x_L, x_R), whose size is L = x_R - x_L + 1. Then N and T follow Poisson distributions with parameters

[formula not reproduced in the source text]

and

[formula not reproduced in the source text]

respectively, with λ_T = r × a × λ_N and a = a_T/a_N. The copy number variation ratio is defined as

[formula not reproduced in the source text]

When the sample is large, R(x_L, x_R) is approximately log-normally distributed. Define D(x_L, x_R) = log(R(x_L, x)) - log(R(x, x_R)), with x_L < x < x_R. Because R(x_L, x_R) is approximately log-normal, D(x_L, x_R) follows a normal distribution, so the two-sided P value p(|D(x_L, x_R)| > d) can be used to test whether the difference in copy number variation ratio between the two sides of a site is significant.
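The test statistic described above can be sketched numerically. Two points are assumptions, because the corresponding formulas are not reproduced in this text: the ratio is taken as R = (T / a_T) / (N / a_N), which follows from λ_T = r × a × λ_N with a = a_T/a_N, and the variance of D is approximated by the delta method for Poisson counts as 1/T_left + 1/N_left + 1/T_right + 1/N_right. The function names are hypothetical.

```python
import math


def copy_ratio(t_reads, n_reads, a_t, a_n):
    """ASSUMED definition: test/normal read ratio normalized by library size."""
    return (t_reads / a_t) / (n_reads / a_n)


def breakpoint_p_value(t_left, n_left, t_right, n_right, a_t, a_n):
    """Return (D, two-sided p) for the windows left and right of a site.

    All four read counts must be positive. The variance below is a
    delta-method approximation, not a formula taken from the source.
    """
    d = (math.log(copy_ratio(t_left, n_left, a_t, a_n))
         - math.log(copy_ratio(t_right, n_right, a_t, a_n)))
    variance = 1 / t_left + 1 / n_left + 1 / t_right + 1 / n_right
    z = abs(d) / math.sqrt(variance)
    # Two-sided tail probability of a standard normal: P(|Z| > z).
    p = math.erfc(z / math.sqrt(2))
    return d, p


# Left window has half the expected test reads, right window is balanced.
d, p = breakpoint_p_value(150, 300, 300, 300, a_t=1_000_000, a_n=1_000_000)
print(d, p)
```

A site whose two sides have the same ratio gives D = 0 and p = 1, while a halving of the ratio on one side with w = 300 gives a very small p value.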

The initialization in step (1) of the breakpoint search refers to the preliminary selection of candidate points. Specifically, for a position b on the reference sequence, the windows on its left and right sides are forced to contain w normal-sample reads each, that is, N(x_L, b) = N(b, x_R) = w. Among these positions, those that satisfy

[formula not reproduced in the source text]

are added to the candidate list, and those for which D_i(x_L, x_R) = 0, b - w < i < b + w, are removed and not listed as candidate points. By setting a suitable p_bkp, the above steps are repeated until all sites satisfying p(|D(x_L, x_R)|) > p_bkp are obtained, which yields a suitable number of candidate points.
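The window construction in the initialization step, N(x_L, b) = N(b, x_R) = w, can be sketched as follows. This is only an illustration under stated assumptions: read positions are given as sorted lists of alignment start coordinates, the left window is taken as [x_L, b) and the right window as [b, x_R], and the helper names `flanking_windows` and `count_in` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def flanking_windows(normal_pos, b, w):
    """Find x_L and x_R so that w normal reads lie on each side of b.

    normal_pos: sorted start positions of uniquely aligned normal reads.
    Returns None when fewer than w reads are available on either side.
    Note: ties at the window edge may put slightly more than w reads in a window.
    """
    i = bisect_left(normal_pos, b)  # reads with index < i lie left of b
    if i < w or i + w > len(normal_pos):
        return None
    return normal_pos[i - w], normal_pos[i + w - 1]


def count_in(pos, lo, hi):
    """Number of reads with lo <= position <= hi in a sorted list."""
    return bisect_right(pos, hi) - bisect_left(pos, lo)


normal = [10, 20, 30, 40, 50, 60]
test = [12, 22, 25, 33, 41]
b, w = 35, 2
x_left, x_right = flanking_windows(normal, b, w)
t_left = count_in(test, x_left, b - 1)   # T(x_L, b)
t_right = count_in(test, b, x_right)     # T(b, x_R)
print(x_left, x_right, t_left, t_right)
```

Fixing the number of normal reads per window, rather than the window length, keeps the denominator of the ratio constant at w on both sides of every site.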

The repeated merging of adjacent segments in step (2) of the breakpoint search means that adjacent segments whose copy number variation ratios differ relatively little are merged by maximum-likelihood processing, thereby reducing the false-positive rate. Specifically, let the set of candidate points on the reference sequence obtained in step (1) be Bc = {b₁, b₂, ..., b_N}, and let the windows on the left and right of candidate point k be (b_{k-1}, b_k - 1) and (b_k, b_{k+1}), respectively. Sites for which the difference in copy number variation ratio between the two windows is relatively small are removed: in each round, the site k for which

[formula not reproduced in the source text]

is largest is deleted and the p value of the merged interval (b_{k-1}, b_{k+1}) is updated. With p_merge set, this step is repeated until all sites satisfy

[formula not reproduced in the source text]

The remaining sites are those that meet the requirements for identifying CNVs.
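The iterative merging described above can be sketched as a loop that repeatedly drops the least significant candidate point. The missing formulas are not reconstructed here; the sketch assumes that the quantity maximized in each round is the p value of the site given its current neighbours, and that the loop stops once every remaining site has a p value below p_merge. The function `merge_breakpoints` and the callable `p_value` are hypothetical names.

```python
def merge_breakpoints(breakpoints, start, end, p_value, p_merge):
    """Delete the candidate with the largest p value until all are < p_merge.

    p_value(left, b, right) must return the p value of site b when its
    neighbouring breakpoints (or chromosome ends) are left and right.
    ASSUMPTION: the stopping rule is max p < p_merge.
    """
    bps = sorted(breakpoints)
    while bps:
        bounds = [start] + bps + [end]
        # Recompute every p value, since a deletion changes the neighbours.
        scores = [p_value(bounds[i - 1], bounds[i], bounds[i + 1])
                  for i in range(1, len(bounds) - 1)]
        worst = max(range(len(bps)), key=scores.__getitem__)
        if scores[worst] < p_merge:
            break
        del bps[worst]  # merge the two segments around this site
    return bps


# Toy p-value function: only position 500 is a real change point.
def toy_p(left, b, right):
    return 1e-9 if b == 500 else 0.5


print(merge_breakpoints([100, 500, 800], 0, 1000, toy_p, p_merge=1e-3))
```

Recomputing all p values in each round is the simplest way to reflect the update of the merged interval; a production implementation would update only the two neighbours of the deleted site.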

In the present invention, the existing CNV and disease database refers to an existing database of copy number variations and disease-related information. In one embodiment of the present invention, the database used is DECIPHER (https://decipher.sanger.ac.uk/syndromes); for all 58 microdeletion/microduplication syndromes listed in this database, the relationship between the deleted or duplicated segment and the disease is clearly established.

In one embodiment of the present invention, a specific method for performing chromosomal CNV analysis on chorionic villus tissue comprises the following steps.
1. DNA extraction and sequencing: chorionic villus tissue DNA is extracted according to the manual of a magnetic-bead genomic DNA extraction kit (for example, Tiangen DP329), and a library is then constructed according to the standard Illumina/Solexa library construction procedure. In this process, the villus tissue DNA is randomly sheared by sonication into DNA molecules concentrated around 500 bp, sequencing adapters are added to both ends, and a different index sequence is added to each sample so that the data of multiple samples can be distinguished within the data obtained from a single sequencing run.
2. Alignment and statistics: sequencing is performed with the second-generation sequencing method Illumina/Solexa (other sequencing methods, for example ABI/SOLiD, give the same or similar results), yielding for each sample DNA sequences of fragments of a given size, that is, reads. The reads are aligned with SOAP to the standard human genome reference sequence in the NCBI database to obtain the position of each sequenced DNA fragment on the genome. To avoid interference from repetitive sequences in the CNV analysis, only reads that align uniquely to the human genome reference sequence (unique reads) are selected as valid data for the subsequent CNV analysis, and their number a_T is counted.
3. Data analysis: with a known normal sample as the negative sample, CNV analysis with the SegSeq algorithm is used to find the breakpoints needed for CNV analysis and to calculate the copy number variation ratio of the test sample relative to the normal sample. By setting a detection threshold, the microdeletion/microduplication status of chromosomal segments in the test sample is determined, a digital karyotype of the chromosomes is drawn, and the corresponding genes are annotated. The specific process is as follows.
1) Initialization. On a given chromosome, for a position b, the parameter w is set so that the windows on its left and right sides each contain 300 normal-sample reads, that is, N(x_L, b) = N(b, x_R) = w = 300. Among the read positions of the test sample, those that satisfy

[formula not reproduced in the source text]

are added to the candidate list, and those for which D_i(x_L, x_R) = 0, b - w < i < b + w, are removed. The p_bkp-related parameter is set to 1000 so that the initialization stage outputs 1000 candidate points. The above removal and addition steps are repeated until all p(|D(x_L, x_R)|) > p_bkp, and the candidate point set Bc = {b₁, b₂, ..., b_N} on chromosome c is output.
2) Repeated merging of adjacent segments. Starting from the candidate point set obtained by initialization, the windows on the left and right of candidate point k are (b_{k-1}, b_k - 1) and (b_k, b_{k+1}), respectively. The p_merge-related parameter is set to 10 so that the iterative merging stage outputs at most 10 false-positive segments. Adjacent segments whose copy number variation ratios differ relatively little are merged repeatedly until all

[formula not reproduced in the source text]

which yields the valid candidate points, that is, the breakpoints, required for the final CNV analysis.
3) CNV analysis. The final breakpoints are tallied; taking the window between two breakpoints as (x_L, x_R), the CNV ratio of the test sample relative to the normal sample

[formula not reproduced in the source text]

is calculated. CNV ratios ≤ 0.75 and ≥ 1.25 are used as the detection thresholds for chromosomal segment deletion and duplication, respectively. After the microdeletion/microduplication results are obtained, a digital karyotype of the chromosomes is drawn and the genes are annotated.
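The threshold step in the CNV analysis above can be illustrated with a small sketch. The ratio formula is not reproduced in this text, so the code assumes the library-size-normalized ratio (T / a_T) / (N / a_N); the thresholds 0.75 and 1.25 come from the text. The function name `call_cnv_segments` and the tuple layout of the segments are hypothetical.

```python
def call_cnv_segments(segments, a_t, a_n,
                      del_threshold=0.75, dup_threshold=1.25):
    """Label each segment between two breakpoints by its CNV ratio.

    segments: iterable of (x_left, x_right, t_reads, n_reads), n_reads > 0.
    ASSUMPTION: ratio = (t_reads / a_t) / (n_reads / a_n).
    """
    calls = []
    for x_left, x_right, t_reads, n_reads in segments:
        ratio = (t_reads * a_n) / (n_reads * a_t)
        if ratio <= del_threshold:
            status = "microdeletion"
        elif ratio >= dup_threshold:
            status = "microduplication"
        else:
            status = "normal"
        calls.append((x_left, x_right, ratio, status))
    return calls


segments = [(1, 1000, 150, 300), (1001, 2000, 300, 300), (2001, 3000, 450, 300)]
for call in call_cnv_segments(segments, a_t=1_000_000, a_n=1_000_000):
    print(call)
```

For a heterozygous deletion or duplication in a diploid genome the expected ratios are about 0.5 and 1.5, so cut-offs of 0.75 and 1.25 sit midway between those values and the normal ratio of 1.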

3. Data analysis
1) Initialization. The SegSeq algorithm is run with the parameter w = 300, so that for a position b on a chromosome the windows on the left and right of b each contain 300 normal-sample reads, that is, N(x_L, b) = N(b, x_R) = w = 300. Among the read positions of the test sample, those that satisfy

[formula not reproduced in the source text]

are added to the candidate list, and those for which D_i(x_L, x_R) = 0, b - w < i < b + w, are removed. The p_bkp-related parameter is set to 1000 so that the initialization stage outputs 1000 candidate points. The above removal and addition steps are repeated until all p(|D(x_L, x_R)|) > p_bkp, and the candidate point set Bc = {b₁, b₂, ..., b_N} on the chromosome is output.
2) Repeated merging of adjacent segments. Starting from the candidate point set obtained by initialization, the windows on the left and right of candidate point k are (b_{k-1}, b_k - 1) and (b_k, b_{k+1}), respectively. The p_merge-related parameter is set to 10 so that the iterative merging stage outputs at most 10 false-positive segments. Sites for which the difference in copy number variation ratio between the two windows is relatively small are removed until all

[formula not reproduced in the source text]

which yields the valid breakpoints required for the final CNV analysis.
3) CNV analysis. The final breakpoints are tallied; taking the window between two breakpoints as (x_L, x_R), the CNV ratio of the test sample relative to the normal sample

[formula not reproduced in the source text]

is calculated. CNV ratios ≤ 0.75 and ≥ 1.25 are used as the detection thresholds for chromosomal segment deletion and duplication, respectively. After the microdeletion/microduplication results are obtained, a digital karyotype of the chromosomes is drawn and compared with array CGH (The Fetal DNA Chip, http://www.fetalmedicine.hk/en/Fetal_DNA_Chip.asp). Diseases are classified and genes are annotated using the DECIPHER database.
4) The CNV analysis results are output and the digital karyotype is drawn.
The copy numbers of the negative control were all normal. The CNV results for the three samples, the verification of the detection results, and the major genes involved are shown in Tables 2 and 3 below.

In the data analysis of this example, as in Example 1, the YanHuang genomic DNA sample, a known normal sample, was selected as the negative control. A data volume close to that of the test samples was taken and normalized, and its number of valid reads was counted: a_N = 68750810. The numbers of valid reads a_T of Sample 4, Sample 5 and Sample 6 were 44797212, 44086450 and 45374254, respectively. The rest of the data analysis workflow and the related parameter settings were the same as in Example 1. Finally, after the microdeletion/microduplication results were obtained, a digital karyotype of the chromosomes was drawn and the genes were annotated.
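As a small worked illustration, the scale factor a = a_T/a_N from the mathematical model can be computed from the read counts reported for this example. The read counts are taken from the text; the function name `scale_factors` is hypothetical, and the sketch assumes that a is applied per sample in this way.

```python
A_N = 68_750_810  # valid reads of the normal control (from the text)
A_T = {           # valid reads of the test samples (from the text)
    "Sample 4": 44_797_212,
    "Sample 5": 44_086_450,
    "Sample 6": 45_374_254,
}


def scale_factors(a_t_by_sample, a_n):
    """Return a = a_T / a_N for each test sample."""
    return {name: a_t / a_n for name, a_t in a_t_by_sample.items()}


factors = scale_factors(A_T, A_N)
for name, a in factors.items():
    print(f"{name}: a = {a:.4f}")
```

Dividing by a puts the test-sample read counts on the same scale as the control, so that a ratio of 1 corresponds to a normal copy number even though the libraries differ in size.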

Although specific embodiments of the present invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, the details may be modified and changed, and that all such changes fall within the scope of protection of the present invention. The full scope of the invention is given by the claims and any equivalents thereof.

Claims

A method for detecting chromosomal copy number variation, comprising the following steps:
a) randomly fragmenting genomic DNA molecules obtained from a test sample and a normal sample derived from cells, blood, or tissue to obtain DNA fragments, and sequencing the DNA fragments to obtain reads;
b) aligning the DNA sequences obtained in step a to the genomic reference sequence of the species of the sample so that each sequence is located on the reference sequence, and selecting for analysis only the reads that have a unique position on the reference sequence;
c) searching the reference sequence for sites that satisfy the following condition, namely that, when the alignment result of the test sample is compared with that of the normal sample, the copy number variation ratio differs between the two sides of the site, as follows:
i) for each site b on the reference sequence, the windows on its left and right sides are forced to contain w normal-sample reads each, that is, N(x_L, b) = N(b, x_R) = w, where N(x_L, x_R) is the number of normal-sample reads uniquely aligned to the reference sequence within the window (x_L, x_R), and w is an integer greater than 1;
ii) among these positions, sites that satisfy

[formula not reproduced in the source text]

are selected, sites for which D_i(x_L, x_R) = 0, b - w < i < b + w, are removed, and a two-sided significance test based on the normal distribution is applied to the test statistic D(x_L, x_R) to obtain p(|D(x_L, x_R)|) for each site, where D(x_L, x_R) = log(R(x_L, x)) - log(R(x, x_R)), x_L < x < x_R,

[formula not reproduced in the source text]

the numbers of normal-sample reads and test-sample reads uniquely aligned to the reference sequence are a_N and a_T, respectively, and the numbers of uniquely aligned reads within the window (x_L, x_R) are N(x_L, x_R) and T(x_L, x_R), respectively;
iii) p_bkp is set, and the above steps are repeated until all sites satisfying p(|D(x_L, x_R)|) > p_bkp are obtained, the resulting candidate site set being Bc = {b₁, b₂, ..., b_N};
d) for each site k in the candidate site set Bc = {b₁, b₂, ..., b_N} obtained on the reference sequence in step c, windows (b_{k-1}, b_k - 1) and (b_k, b_{k+1}) exist on its two sides; in each round, the site k for which

[formula not reproduced in the source text]

is largest is deleted from the candidate site set Bc, and the p value of the merged interval (b_{k-1}, b_{k+1}) is updated; with p_merge set, this step is repeated until all sites satisfy

[formula not reproduced in the source text]

to obtain the breakpoints at which chromosomal copy number variation occurs,
wherein p_bkp is the smallest p(|D(x_L, x_R)|) at which the number of candidate sites is 10, 100, 1000 or 10000, or is selected as follows: a normal sample is treated as the test sample, steps a) to c) ii) are executed, all p(|D(x_L, x_R)|) values are filtered by false discovery rate (FDR) control, and the p(|D(x_L, x_R)|) having the smallest value among the filtered sites is taken as p_bkp; and wherein the step of controlling the false discovery rate comprises sorting by significance (P value) from the lowest to obtain the ranks (r), testing from top to bottom up to the last site k that satisfies

[formula not reproduced in the source text]

(where P_k is the P value at the k-th position, r_k is the rank of the k-th position, N is the total number of sites, and α is the significance level, for example 0.01), retaining site k and all sites before it, and removing the subsequent sites.

The method of claim 1, wherein w is an integer from 100 to 1000.

The method, wherein p_merge is the largest p(|D(x_L, x_R)|) at which the number of remaining sites is 1/2, 1/10, 1/100 or 1/1000 of the number of candidate sites, or is selected as follows: a normal sample is treated as the test sample and steps a) to d) are executed so that the number of candidate sites after merging is 1/2, 1/10, 1/100 or 1/1000 of the initial number of sites, and the largest p(|D(x_L, x_R)|) is selected as p_merge.

The method according to any one of claims 1 to 3, further comprising, after obtaining the sites at which chromosomal copy number variation occurs:
e) performing CNV analysis based on the breakpoints obtained in step d, selecting as microdeletion sites those sites at which the CNV ratio of the test sample relative to the normal sample is less than or equal to a microdeletion detection threshold, and selecting as microduplication sites those sites at which the CNV ratio of the test sample relative to the normal sample is greater than or equal to a microduplication detection threshold; and
f) annotating the type of chromosomal microdeletion and/or microduplication syndrome by performing gene annotation and functional analysis of the microdeletion sites and/or microduplication sites against an existing CNV and disease database.

The method according to claim 4, wherein the microdeletion detection threshold is 0.75 and the microduplication detection threshold is 1.25.

The method according to any one of claims 1 to 5, wherein the step of randomly fragmenting the genomic DNA molecules of the sample is performed by a chemical or physical fragmentation method, including enzymatic fragmentation, nebulization, sonication, or the HydroShear method.

The method according to any one of claims 1 to 6, wherein the DNA fragment sequencing step is performed using a high-throughput sequencing technology, including Illumina/Solexa™, ABI/SOLiD™ or Roche/454™ sequencing technology.

The method according to any one of claims 1 to 5, wherein the sequencing depth of the data collected in the DNA fragment sequencing step ranges from 1× to 30×.

The method according to claim 4 or 5, further comprising the step of drawing a digital karyotype of the chromosomes from the copy number variation ratio values.