JP2015035212A

JP2015035212A - Method for finding variants from targeted sequencing panels

Info

Publication number: JP2015035212A
Application number: JP2014148832A
Authority: JP
Inventors: Ashutosh Juneja; Christian A. Le Cocq; Devendra Joshi
Original assignee: Agilent Technologies Inc
Current assignee: Agilent Technologies Inc
Priority date: 2013-07-29
Filing date: 2014-07-22
Publication date: 2015-02-19
Also published as: JP6882373B2; JP2019164830A; CN104346539B; CN104346539A

Abstract

PROBLEM TO BE SOLVED: To provide a method for finding variants from targeted sequencing panels. SOLUTION: Provided herein is a method for identifying a sequence variant in an enriched sample. In certain embodiments, this method may comprise: (a) obtaining (i) a plurality of sequence reads from a sample that has been enriched for a genomic region and (ii) a reference sequence for the genomic region; (b) assembling the sequence reads to obtain a plurality of discrete sequence assemblies that correspond to potential variants; (c) determining which of the potential variants are true and which are artifacts by examining the sequence reads that make up each of the discrete sequence assemblies; (d) optionally determining whether each of the true potential variants contains a mutation that is known to be associated with the reference sequence; and (e) outputting a report indicating whether the sample comprises a sequence variant.

Description

The present invention relates to a method for finding variants from targeted sequencing panels.

Comprehensive detail about mutations is essential for understanding, diagnosing and treating many diseases, including cancer. Numerous methods have been proposed for finding mutations in sequencing data, and these usually consist of statistically assessing the presence of a variant base relative to a reference. However, accurate determination of mutations remains a challenge when the mutations are present in only a fraction of the sample. Delineating such mutations is especially important in cancer. Such mutations matter not only for samples with low tumor content but also for capturing minor tumor subclones, in order to understand tumor heterogeneity and therefore the underlying causes of recurrence and resistance to therapy.

Enrichment techniques are attractive for studying such samples because they allow high uniformity and read depth. However, although the experimental techniques can capture the information accurately, existing analysis methods are not well suited to detecting low-frequency variants.

There are numerous other tools, both open source and commercial, that can call sequence variants. Attempts to use such tools on target enrichment data tend to be cumbersome and, because the tools do not exploit all the features of the data, also lead to false calls, misclassifications and missed calls. Furthermore, as described in the literature, each method not only has its own drawbacks, but the calls are also inconsistent between methods. Some methods attempt to detect low-frequency mutations only when a matched normal sample is supplied, while others call only SNPs and do not call insertions, deletions or multiple nucleotide polymorphisms (MNPs).

The problem is exacerbated for low-frequency variants in targeted sequencing with high read depth. Most methods work by looking at individual variant sites and assessing the statistical significance of a mutation at that position. For example, if an individual locus has a read depth of 1000, a heterozygous call is expected, on average, to be covered by 500 reads supporting the mutant allele. However, there are positions where a heterozygote is truly present but is sampled only a few times. In mosaic samples, mutations characteristic of a minor component will have a much lower frequency. Statistically, rare events do occur when sampling from such a large sample space, which makes it difficult to distinguish low-frequency calls from sequencing errors. The problem is further complicated by other artifacts of amplification and capture. In the presence of complex events and indels (insertions-deletions) within a genomic region, the reference sequence does not accurately represent the distribution of variants, which leads to further artifacts. Many existing solutions attempt to address this problem using several independent methods, but according to recent literature no solution can call these variants reliably.
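
The sampling argument above can be made concrete with a small binomial calculation. The following is a minimal sketch, not part of the disclosed method: the per-base error rate of 0.5% toward one specific alternate base, the 1% minor-component frequency and the function name `binomial_tail` are all assumptions chosen for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    # Subtract the probability of seeing fewer than k successes from 1.
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


depth = 1000              # read depth at the locus, as in the text
error_rate = 0.005        # ASSUMED per-base error rate toward one alternate base
minor_fraction = 0.01     # ASSUMED frequency of a minor-component mutation

# Expected number of reads supporting the mutant allele.
expected_het_reads = depth * 0.5               # heterozygous site
expected_minor_reads = depth * minor_fraction  # low-frequency variant

# Probability that sequencing errors alone yield at least as many
# alternate-base reads as the low-frequency variant is expected to yield.
p_errors_alone = binomial_tail(depth, round(expected_minor_reads), error_rate)

print(f"expected reads, heterozygote: {expected_het_reads:.0f}")
print(f"expected reads, 1% variant:   {expected_minor_reads:.0f}")
print(f"P(errors alone give >= that): {p_errors_alone:.3f}")
```

Under these assumed numbers the heterozygote is unmistakable, whereas the error-only probability for the 1% variant is on the order of a few percent per site. Across the many positions of a targeted panel, a purely per-site test would therefore be expected to produce a noticeable number of false calls.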

Patent Document 1: US Patent Application No. 20040241658
Patent Document 2: US Patent Application No. 20100120098
Patent Document 3: US Patent No. 5,795,782
Patent Document 4: US Patent No. 6,015,714
Patent Document 5: US Patent No. 6,627,067
Patent Document 6: US Patent No. 7,238,485
Patent Document 7: US Patent No. 7,258,838
Patent Document 8: US Patent Application No. 2006003171
Patent Document 9: US Patent Application No. 20090029477
Patent Document 10: US Patent No. 8,209,130
Patent Document 11: US Patent Application Publication No. 2011/0004413
Patent Document 12: US Patent Application Publication No. 2011/0015863
Patent Document 13: US Patent Application Publication No. 2010/0063742

Non-Patent Document 1: Hedges et al., Comparison of three targeted enrichment strategies on the SOLiD sequencing platform, PLoS One 2011 6: e18595
Non-Patent Document 2: Shearer et al., Solution-based targeted genomic enrichment for precious DNA samples, BMC Biotechnol. 2012 12: 20
Non-Patent Document 3: Chial, Proto-oncogenes to oncogenes to cancer, Nature Education 2008 1: 1
Non-Patent Document 4: Dahl et al., Multiplex amplification enabled by selective circularization of large sets of genomic DNA fragments, Nucleic Acids Res. 2005 33: e71
Non-Patent Document 5: Ausubel, F. M. et al., Short protocols in molecular biology, 3rd ed., 1995, John Wiley & Sons, Inc., New York
Non-Patent Document 6: Sambrook, J. et al., Molecular cloning: A laboratory manual, 2nd ed., 1989, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York
Non-Patent Document 7: Lage et al., Genome Res. 2003 13: 294-307
Non-Patent Document 8: Zong et al., Science 2012 338: 1622-1626
Non-Patent Document 9: Caruccio, Methods Mol. Biol. 2011 733: 241-55
Non-Patent Document 10: Kaper et al., Proc. Natl. Acad. Sci. 2013 110: 5552-7
Non-Patent Document 11: Marine et al., Appl. Environ. Microbiol. 2011 77: 8071-9
Non-Patent Document 12: Margulies et al., Nature 2005 437: 376-80
Non-Patent Document 13: Ronaghi et al., Analytical Biochemistry 1996 242: 84-9
Non-Patent Document 14: Shendure et al., Science 2005 309: 1728-32
Non-Patent Document 15: Imelfort et al., Brief Bioinform. 2009 10: 609-18
Non-Patent Document 16: Fox et al., Methods Mol Biol. 2009 553: 79-108
Non-Patent Document 17: Appleby et al., Methods Mol Biol. 2009 513: 19-39
Non-Patent Document 18: Morozova et al., Genomics 2008 92: 255-64
Non-Patent Document 19: Soni et al., Clin. Chem. 2007 53: 1996-2001
Non-Patent Document 20: Myers et al., Science 2000 287: 2196-204
Non-Patent Document 21: Batzoglou et al., Genome Research 2002 12: 177-89
Non-Patent Document 22: Dohm et al., Genome Research 2007 17: 1697-706
Non-Patent Document 23: Boisvert et al., Journal of Computational Biology 2010 17: 1519-33
Non-Patent Document 24: Moreno et al., Graph-Theoretic Concepts in Computer Science 2004 3353: 168
Non-Patent Document 25: Tarjan et al., Proc FOCS 1984 12-20
Non-Patent Document 26: Jung et al., Systematic investigation of cancer-associated somatic point mutations in SNP databases, Nature Biotechnology 2013 31: 787-789
Non-Patent Document 27: Burmer et al., Proc. Natl. Acad. Sci. 1989 86: 2403-7
Non-Patent Document 28: Almoguera et al., Cell 1988 53: 549-54
Non-Patent Document 29: Tam et al., Clin. Cancer Res. 2006 12: 1647-53

In view of the above background, an object of the invention is to provide a method for finding variants from targeted sequencing panels.

Provided herein is a method for identifying a sequence variant in an enriched sample. In certain embodiments, the method may comprise: (a) obtaining (i) a plurality of sequence reads from a sample that has been enriched for a genomic region and (ii) a reference sequence for the genomic region; (b) assembling the sequence reads to obtain a plurality of discrete sequence assemblies that correspond to potential variants; (c) determining which of the potential variants are true and which are artifacts by examining the sequence reads that make up each of the discrete sequence assemblies; (d) optionally determining whether each of the true potential variants contains a mutation that is known to be associated with the reference sequence; and (e) outputting a report indicating whether the sample contains a sequence variant.
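
The flow of steps (a) to (e) can be sketched as a toy program. This is only an illustration under strong simplifying assumptions and is not the assembly or classification procedure of the disclosure: "assembly" is reduced to grouping identical full-length reads, the true-versus-artifact decision is reduced to a hypothetical minimum read count (`min_reads`), optional step (d) is omitted, and all names (`Assembly`, `assemble`, `classify`, `report`) are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Assembly:
    sequence: str          # consensus sequence of the assembly
    supporting_reads: int  # number of reads that make up the assembly


def assemble(reads):
    """Toy stand-in for step (b): group identical reads into discrete assemblies."""
    return [Assembly(seq, n) for seq, n in Counter(reads).items()]


def classify(assemblies, reference, min_reads=2):
    """Toy stand-in for step (c): split non-reference assemblies by read support."""
    true_variants, artifacts = [], []
    for assembly in assemblies:
        if assembly.sequence == reference:
            continue  # matches the reference, so it is not a potential variant
        if assembly.supporting_reads >= min_reads:
            true_variants.append(assembly)
        else:
            artifacts.append(assembly)
    return true_variants, artifacts


def report(true_variants):
    """Toy stand-in for step (e): summarize whether the sample contains a variant."""
    return {
        "has_variant": bool(true_variants),
        "variants": [a.sequence for a in true_variants],
    }


# Step (a): hypothetical reads and reference for one enriched region.
reference = "ACGTACGT"
reads = ["ACGTACGT"] * 6 + ["ACGAACGT"] * 3 + ["ACGTACCT"]

true_variants, artifacts = classify(assemble(reads), reference)
print(report(true_variants))
```

In this toy input the sequence seen in three reads is kept as a variant and the sequence seen once is set aside as an artifact. A real implementation would align overlapping reads of different lengths and would use richer evidence from the reads than a simple count.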

Also provided is a computer system comprising a memory that contains a) a database of sequences and b) an executable program for performing the method.

A computer-readable storage medium containing instructions for performing the method is also provided.

A method of identifying a variant sequence is also provided. In certain embodiments, the method may comprise: a) inputting sequence information into a computer system comprising a program that contains instructions for performing the method; b) executing the program; and c) receiving output from the computer system.

These and other features of the present teachings are set forth herein.

Those skilled in the art will understand that the drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.

A flowchart illustrating one embodiment of the method.
A flowchart illustrating another embodiment of the method.

Definitions
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present teachings, some exemplary methods and materials are now described.

As used herein, the term “amplifying” refers to generating one or more copies of a target nucleic acid, using the target nucleic acid as a template.

As used herein, the term “single nucleotide polymorphism”, or “SNP” for short, refers to a single nucleotide position in a genomic sequence for which two or more alternative alleles are present at appreciable frequency (e.g., at least 1%) in a population.

The term “enriching”, in the context of a genome, refers to separating one or more genomic regions from the remainder of the genome to produce a product that is isolated from the remainder of the genome. Enrichment may be carried out using a variety of methods, including those described in, e.g., Non-Patent Document 1 and Non-Patent Document 2.

The term “enriched sample” refers to a sample containing fragments of genomic DNA that have been separated from the remainder of the genome. The enriched fragments may be of any length, depending on the fragmentation method used. In certain embodiments, the fragments may be 100 bp to 1 kb in length, e.g., 200 bp to 500 bp in length, although fragments outside this range may be used. Depending on how fragmentation and/or enrichment is carried out, the ends of the fragment molecules for any one enriched region may be the same or different.

As used herein, the term “genomic region” refers to a region of a genome, e.g., the genome of an animal or plant, such as that of a human, monkey, rat, fish or insect, or of a plant.

A “plurality” contains at least 2 members. In certain cases, a plurality may have at least 10, at least 100, at least 1000, at least 10,000, at least 100,000, at least 10^6, at least 10^7, at least 10^8 or at least 10^9 or more members.

As used herein, the term “sequencing” refers to a method by which the identity of at least 10 consecutive nucleotides of a polynucleotide (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) is obtained.

The term “next-generation sequencing” refers to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, Roche and others. Next-generation sequencing methods may also include nanopore sequencing methods or electronic-detection-based methods such as the Ion Torrent technology commercialized by Life Technologies.

The term “sequence read” refers to the output of a sequencing run. A sequence read is represented by a string of nucleotides. A sequence read may be accompanied by metrics of sequence quality. For example, each nucleotide of a sequence read may be associated with a confidence in the base call, i.e., in the determination of whether that nucleotide is G, A, T or C.
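
As an illustration of a per-base confidence metric, the sketch below uses the widely used Phred convention, in which a quality score Q corresponds to an error probability of 10^(-Q/10), together with the common FASTQ encoding that stores each score as an ASCII character offset by 33. The text above does not name any particular quality scale, so the choice of Phred scores, the offset and the function names are assumptions made for the example.

```python
def decode_fastq_quality(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Hypothetical four-base read with one quality character per base.
read = "GATC"
quality = "II5#"

scores = decode_fastq_quality(quality)
for base, q in zip(read, scores):
    print(f"{base}: Q{q}, error probability {phred_to_error_prob(q):.4f}")
```

Here the first two bases carry a score of 40 (about one error in 10,000 calls), the third a score of 20 (one in 100) and the last a score of 2, which marks an unreliable call.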

The term “sequence variant” refers to a nucleic acid sequence that differs from a reference sequence at at least one position. Examples of sequence variants include sequences containing SNPs and somatic mutations.

The terms “low-frequency sequence variant”, “minority species” and “minority variant” refer to a variant sequence that is present in a sample at a frequency of less than 10% (e.g., less than 5% or less than 1%) relative to the non-variant sequence. In many cases, the low-frequency sequence may be represented by a nucleotide substitution or an indel within a gene, and the non-variant sequence may be represented by the wild-type allele of the same gene. Low-frequency sequence variants may be caused by, e.g., somatic mutation.

The term “reference sequence” refers to a known sequence, e.g., a sequence from a public or in-house database, to which a candidate sequence can be compared.

As used herein, the term “assembling” refers to a multi-step process that involves aligning sequences that represent fragments of a longer nucleic acid. In certain cases, assembling may involve merging sequences to construct the sequence of a segment.

As used herein, the term “anchor” refers to a sequence, present in longer sequences, that can be used to align those longer sequences. In certain cases, anchors may be sufficient to align the longer sequences accurately.

As used herein, the term “sequence contig” refers to a contiguous sequence of nucleotides produced by assembling overlapping sequences.

As used herein, the term “associated with cancer” refers to a genomic region, e.g., a gene, that contains a mutation associated with a cancer phenotype. In some cases, the mutation is believed to play a causative role in the cancer.

Detailed Description
Before the various embodiments are described, it is to be understood that the teachings of this disclosure are not limited to the particular embodiments described, and as such can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present teachings will be limited only by the appended claims.

The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter in any way. While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications and equivalents, as will be appreciated by those skilled in the art.

Where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, and any other stated or intervening value in that stated range, is encompassed within the present disclosure.

The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided can differ from the actual publication dates, which may need to be independently confirmed.

It must be noted that, as used in this specification and the appended claims, the singular forms “a”, “an” and “the” include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for the use of such exclusive terminology as “solely”, “only” and the like in connection with the recitation of claim elements, or the use of a “negative” limitation.

As will be apparent to those skilled in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which can be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present teachings. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.

Those skilled in the art will understand that the invention is not limited in its application to the details of construction, the arrangement of components, category selections, weightings, predetermined signal limits, or the steps set forth herein or in the drawings. The invention is capable of other embodiments and of being practiced or carried out in many different ways.

As noted above, the method may be performed on sequence reads obtained from a sample that has been enriched for a particular genomic region, i.e., a sample that contains fragments of genomic DNA corresponding to a particular genomic region, where the fragments have been enriched from fragmented total genomic DNA. In some cases, the enriched genomic region may contain a gene having a mutation associated with one or more cancers, e.g., breast cancer, melanoma, renal cancer, endometrial cancer, ovarian cancer, pancreatic cancer, leukemia, colorectal cancer, prostate cancer, mesothelioma, glioma, medulloblastoma, polycythemia, lymphoma, sarcoma or multiple myeloma (see, e.g., Non-Patent Document 3). Genes of interest include, but are not limited to, PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, PDGFRA, KIT and ERBB2. In certain cases, the sample may contain fragments of genomic DNA corresponding to a plurality of different enriched genomic regions (e.g., several different regions, such as at least 2, at least 5, at least 10, at least 50, at least 100, or at least 1000 or more different, non-overlapping regions). Each region may correspond to a gene, e.g., an oncogene.

The enriched genomic region may be enriched from an initial genomic sample using any convenient method, e.g., by hybridization to oligonucleotide probes or by a ligation-based method. In some embodiments, the genomic region may be enriched by hybridization in solution to one or more biotinylated oligonucleotides (which in certain cases may be RNA oligonucleotides) that may be 20-200 nt in length, e.g., 100-150 nt in length, in order to capture the region of interest. In these embodiments, after capture, the duplexes that contain fragments of genomic DNA hybridized to the oligonucleotides may be separated from the other fragments using, e.g., streptavidin beads. In other embodiments, the region of interest may be enriched using the method described in Non-Patent Document 4. In this method, the genomic sample may be fragmented using one or more restriction enzymes and denatured. A probe library is then hybridized to the target fragments. Each probe is an oligonucleotide designed to hybridize to both ends of a target DNA restriction fragment, thereby guiding the target fragment to form a circular DNA molecule. The probes also contain a method-specific sequencing motif that is incorporated during circularization. In some cases, the probes are biotinylated and the target fragments are retrieved using streptavidin beads. The circular molecules are then closed by ligation, a highly specific reaction that ensures that only perfectly hybridized fragments are circularized. Next, the circular DNA targets are amplified. Other enrichment techniques are described in, e.g., Non-Patent Document 1 and Non-Patent Document 2.

The genomic DNA may be isolated from any organism. The organism may be a prokaryote or a eukaryote. In certain cases, the organism may be a plant, e.g., Arabidopsis or maize, or an animal, including reptiles, mammals, birds, fish and amphibians. In some cases, the initial genomic sample may be isolated from a human or from a rodent such as a mouse or rat. In exemplary embodiments, the initial genomic sample may contain genomic DNA from a mammalian cell, such as a human, mouse, rat or monkey cell. Methods for preparing genomic DNA for analysis, such as those described in Non-Patent Document 5 and Non-Patent Document 6, are routine and known in the art. The initial genomic sample may contain genomic DNA or an amplified version thereof (e.g., genomic DNA amplified by a whole genome amplification method using the methods of Non-Patent Document 7, Non-Patent Document 8 or Patent Document 1). The fragments may be made by fragmenting the genome using physical methods (e.g., sonication, nebulization or shearing), chemically, enzymatically (e.g., using a rare-cutting restriction enzyme) or using a transposable element (see, e.g., Non-Patent Document 9; Non-Patent Document 10; Non-Patent Document 11 and Patent Document 2).

The sample may be made from cultured cells or from cells of a clinical sample, e.g., a tissue biopsy, scrape or lavage, or from cells of a forensic sample (i.e., cells of a sample collected at a crime scene). In certain embodiments, the nucleic acid sample may be obtained from a biological sample such as cells, tissues, bodily fluids and stool. Bodily fluids of interest include, but are not limited to, blood, serum, plasma, saliva, mucus, sputum, cerebrospinal fluid, pleural fluid, tears, lacteal duct fluid, lymph, synovial fluid, urine, amniotic fluid and semen. In certain embodiments, the sample may be obtained from a subject, e.g., a human, and may be processed prior to use in the method. For example, nucleic acid may be extracted from the sample prior to use by known methods. In certain embodiments, the genomic sample may be from a formalin-fixed, paraffin-embedded (FFPE) sample.

Depending on which method is implemented, the initial sample (i.e., before enrichment) may contain fragments of genomic DNA that have already been ligated to adaptors. In other embodiments, the fragments may be ligated to adaptors after they have been enriched.

In some cases, samples may be pooled. In these embodiments, the fragments may carry a molecular barcode that indicates their source. In some embodiments, the DNA being analyzed may be derived from a single source (e.g., a single organism, virus, tissue, cell, subject, etc.), whereas in other embodiments the nucleic acid sample may be a pool of nucleic acids extracted from a plurality of sources (e.g., a pool of nucleic acids from a plurality of organisms, tissues, cells, subjects, etc.), where “plurality” means two or more. As such, in certain embodiments, a sample can contain nucleic acids from 2 or more sources, 3 or more sources, 5 or more sources, 10 or more sources, 50 or more sources, 100 or more sources, 500 or more sources, 1000 or more sources, 5000 or more sources, up to and including about 10,000 or more sources. Molecular barcodes may allow sequences from different sources to be distinguished after they are analyzed.
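
The use of molecular barcodes to separate pooled samples can be sketched as follows. This is a minimal illustration with assumptions that the text does not state: the barcode is taken to be a fixed-length prefix of each read, matching is exact with no tolerance for sequencing errors, and the barcode table, the `"unassigned"` label and the function name `demultiplex` are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_source, barcode_length):
    """Group reads by source using a fixed-length barcode at the start of each read."""
    by_source = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]  # the read with its barcode removed
        source = barcode_to_source.get(barcode, "unassigned")
        by_source[source].append(insert)
    return dict(by_source)


# Hypothetical barcode table for two pooled samples.
barcodes = {"AACC": "sample_1", "GGTT": "sample_2"}
pooled_reads = [
    "AACCACGTACGT",
    "GGTTACGAACGT",
    "AACCACGTACCT",
    "TTTTACGTACGT",  # barcode not in the table
]

print(demultiplex(pooled_reads, barcodes, barcode_length=4))
```

Reads whose prefix is not in the table are collected separately rather than discarded, so that the proportion of unassignable reads can be inspected.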

After the enriched sample has been obtained, it is amplified and sequenced. In certain embodiments, the fragments are amplified using primers compatible with, for example, Illumina's reversible terminator method, Roche's pyrosequencing method (454), Life Technologies' sequencing by ligation (the SOLiD platform), or Life Technologies' Ion Torrent platform. Examples of such methods are described in the following references: Non-Patent Document 12; Non-Patent Document 13; Non-Patent Document 14; Non-Patent Document 15; Non-Patent Document 16; Non-Patent Document 17; and Non-Patent Document 18. These are incorporated by reference for their general descriptions of the methods and of the particular steps of the methods, including the starting products, reagents, and final products of each step.

In one embodiment, the isolated product may be sequenced using nanopore sequencing (e.g., as described in Non-Patent Document 19 or as described by Oxford Nanopore Technologies). Nanopore sequencing is a single-molecule sequencing technology in which a single molecule of DNA is sequenced directly as it passes through a nanopore. A nanopore is a small hole on the order of 1 nanometer in diameter. Immersing a nanopore in a conducting fluid and applying a potential (voltage) across it produces a slight electrical current due to conduction of ions through the nanopore. The amount of current that flows depends on the size and shape of the nanopore. As a DNA molecule passes through the nanopore, each nucleotide of the molecule obstructs the nanopore to a different degree, changing the magnitude of the current through the nanopore by a different amount. Thus, the change in current as the DNA molecule passes through the nanopore represents a reading of the DNA sequence. Nanopore sequencing technology is disclosed in Patent Documents 3, 4, 5, 6, and 7, as well as Patent Documents 8 and 9.

In some embodiments, sequencing may generate, for each enriched region, at least 100, at least 1,000, at least 10,000, up to 100,000, or more sequence reads. The length of the sequence reads may vary greatly depending on, for example, the platform used. In some embodiments, the sequence reads may be in the range of 30 to 800 bases in length and, in some cases, may include paired-end reads.

A variety of different methods can be used to assemble the sequence reads into a plurality of discrete sequence assemblies, each of which corresponds to a potential variant. The sequence reads may be assembled using any suitable method, the basic steps of which are described in a variety of publications such as Non-Patent Documents 20, 21, 22, and 23, all of which are incorporated by reference for their disclosure of those methods. In some embodiments, for each enriched region, the sequence reads can be assembled into a single pileup that is examined to identify sequence reads that have a nucleotide variation (e.g., a substitution, insertion, or deletion) at a particular position. The sequence reads that have the nucleotide variation at that position can then be reassembled as a discrete sequence assembly. In other embodiments, the sequences may be assembled with high stringency, i.e., in a way that groups together sequence reads that have the same variation. In still other embodiments, the sequence reads can be assembled by aligning each read to a reference sequence, such as a reference genome. In certain cases, at least one assembled sequence obtained from the sequence reads aligns to the reference sequence.

In some cases, and as described in greater detail below, the reads are assembled using graph theory. In certain cases, assembling the sequence reads may include constructing a directed graph such as a de Bruijn graph. For example, constructing a de Bruijn graph from sequence reads may involve collecting overlapping k-mers from the sequence reads, including the subsequences of length k within the reads of the target region; splitting each k-mer into two overlapping (k-1)-mers; assigning a vertex, or node, of the graph to each (k-1)-mer; and assigning to each k-mer an edge that connects two nodes in the graph. Thus, each sequence read is represented as a path traversed by its k-mers in the graph, and a potential sequence contig may be represented by joining multiple such paths. The use of de Bruijn graphs for assembling reads is described in Patent Documents 10, 11, 12, and 13, which are incorporated by reference herein.
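The graph construction described above can be illustrated with a minimal sketch. The function name `de_bruijn_edges` and the toy reads are hypothetical; the only behavior taken from the text is that each k-mer becomes an edge between its two overlapping (k-1)-mers. Counting how often each k-mer occurs gives edge weights, which is one simple way to obtain a weighted directed graph.

```python
from collections import Counter

def de_bruijn_edges(reads, k):
    """Map each (prefix, suffix) pair of (k-1)-mers to its k-mer count."""
    edges = Counter()
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Nodes are the two overlapping (k-1)-mers; the k-mer is the edge.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_edges(reads, k=3)
```

Here both toy reads contribute the same four 3-mers (ACG, CGT, GTA, TAC), so every edge ends up with weight 2.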

In certain cases, the directed graph may be a directed weighted graph. In certain aspects, the directed weighted graph is constructed using k-mers of the same length. In certain embodiments, the choice of which edges to select when constructing a potential sequence at a node is made without using a cutoff value that is a function of the read coverage of the particular node or of the edges connected to that node.

A potential sequence is represented by an Eulerian path in the directed weighted graph. Thus, assembling the sequence reads may further involve finding Eulerian paths through the directed weighted graph constructed from the sequence reads. Finding an Eulerian path through the directed weighted graph may include finding a minimal de Bruijn sequence (i.e., a cyclic sequence over a given alphabet A of size k in which every possible subsequence of length n appears exactly once as a run of consecutive characters) in a language with forbidden strings. See, e.g., Non-Patent Document 24. In such cases, the minimal de Bruijn sequence may be defined by a spanning subgraph, or tree, of the directed weighted graph using the BEST (de Bruijn, van Aardenne-Ehrenfest, Smith, and Tutte) theorem, which provides a product formula for the number of Eulerian circuits in a directed graph and relates that number to the number of spanning trees rooted at a given vertex. The spanning trees of the directed graph may be determined by any convenient method (see, e.g., Non-Patent Document 25).

Representing the weighted directed graph as a de Bruijn sequence with forbidden words leads to an estimate of the maximum number of words possible in the graph, which reflects the information entropy of the directed graph. This entropy bound is also a bound on the eigenvalues of the transition matrix of the directed graph. Because the bound on the information entropy is defined by the directed graph constructed from the sequence reads, any potential variant sequence that, given the set of sequencing reads, cannot be derived from the reference or from another potential variant without exceeding the information entropy bound (i.e., where the eigenvalue of the transition matrix between the potential variant and the other variant or the reference exceeds the bound established above) is discarded.
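To make the notion of an Eulerian path through a k-mer graph concrete, the following sketch finds one using Hierholzer's algorithm. This is a standard textbook technique substituted here for illustration; it is not the spanning-tree-based approach the text describes. The names `eulerian_path` and `spell` and the toy edge list are hypothetical, and a weighted edge is assumed to be represented by listing it once per unit of weight.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return a node list that uses every edge exactly once, or None."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for src, dst in edges:
        out_edges[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    bad = [n for n, b in balance.items() if b not in (-1, 0, 1)]
    if bad or len(starts) > 1 or len(starts) != len(ends):
        return None
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    path.reverse()
    # A shorter path means some edges were unreachable.
    return path if len(path) == len(edges) + 1 else None

def spell(path):
    """Spell the sequence represented by a path of (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

edges = [("AC", "CG"), ("CG", "GT"), ("GT", "TA"), ("TA", "AT")]
contig = spell(eulerian_path(edges))
```

Each Eulerian path spells one candidate sequence; a graph with branches can admit several such paths, which is where the counting results discussed in the text become relevant.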

In certain cases, the sequence reads may be anchored to a reference sequence, as discussed in greater detail below. In some embodiments, the sequence assembly method includes demarcating, in each of the sequence reads, the regions in which sequencing is believed to be reliable, and each assembly may be anchored using the reference sequence and sequences that are unique to the reference sequence.

In this method, the sequence assembly step results in a plurality of discrete assemblies, each of which corresponds to a potential variant. Each potential variant is defined by a sequence variation found in the sequence reads, so all of the candidate sequences in a discrete assembly have the same variation. Any one enriched region may be represented by at least 2, at least 5, at least 10, at least 15, at least 20, at least 30, at least 50, at least 100, or more discrete assemblies. The number of sequence reads in each assembly may vary greatly. In some cases, the majority of the sequence reads may assemble into one or two assemblies that represent the dominant variants in the sample (depending on whether the sample from which the genomic DNA was originally obtained is homozygous or heterozygous for a germline difference, e.g., a SNP, in the enriched region). The remaining assemblies may correspond to low-frequency variant sequences (e.g., sequences from somatically mutated cells), may result from PCR errors, and/or may contain miscalled bases. In certain cases, these assemblies may be represented by fewer variant-containing sequence reads (e.g., from 10 to 1,000 or more, depending on the total number of sequence reads obtained).

In the next step of the method, the discrete assemblies are screened to determine which potential variants are "true" (i.e., correctly represent the sequence of a molecule in the sample and are not the result of a sequencing reaction or data processing error, e.g., a base miscall) and which candidate molecules are artifacts (i.e., are the result of a sequencing reaction or data processing error, e.g., a base miscall, and are not the actual sequence of a molecule in the sample). This step may be done by examining the sequence reads that make up each of the discrete sequence assemblies. In some embodiments, this step may be done by examining a variety of parameters, including read quality, the confidence of the base calls, and the confidence of the alignment (i.e., whether the sequence was mapped to the correct position). Poorly defined candidate molecules (i.e., candidate molecules defined by poor sequence reads, candidate molecules in which the sequence variation is represented by low-confidence base calls, and so on) can be dismissed, and their sequences can be merged with other alignments. In certain embodiments, given the set of sequencing reads, a likelihood is assigned to each potential variant using a hidden Markov model. In some embodiments, this step may include examining the quality of the sequence, the number of reads, the quality of the base calls, and the match to the reference sequence, and providing a score for each of the potential variants.

Once the true potential variants have been identified, the mutations defined by those variants can optionally be compared with known mutations relative to the reference sequence, where the reference sequence is a sequence from a public or in-house database. In certain embodiments, the comparison may involve determining whether each of the true potential variants contains a mutation that is known to be associated with the reference sequence. For example, the identity of thousands of cancer-associated mutations in hundreds of genes can be found in the Sanger Center's COSMIC database (see also Non-Patent Document 26). For example, if the enriched sequences include sequences of the KRAS gene, the true variants are analyzed to determine which of the sequences have any of the following mutations: 35G>A, 35G>T, 38G>A, 34G>T, 35G>C, 34G>A, 34G>C, 37G>T, 183A>C, 37G>A, 182A>T, 183A>T, 436G>A, 37G>C, 182A>G, 34_35GG>TT, 38G>C, 181C>A, 38_39GC>AT, or 38G>T. These mutations are found at high frequency in leukemia, colorectal cancer (Non-Patent Document 27), pancreatic cancer (Non-Patent Document 28), and lung cancer (Non-Patent Document 29). Similarly, if the enriched sequences include sequences of the NRAS gene, the true candidate molecules are analyzed to determine whether any of the sequences have any of the 182A>G, 181C>A, 35G>A, 182A>T, 38G>A, 34G>A, 37G>C, or 1849G>T mutations in NRAS.

In certain embodiments, the method may involve enriching one or more pairs of genomic regions, where each pair consists of a genomic region of interest (e.g., a cancer-associated gene) and a region that is adjacent to (and in some cases overlaps with) the genomic region of interest. In these embodiments, the regions of a pair may be enriched separately and combined before amplification. The sequence reads for each pair may be analyzed together. The reads from the second genomic region make it possible to average the statistics over a greater length, which gives better results. In some cases, the sequence reads from the adjacent region can be used to adjust the results, e.g., to account for any sampling bias.

The method may include outputting a report indicating whether the sample contains a particular sequence variant. The report may include an indication of whether the sample contains the mutation, together with the reference sequence and any publicly available information about the mutation. In some cases, the report may indicate the confidence that the mutation is present in the sample.

The methods described above may be employed to characterize, classify, differentiate, grade, stage, diagnose, or prognose a condition, or to predict response to therapy. In certain cases, the method may be used to investigate a cancerous condition or another mammalian disease, including but not limited to leukemia, breast cancer, prostate cancer, Alzheimer's disease, Parkinson's disease, epilepsy, amyotrophic lateral sclerosis, multiple sclerosis, stroke, autism, mental retardation, and developmental disorders. Many nucleotide polymorphisms are associated with, and are thought to be a factor in producing, these disorders. Knowing the type and location of a nucleotide polymorphism may greatly aid the diagnosis, prognosis, and understanding of various mammalian diseases. In addition, the assay conditions described herein can be employed in other nucleic acid detection applications, including, for example, the detection of infectious diseases, viral load monitoring, viral genotyping, environmental testing, food testing, forensics, epidemiology, and other areas in which detection of a specific nucleic acid sequence is used.

In some embodiments, a biological sample, e.g., a biopsy, may be obtained from a patient, and the sample may be analyzed using the method. In particular embodiments, the method may be employed to identify and/or estimate the amount of mutant copies of a genomic locus in a biological sample that contains both wild-type copies of the genomic locus and mutant copies that have a point mutation relative to the wild-type copies. In this example, the sample may contain at least 100 times (e.g., at least 1,000 times, at least 5,000 times, at least 10,000 times, at least 50,000 times, or at least 100,000 times) more wild-type copies of the genomic locus than mutant copies.

In these embodiments, the method may be employed to detect an oncogenic mutation (which may be a somatic mutation) in, e.g., PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, PDGFRA, KIT, or ERBB2, which mutation may be associated with breast cancer, melanoma, renal cancer, endometrial cancer, ovarian cancer, pancreatic cancer, leukemia, colorectal cancer, prostate cancer, mesothelioma, glioma, medulloblastoma, polycythemia, lymphoma, sarcoma, or multiple myeloma (see, e.g., Non-Patent Document 3).

Because a point mutation at a genomic locus may be directly associated with cancer, the subject method may be employed, alone or in combination with other clinical techniques (e.g., a physical examination such as a colonoscopy or mammogram) or molecular techniques (e.g., immunohistochemical analysis), to diagnose a patient with cancer or a pre-cancerous condition (e.g., adenoma). For example, results obtained from the subject assay may be combined with other information, e.g., information on the methylation status of other loci, information on rearrangements or substitutions within the same locus or at different loci, cytogenetic information, information on rearrangements, gene expression information, or information on telomere length, to provide an overall diagnosis of cancer or another disease.

In one embodiment, a sample may be collected from a patient at a first location, e.g., a clinical setting such as a hospital or a doctor's office, and forwarded to a second location, e.g., a laboratory, where it is processed and the above-described method is performed to generate a report. A "report" as described herein is an electronic or tangible document that includes report elements providing test results, which may include a Ct value, a Cp value, or the like that indicates the presence of mutant copies of the genomic locus in the sample. Once generated, the report may be forwarded to another location (which may be the same as the first location), where it may be interpreted by a health professional (e.g., a clinician, a laboratory technician, or a physician such as an oncologist, surgeon, or pathologist) as part of a clinical diagnosis.

One example of the method is set forth in the flowcharts of FIGS. 1 and 2. The first flow describes the overall setting of the method, i.e., the overall workflow. The second flow describes the flow of the method itself. Each component of the method is described in detail below. The method described below is an implementation of step B3 and relates to step B4 and to parts 6 and 7 of step C. In one implementation, the method relates to B3, i.e., the identification of variants, both single nucleotide polymorphisms and insertions and deletions. The flow of the invention is set forth and detailed in FIG. 2.

In step 1, design information is collected and used to annotate the regions of interest. The design information is used as follows. The region of interest is partitioned, and the sub-regions where baits are placed are identified within it. The regions in which sequencing can be trusted are obtained and marked. Optionally, a specified number of bases at each end of the region of interest can be included in the region, both to assess off-target matches of the reads and to provide reference anchor points for the subsequent steps. Representative reference sequence(s) are obtained as templates. If information about any known variants in a given region is to be included, those variants are also marked within the specified region. To use computational resources efficiently, the Java (registered trademark) 7 Fork-Join Framework is used to construct and analyze (in the subsequent steps) each of the non-overlapping regions concurrently. At this step, a "region" is simply a genomic template, and data are loaded only as and when needed.

In the second step, an attempt is made to find every relevant alternative extension of the molecular sequence that can be constructed with high confidence in such a region. First, the candidate reference sequence(s) are read from the supplied reference sequence. The method assumes that at least one molecular representation that is exactly identical to the reference will be obtained. If two or more such representations are obtained, all of them are constructed and evaluated as described below. Next, all alternative representations are constructed. This is done by locally reassembling the reads of the target region. For this reassembly, Applicants use a number of results from symbolic sequence theory, which make the determination of candidate molecular sequences both optimal and fast. First, a directed weighted graph is constructed from overlapping k-mers. Any candidate molecule must be represented in this graph as an Eulerian path (i.e., a path that passes through each of the edges or, in other words, a complete edge traversal). "Missed" or "unsequenced" regions are assumed to be identical to the reference, and both mates of a paired-end run are used when available. If only one mate of a pair maps with high confidence, the method looks at all of the unmapped reads and attempts to construct candidate representations using their k-mers, so that local realignment is performed implicitly.

To do this efficiently, theoretical results are used. Note that the problem of finding candidate solutions is equivalent to finding a minimal de Bruijn sequence in a language with forbidden strings, and that there is a bound relating the number of "words" of a given length to an estimate of the information entropy. This entropy bound is also a bound on the maximum eigenvalue of the transfer matrix that specifies the transitions between different k-mers (i.e., the information is the natural logarithm of the maximum eigenvalue). Thus, while constructing the graph that represents the various candidates, a count of the number of allowed words of a given length can be taken into account. In some cases, a count of the number of forbidden words (words that must not occur) may be used instead; together with the total number of possible words, this gives the desired information. Forbidden words are easily found while the graph itself is being constructed. The bound on the maximum eigenvalue can be used to speed up the likelihood computation in the next step.
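The link between forbidden words, the transfer matrix, and entropy can be illustrated with a toy example. For a 0/1 transfer matrix of allowed transitions, the number of allowed words of length n grows like the n-th power of the largest eigenvalue, so the entropy per symbol is the natural logarithm of that eigenvalue. The sketch below assumes single bases as states and a single forbidden dinucleotide "AA"; both are hypothetical choices for illustration, and the power iteration assumes an irreducible, aperiodic matrix.

```python
import math

def largest_eigenvalue(matrix, iterations=200):
    """Estimate the dominant eigenvalue of a non-negative matrix
    by power iteration (assumes the matrix is primitive)."""
    n = len(matrix)
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = max(nxt)  # vec is kept scaled so that its maximum is 1
        vec = [x / eigenvalue for x in nxt]
    return eigenvalue

bases = "ACGT"
forbidden = {("A", "A")}  # hypothetical forbidden word "AA"
# Entry [a][b] is 1 when base a may be followed by base b.
transfer = [[0 if (a, b) in forbidden else 1 for b in bases] for a in bases]

lam = largest_eigenvalue(transfer)
entropy = math.log(lam)  # nats per symbol
```

With no forbidden words the eigenvalue would be 4 and the entropy ln 4; forbidding "AA" lowers the eigenvalue to (3 + sqrt(21)) / 2, about 3.79, so fewer words are possible.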

The second result used relies on the BEST theorem, i.e., the theorem of de Bruijn, van Aardenne-Ehrenfest, Smith, and Tutte. This theorem relates the possible Eulerian paths to the number of spanning trees of the graph. Because Applicants' goal is to construct Eulerian paths, the theorem converts this problem into the problem of finding spanning trees, which is a well-known problem for which fast solutions are available. Spanning trees can be found using Vishkin's formulation.
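For reference, the BEST theorem in its usual textbook form is given below; the notation is not taken from the source and is introduced here for illustration only. For a connected Eulerian directed graph G with vertex set V, ec(G) denotes the number of Eulerian circuits, t_w(G) the number of spanning arborescences rooted at an arbitrary fixed vertex w, and deg(v) the out-degree (equal to the in-degree) of vertex v.

```latex
\operatorname{ec}(G) \;=\; t_w(G) \prod_{v \in V} \bigl(\deg(v) - 1\bigr)!
```

Because the product term depends only on the vertex degrees, enumerating or counting Eulerian circuits reduces to working with rooted spanning trees, which is the reduction the text relies on.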

Because the graph can be unbalanced, the above results, although they speed up the computation considerably, can cause some paths to be missed, particularly in situations with many multiply matched reads or with structural variants and copy number variations. To guard against such corner cases, paths whose incoming and outgoing weights differ markedly from the average are counted. If such paths are found, the Eulerian paths are searched exhaustively for the subsequences of k-mers represented in those paths.
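One possible reading of this check is sketched below: nodes whose total incoming or outgoing edge weight deviates from the mean edge weight by more than a chosen fold change are flagged for exhaustive search. The function name `flag_unbalanced`, the `fold` parameter, and the fold-change criterion itself are assumptions made for illustration; the text does not specify how "markedly different from the average" is measured.

```python
from collections import defaultdict
from statistics import mean

def flag_unbalanced(weighted_edges, fold=2.0):
    """Flag nodes whose in- or out-weight is far from the mean edge weight.

    weighted_edges maps (src, dst) node pairs to integer weights.
    The fold-change rule is a hypothetical criterion.
    """
    in_w, out_w = defaultdict(int), defaultdict(int)
    for (src, dst), weight in weighted_edges.items():
        out_w[src] += weight
        in_w[dst] += weight
    mean_weight = mean(list(weighted_edges.values()))
    flagged = set()
    for node in set(in_w) | set(out_w):
        for total in (in_w[node], out_w[node]):
            # A total of zero only means the node is a source or a sink.
            if total and (total > fold * mean_weight
                          or total < mean_weight / fold):
                flagged.add(node)
    return flagged

weights = {("AC", "CG"): 10, ("CG", "GT"): 10,
           ("GT", "TA"): 40, ("TA", "AC"): 10}
suspects = flag_unbalanced(weights)
```

In this toy graph the edge GT to TA carries four times the weight of the others, as might happen in a repeated region, so both of its endpoints are flagged.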

After the candidate molecular representations have been found, a likelihood is assigned to each of them using a Markov model. Here, the reads (read pairs) are examined to assess which candidate molecules are most plausible given the data. The reads used for this assessment are first filtered according to the specified filtering criteria for mapping quality. The transitions between candidates are represented as a transfer matrix, and the transitions are optimized based on the read data for the region. During this process, the eigenvalue bound described above is used to quickly terminate any iteration that would lead to a solution inconsistent with the bound. Apart from this speed-up, the emission and transition probabilities are determined by standard Viterbi iterations. A specified number of the highest-scoring candidates can then be examined.

After this step, variant calls can be made by examining the various alleles present in the candidate solutions. Alleles found to be supported only by bases that lie too close to the ends of reads ("close" being defined by a parameter) are filtered out. In addition, if a variant candidate lies at the end of an amplicon fragment and only one amplicon covers the locus, the candidate is filtered out. If two or more amplicons cover the locus, such a candidate is retained only if it is supported by two or more amplicons.
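The filtering rules above can be sketched as a single predicate. The function name `passes_filters` and all parameter names are hypothetical, and the read-end rule is implemented under the assumption that a candidate is kept when at least one supporting base lies far enough from both ends of its read.

```python
def passes_filters(offsets, read_lengths, min_end_distance,
                   at_amplicon_end, covering_amplicons, supporting_amplicons):
    """Return True if a variant candidate survives the filters.

    offsets[i] is the 0-based position of the supporting base within a
    read of length read_lengths[i] (hypothetical representation).
    """
    # Rule 1: require support from at least one base away from read ends.
    interior_support = any(
        min(offset, length - 1 - offset) >= min_end_distance
        for offset, length in zip(offsets, read_lengths)
    )
    if not interior_support:
        return False
    # Rules 2 and 3: candidates at amplicon ends need two or more
    # covering amplicons and support from two or more of them.
    if at_amplicon_end:
        if covering_amplicons <= 1:
            return False
        return supporting_amplicons >= 2
    return True

kept = passes_filters([50, 2], [100, 100], 5, False, 1, 1)
dropped = passes_filters([1, 98], [100, 100], 5, False, 3, 3)
```

The first call keeps the candidate because one supporting base sits mid-read; the second drops it because both supporting bases lie within five bases of a read end.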

Each variant is then scored. In other words, given a set of reads {R} and a set of genotypes {G}, we want to find P({G}|{R}). To do so, Bayes' theorem is used: P({R}|{G}) and P({G}) are obtained and combined to give the desired result.

That is, given an underlying genotype, the probability of obtaining the set of reads is proportional to the probability of sampling it from the set of observations of the underlying genotype, adjusted by the probability that our reads are correct. The term under the product, P(b'|b), is the probability that a given alternate call at a given locus is correct. Because the qualities associated with a given read give the probability that a particular base in that read is correct and that poorly mapped reads have been excluded, the allele quality is assumed to be the minimum of the median base quality and the median mapping quality. If desired, base alignment quality (BAQ) can be used for this assessment. P(b'|b) is 1 - q if b ∈ {G}, and q otherwise.
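
The per-base term can be illustrated as follows. This is a sketch under assumptions: qualities are taken to be Phred-scaled, q is taken to be the error probability derived from the allele quality, and all function names are hypothetical.

```python
import math
from statistics import median

def phred_to_error(quality):
    """Convert a Phred-scaled quality to an error probability (assumed scale)."""
    return 10 ** (-quality / 10)

def allele_error(base_quals, mapping_quals):
    """Error probability q from min(median base quality, median mapping quality)."""
    return phred_to_error(min(median(base_quals), median(mapping_quals)))

def p_call(observed_base, genotype, q):
    """P(b'|b): 1 - q if the observed base belongs to the genotype, q otherwise."""
    return 1 - q if observed_base in genotype else q

def log_likelihood(observed_bases, genotype, q):
    """Sum of log P(b'|b) over the observed bases (the log of the product term)."""
    return sum(math.log(p_call(b, genotype, q)) for b in observed_bases)

# Median base quality 30 and median mapping quality 60 give allele quality 30.
q = allele_error([30, 30, 20], [60, 60, 60])
ll = log_likelihood("AAT", {"A"}, q)
```

Taking the minimum of the two medians means that a well-sequenced base in a poorly mapped read, or the reverse, is never trusted more than its weaker evidence allows.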

P({G}), the probability of observing G1...Gn, has already been obtained from the candidate molecule likelihoods. To call a variant at a locus, we look for sites where the candidate region contains two or more alleles and P({G(i)}|{R(i)}) is significant. Because the probabilities of the various candidates that differ from the reference are already known, the probability of a variant call is given by

P(K>1 | R1, ..., Rn) = 1 - P(K=1 | R1, ..., Rn)
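
Putting the pieces together, the Bayesian combination and the variant call probability can be sketched as below. The dictionary layout, the function names, and the reading of K = 1 as "only the reference allele is present" are assumptions made for illustration. The numbers are invented.

```python
def posteriors(priors, likelihoods):
    """Combine P({G}) and P({R}|{G}) into P({G}|{R}) by Bayes' theorem."""
    joint = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}

def variant_call_probability(posterior, reference_only_key):
    """P(K > 1 | R) = 1 - P(K = 1 | R), with K = 1 meaning reference allele only."""
    return 1.0 - posterior[reference_only_key]

# Hypothetical candidate set: reference only versus reference plus one alternate.
priors = {"ref": 0.9, "ref+alt": 0.1}
likelihoods = {"ref": 0.01, "ref+alt": 0.5}
post = posteriors(priors, likelihoods)
p_variant = variant_call_probability(post, "ref")
```

Here the reads are much better explained by the two-allele candidate, so the call probability is high even though the prior favours the reference.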

The method may be used by clinical researchers looking for a fast, accurate, and easy-to-use analysis tool for target enrichment panels. By providing an end-to-end data analysis solution, from alignment to variant classification, the software reduces the time to results from days to hours. The method is advantageous over conventional algorithms for three reasons: it has a much lower false-negative rate in mutation calling without affecting the false-positive rate for the majority of test samples; it can detect variants with low allele frequency, even in complex cases involving multiple alleles, without significantly increasing the false-positive rate; and its efficiency and speed do not degrade significantly when detecting low-frequency variants.

The method described above can be implemented on a computer. In certain embodiments, a general-purpose computer can be configured into a functional arrangement for the methods and programs disclosed herein. The hardware architecture of such a computer is well known to those skilled in the art and can comprise hardware components including one or more processors (CPU), random access memory (RAM), read-only memory (ROM), and an internal or external data storage medium (e.g., a hard disk drive). The computer system can also comprise one or more graphics boards for processing and outputting graphical information to display means. These components can be suitably interconnected via a bus inside the computer. The computer can further comprise suitable interfaces for communicating with general-purpose external components such as a monitor, keyboard, mouse, and network. In some embodiments, the computer can be capable of parallel processing, or can be part of a network configured for parallel or distributed computing, to increase the processing power available to the present methods and programs. In some embodiments, the program code read from the storage medium can be written into a memory provided on an expansion board inserted in the computer, or in an expansion unit connected to the computer, and a CPU or the like provided on the expansion board or in the expansion unit can perform part or all of the operations according to the instructions of the program code, so as to accomplish the functions described below. In other embodiments, the method can be performed using a cloud computing system. In these embodiments, the data files and the programming can be exported to a cloud computer, which runs the program and returns an output to the user.

In certain embodiments, the system includes a computer comprising: a) a central processing unit; b) a main non-volatile storage drive, which can include one or more hard drives controlled by a disk controller, for storing software and data; c) a system memory, e.g., high-speed random access memory (RAM), for storing system control programs, data, and application programs, including programs and data loaded from the non-volatile storage drive (the system memory can also include read-only memory (ROM)); d) a user interface including one or more input and output devices, such as a mouse, a keypad, and a display; e) an optional network interface card for connecting to any wired or wireless communication network, e.g., a printer; and f) an internal bus for interconnecting the aforementioned elements of the system.

The memory of a computer system can be any device that can store information for retrieval by a processor, and can include magnetic or optical devices, or solid-state memory devices (such as volatile or non-volatile RAM). A memory or memory unit can have more than one physical memory device of the same or different types (for example, a memory can have multiple memory devices such as multiple drives, cards, or multiple solid-state memory devices, or some combination of these). With respect to computer-readable media, "permanent memory" refers to memory that is permanent. Permanent memory is not erased when the supply of power to a computer or processor is terminated. A computer hard drive, ROM (i.e., ROM not used as virtual memory), CD-ROM, floppy disk, and DVD are all examples of permanent memory. Random access memory (RAM) is an example of non-permanent (i.e., volatile) memory. A file in permanent memory can be editable and rewritable.

Operation of the computer is controlled primarily by the operating system, which is executed by the central processing unit. The operating system can be stored in the system memory. In some embodiments, the operating system includes a file system. In addition to the operating system, one possible implementation of the system memory includes a variety of programming files and data files for implementing the method described below. In certain cases, the programming can contain a program, which can be composed of various modules, and a user interface module that permits a user to manually select or change the inputs to, or the parameters used by, the program. The data files can include various inputs for the program.

In certain embodiments, instructions in accordance with the method described herein can be coded onto a computer-readable medium in the form of "programming". The term "computer-readable medium" as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to a computer for execution and/or processing. Examples of storage media include a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, non-volatile memory card, ROM, DVD-ROM, Blu-ray disc, solid-state disk, and network attached storage (NAS), whether or not such devices are internal or external to the computer. A file containing information can be "stored" on a computer-readable medium, where "storing" means recording information such that it is accessible and retrievable at a later date by a computer.

The computer-implemented method described herein can be executed using programs that can be written in one or more of any number of computer programming languages. Such languages include, for example, Java (Sun Microsystems, Inc., Santa Clara, CA), Visual Basic (Microsoft Corp., Redmond, WA), and C++ (AT&T Corp., Bedminster, NJ), as well as many others.

In any embodiment, data can be forwarded to a "remote location", where "remote location" means a location other than the location at which the program is executed. For example, a remote location could be another location in the same city (e.g., an office or laboratory), another location in a different city, another location in a different state, or another location in a different country. As such, when one item is indicated as being "remote" from another, the two items may be in the same room but separated, or at least in different rooms or different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart. "Communicating" information refers to transmitting the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network). "Forwarding" an item refers to any means of getting that item from one location to the next, whether by physically transporting the item or otherwise (where that is possible), and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data. Examples of communication media include radio or infrared transmission channels, network connections to another computer or networked device, and the Internet, including e-mail transmissions and information recorded on websites and the like.

Some embodiments include implementation on a single computer, across a network of computers, or across networks of networks of computers, for example, across a network cloud, across a local area network, or on hand-held computer devices. Preferred embodiments include implementation in one or more computer programs that perform one or more of the steps described herein. Preferred embodiments of the invention include the various data structures, categories, and modifiers described herein, encoded on one or more computer-readable media and transmissible over one or more communication networks.

Software, web, Internet, cloud, or other storage and computer network implementations of the present invention could be accomplished with standard programming techniques for performing the various database searching, modifying, correlating, comparing, deciding, signaling, scoring, monitoring, or ranking steps.

All publications and patent applications cited in this specification are herein incorporated by reference as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention.

Cross-Reference to Related Applications
This application claims the benefit of U.S. Provisional Patent Application No. 61/859,625, filed July 29, 2013, which is incorporated herein by reference in its entirety.

Claims

A method for identifying a sequence variation comprising:
(a) obtaining (i) a plurality of sequence reads from a sample that has been enriched for a genomic region and (ii) a reference sequence for the genomic region;
(b) assembling the sequence reads to obtain a plurality of discrete sequence assemblies, each corresponding to a potential mutation;
(c) determining which of the potential mutations are true and which are artifacts by examining the sequence reads that make up each of the discrete sequence assemblies;
(d) optionally, determining whether each of the true potential mutations contains a mutation known to be associated with the reference sequence; and
(e) outputting a report indicating whether the sample comprises a sequence variation.

The method of claim 1, wherein the genomic region is associated with cancer.

The method of claim 2, wherein the genomic region comprises at least a portion of at least one of the following genes: PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, FGDFRA, KIT and ERBB2.

The method of claim 1, wherein the sequence variant is a low frequency sequence variation corresponding to a somatic mutation.

The method of claim 1, wherein the genomic region is a region of the human genome.

The method of claim 1, wherein the enriched genomic region is enriched from total DNA obtained from a clinical specimen.

The method of claim 6, wherein the clinical specimen is a biopsy.

The method of claim 1, wherein the report provides an indication of whether the specimen contains a mutation and public information available about the reference sequence.

The method of claim 1, wherein the assembling comprises fractionating each of the regions of the sequence read where the sequence is likely to be reliable.

The method of claim 1, wherein the assembling uses graph theory.

The method of claim 10, wherein the assembling is performed using a minimal de-Bruijn array.

The method of claim 10, wherein the assembling is performed using the BEST theorem.

The method of claim 1, wherein the determining comprises examining sequence quality, number of reads, base call quality, and match to the reference sequence, and providing a score for each of the potential mutations.

The method of claim 1, wherein the reference sequence is known in the art and is annotated to identify mutations for which a sequencing read is appropriate.

The method of claim 1, wherein the assembling uses a sequence from the reference sequence and a sequence unique to the reference sequence to anchor the assembly.

The method of claim 1, wherein the method provides a probability of a mutant call.

A computer system comprising a memory, the memory comprising:
(a) a database of sequence reads from a sample enriched for a genomic region;
(b) a reference sequence for the genomic region; and
(c) a program executable to perform the method of claim 1.

A computer readable storage medium comprising instructions for performing the method of claim 1.

A method for identifying a mutant sequence, comprising:
a) inputting sequence information into a computer system comprising a program comprising instructions for performing the method of claim 1;
b) executing the program; and
c) receiving output from the computer system.