JP7264534B2

JP7264534B2 - Determination of base modifications of nucleic acids

Info

Publication number: JP7264534B2
Application number: JP2021514525A
Authority: JP
Inventors: Yuk-Ming Dennis Lo; Rossa Wai Kwun Chiu; Kwan Chee Chan; Peiyong Jiang; Suk Hang Cheng; Wenlei Peng; On Yee Tse
Original assignee: Chinese University of Hong Kong CUHK
Current assignee: Chinese University of Hong Kong CUHK
Priority date: 2019-08-16
Filing date: 2020-08-17
Publication date: 2023-04-25
Anticipated expiration: 2040-08-17
Also published as: IL302199B1; CN112752853B; JP2023098964A; TW202214872A; TW202321463A; MX2021000931A; GB2619466A; GB2600649A; ZA202100887B; GB202306697D0; CN116855595A; TWI783820B; CN116694745A; JP7462993B2; NZ786185A; JP7369492B2; AU2022202791C1; EP3827092A4; GB2620315A; CN116875669A

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/051,210, filed July 13, 2020; U.S. Provisional Patent Application No. 63/019,790, filed May 4, 2020; U.S. Provisional Patent Application No. 62/991,891, filed March 19, 2020; U.S. Provisional Patent Application No. 62/970,586, filed February 5, 2020; and U.S. Provisional Patent Application No. 62/887,987, filed August 16, 2019, each entitled "Determination of Base Modifications of Nucleic Acids." The contents of all of these applications are incorporated herein by reference for all purposes.

The presence of base modifications in nucleic acids varies across organisms, including viruses, bacteria, plants, fungi, nematodes, insects, and vertebrates (e.g., humans). The most common base modification is the addition of a methyl group to different DNA bases at different positions, known as methylation. Methylation has been found on cytosines, adenines, thymines, and guanines, for example as 5mC (5-methylcytosine), 4mC (N4-methylcytosine), 5hmC (5-hydroxymethylcytosine), 5fC (5-formylcytosine), 5caC (5-carboxylcytosine), 1mA (N1-methyladenine), 3mA (N3-methyladenine), 7mA (N7-methyladenine), 3mC (N3-methylcytosine), 2mG (N2-methylguanine), 6mG (O6-methylguanine), 7mG (N7-methylguanine), 3mT (N3-methylthymine), and 4mT (O4-methylthymine). In vertebrate genomes, 5mC is the most common type of base methylation, followed by guanine methylation (i.e., in the CpG context).

DNA methylation is essential for mammalian development and plays notable roles in gene expression and silencing, embryonic development, transcription, chromatin structure, X-chromosome inactivation, protection against the activity of repetitive elements, maintenance of genomic stability during mitosis, and the regulation of parent-of-origin genomic imprinting.

DNA methylation plays many important roles in the silencing of promoters and enhancers in a coordinated manner (Robertson, 2005; Smith and Meissner, 2013). Many human diseases have been found to be associated with aberrations of DNA methylation, including but not limited to the process of carcinogenesis, imprinting disorders (e.g., Beckwith-Wiedemann syndrome and Prader-Willi syndrome), repeat-instability diseases (e.g., fragile X syndrome), autoimmune disorders (e.g., systemic lupus erythematosus), metabolic disorders (e.g., type I and type II diabetes), neurological disorders, and aging.

Accurate measurement of methylomic modifications of DNA molecules has many clinical implications. One widely used method for measuring DNA methylation is bisulfite sequencing (BS-seq) (Lister et al., 2009; Frommer et al., 1992). In this approach, a DNA sample is first treated with bisulfite, which converts unmethylated cytosine (i.e., C) to uracil. In contrast, methylated cytosine remains unchanged. The bisulfite-modified DNA is then analyzed by DNA sequencing. In another approach, following bisulfite conversion, the modified DNA is subjected to polymerase chain reaction (PCR) amplification using primers that can distinguish bisulfite-converted DNA of different methylation profiles (Herman et al., 1996). This latter approach is called methylation-specific PCR.
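The bisulfite readout described above can be illustrated in silico. The following is a minimal sketch, not part of the patent: the function names and the toy sequence are invented for illustration, and it assumes a single strand in which unmethylated C is read as T after conversion to uracil and subsequent amplification or sequencing, while methylated C is still read as C.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the readout of one bisulfite-treated strand.

    Unmethylated C is converted to uracil and is read as T;
    methylated C is protected and is still read as C.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, converted):
    """Infer the status of every reference C by comparing it with the read."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference.upper(), converted)):
        if ref == "C":
            calls[i] = (obs == "C")  # True means the C was protected (methylated)
    return calls


# Toy example (hypothetical sequence): only the C at index 1 is methylated.
reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions=[1])
calls = call_methylation(reference, read)
print(read)   # ACGTTTGA
print(calls)  # {1: True, 4: False, 5: False}
```

The sketch ignores incomplete conversion, sequencing errors, and the complementary strand, all of which a real bisulfite pipeline has to handle.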

One drawback of such bisulfite-based approaches is that the bisulfite conversion step has been reported to severely degrade the majority of the treated DNA (Grunau, 2001). Another drawback is that the bisulfite conversion step introduces a strong CG bias (Olova et al., 2018), which typically reduces the signal-to-noise ratio for DNA mixtures with heterogeneous methylation states. Furthermore, bisulfite sequencing cannot sequence long DNA molecules, because the DNA is degraded during bisulfite treatment. There is therefore a need to determine base modifications of nucleic acids without prior chemical treatment (e.g., bisulfite conversion) and without nucleic acid amplification (e.g., PCR).

In one embodiment, the inventors have developed a new method that enables the determination of base modifications, such as 5mC, in nucleic acids without pretreatment of the template DNA, such as enzymatic and/or chemical conversion or protein and/or antibody binding. Although such pretreatment of the template DNA is not required for determining base modifications, in the examples shown certain pretreatments (e.g., restriction enzyme digestion) may serve to enhance aspects of the invention (e.g., by enabling enrichment of CpG sites for analysis). Embodiments of the present disclosure can be used to detect different types of base modifications, including but not limited to 4mC, 5hmC, 5fC, 5caC, 1mA, 3mA, 7mA, 3mC, 2mG, 6mG, 7mG, 3mT, and 4mT. Such embodiments can make use of features derived from sequencing, such as kinetic features that are affected by the various base modifications, as well as the identity of the nucleotides in a window around the target position whose methylation status is to be determined.

Embodiments of the present invention can be used for, but are not limited to, single-molecule sequencing. One type of single-molecule sequencing is single-molecule real-time sequencing, in which the progress of sequencing a single DNA molecule is monitored in real time. One type of single-molecule real-time sequencing is that commercialized by Pacific Biosciences using its Single Molecule, Real-Time (SMRT) system. Methods can use the pulse width of the signal from a sequenced base, the interpulse duration (IPD) of the base, and the identity of the base to detect a modification at that base or at a neighboring base. Another single-molecule system is one based on nanopore sequencing. One example of a nanopore sequencing system is that commercialized by Oxford Nanopore Technologies.
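To make the combination of pulse width, interpulse duration, and base identity more concrete, the sketch below assembles these three signals into a per-position feature table for a window around a target base. It is an illustrative assumption, not the patent's implementation: the function name `window_features`, the row layout, the window sizes, and all numeric values are invented, and "upstream" and "downstream" simply mean lower and higher read coordinates here.

```python
BASES = "ACGT"


def window_features(bases, ipd, pw, target, upstream=3, downstream=3):
    """Build one row per window position: [offset, IPD, PW, one-hot A, C, G, T].

    Returns None when the window would run past either end of the read.
    """
    if not (len(bases) == len(ipd) == len(pw)):
        raise ValueError("bases, ipd and pw must have equal length")
    start, end = target - upstream, target + downstream
    if start < 0 or end >= len(bases):
        return None
    rows = []
    for i in range(start, end + 1):
        onehot = [1 if bases[i] == b else 0 for b in BASES]
        rows.append([i - target, ipd[i], pw[i]] + onehot)
    return rows


# Toy read (all values invented): the C at index 3 is the target of a CpG site.
bases = "TTACGGA"
ipd = [0.5, 0.6, 0.4, 1.8, 0.7, 0.5, 0.6]  # interpulse durations, arbitrary units
pw = [0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.2]   # pulse widths, arbitrary units

features = window_features(bases, ipd, pw, target=3, upstream=2, downstream=2)
for row in features:
    print(row)
```

A table of this shape could serve as input to a statistical or machine-learning classifier; how the actual embodiments encode and normalize these signals is described in the detailed description rather than here.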

The methods developed by the inventors serve as tools for detecting base modifications in biological samples and assessing the methylation profiles of those samples for various purposes, including but not limited to research and diagnostic purposes. The detected methylation profiles can be used for different analyses. Methylation profiles can be used to detect the origin of DNA (e.g., maternal or fetal, tissue, bacterial, or DNA obtained from tumor cells enriched from the blood of a cancer patient). Detection of aberrant methylation profiles in tissues aids in identifying developmental disorders in individuals and in identifying and prognosticating tumors or malignancies.

Embodiments of the present invention may include analyzing the relative methylation levels of the haplotypes of an organism. An imbalance in methylation levels between the two haplotypes can be used to determine a classification of a disorder. A larger imbalance may indicate the presence of a disorder or a more severe disorder. The disorder may include cancer.
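The haplotype comparison can be sketched as a simple calculation. The example below is a hedged illustration only: it assumes the methylation level of a haplotype is the fraction of its CpG sites called methylated, measures the imbalance as the absolute difference between the two haplotypes, and uses a cutoff of 0.3 that is purely hypothetical and not taken from the patent.

```python
def methylation_level(calls):
    """Fraction of CpG sites called methylated (calls: iterable of booleans)."""
    calls = list(calls)
    if not calls:
        raise ValueError("no CpG sites to evaluate")
    return sum(calls) / len(calls)


def haplotype_imbalance(hap1_calls, hap2_calls):
    """Absolute difference in methylation level between two haplotypes."""
    return abs(methylation_level(hap1_calls) - methylation_level(hap2_calls))


def classify(imbalance, threshold=0.3):
    """Flag an imbalance above a cutoff; the default 0.3 is a made-up value."""
    if imbalance > threshold:
        return "imbalance above cutoff"
    return "no imbalance detected"


# Toy per-site calls for the two haplotypes of one region (invented data).
hap1 = [True, True, True, False]    # level 0.75
hap2 = [False, False, True, False]  # level 0.25

imbalance = haplotype_imbalance(hap1, hap2)
print(imbalance)            # 0.5
print(classify(imbalance))  # imbalance above cutoff
```

In practice a cutoff would be derived from reference samples, and the comparison might be a ratio or a statistical test instead of a plain difference.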

Single-molecule methylation patterns can be used to identify chimeric and hybrid DNA. Chimeric and hybrid molecules may include sequences from two different genes, chromosomes, organelles (e.g., mitochondria, nuclei, chloroplasts), organisms (e.g., mammals, bacteria, viruses), and/or species. Detecting the junctions of chimeric or hybrid DNA molecules may allow the detection of gene fusions in various disorders or diseases, including cancer and prenatal or congenital disorders.

A better understanding of the nature and advantages of embodiments of the present invention may be gained with reference to the following detailed description and the accompanying drawings.

Shows SMRT sequencing of molecules carrying base modifications according to embodiments of the present invention.
Shows SMRT sequencing of molecules with methylated and unmethylated CpG sites according to embodiments of the present invention.
Shows interpulse durations and pulse widths according to embodiments of the present invention.
Shows an example of a measurement window on the Watson strand of DNA for detecting base modifications according to embodiments of the present invention.
Shows an example of a measurement window on the Crick strand of DNA for detecting base modifications according to embodiments of the present invention.
Shows an example of a measurement window that combines data from the Watson strand of DNA and its complementary Crick strand for detecting any base modification according to embodiments of the present invention.
Shows an example of a measurement window that combines data from the Watson strand of DNA and the Crick strand of a nearby region for detecting any base modification according to embodiments of the present invention.
Shows examples of measurement windows on the Watson strand, the Crick strand, and both strands for determining the methylation status of a CpG site according to embodiments of the present invention.
Shows a general procedure for constructing an analytical, computational, mathematical, or statistical model for classifying base modifications according to embodiments of the present invention.
Shows a general procedure for classifying base modifications according to embodiments of the present invention.
Shows a general procedure for constructing an analytical, computational, mathematical, or statistical model for classifying the methylation status of CpG sites using samples with known methylation status on the Watson strand according to embodiments of the present invention.
Shows a general procedure for classifying the methylation status of the Watson strand of an unknown sample according to embodiments of the present invention.
Shows a general procedure for constructing an analytical, computational, mathematical, or statistical model for classifying the methylation status of CpG sites using samples with known methylation status on the Crick strand according to embodiments of the present invention.
Shows a general procedure for classifying the methylation status of the Crick strand of an unknown sample according to embodiments of the present invention.
Shows a general procedure for constructing a statistical model for classifying the methylation status of CpG sites using samples with known methylation status from both the Watson and Crick strands according to embodiments of the present invention.
Shows a general procedure for classifying the methylation status of an unknown sample from the Watson and Crick strands according to embodiments of the present invention.
Shows the performance on the training and test datasets for determining methylation according to embodiments of the present invention.
Same as above.
Shows the performance on the training and test datasets for determining methylation according to embodiments of the present invention.
Same as above.
Shows the performance on the training and test datasets at different sequencing depths for determining methylation according to embodiments of the present invention.
Same as above.
Shows the performance on the training and test datasets for different strands for determining methylation according to embodiments of the present invention.
Same as above.
Shows the performance on the training and test datasets for different measurement windows for determining methylation according to embodiments of the present invention.
Same as above.
Shows the performance on the training and test datasets for different measurement windows that use only downstream bases to determine methylation according to embodiments of the present invention.
Same as above.
Shows the performance on the training and test datasets for different measurement windows that use only upstream bases to determine methylation according to embodiments of the present invention.
Same as above.
Shows the performance of methylation analysis using kinetic patterns associated with downstream and upstream bases with asymmetric flanking sizes in the training dataset according to embodiments of the present invention.
Shows the performance of methylation analysis using kinetic patterns associated with downstream and upstream bases with asymmetric flanking sizes in the test dataset according to embodiments of the present invention.
Shows the relative importance of features for classifying the methylation status of CpG sites according to embodiments of the present invention.
Shows the performance of motif-based IPD analysis for methylation detection without using the pulse width signal according to embodiments of the present invention.
Is a graph of a principal component analysis technique using 2 nt upstream and 6 nt downstream of the cytosine subjected to methylation analysis according to embodiments of the present invention.
Is a graph comparing the performance of a method using principal component analysis with that of a method using a convolutional neural network according to embodiments of the present invention.
Shows the performance on the training and test datasets of different analytical, computational, mathematical, or statistical models that use only upstream bases to determine methylation according to embodiments of the present invention.
Same as above.
Shows an example of one approach for generating molecules with unmethylated adenines by whole-genome amplification according to embodiments of the present invention.
Shows an example of one approach for generating molecules with methylated adenines by whole-genome amplification according to embodiments of the present invention.
Shows interpulse duration (IPD) values across sequenced A bases in template DNA of the Watson strand, compared between the unmethylated and methylated datasets, according to embodiments of the present invention.
Same as above.
Shows a receiver operating characteristic curve for determining methylation on the Watson strand according to embodiments of the present invention.
Shows interpulse duration (IPD) values across sequenced A bases in template DNA of the Crick strand, compared between the unmethylated and methylated datasets, according to embodiments of the present invention.
Same as above.
Shows a receiver operating characteristic curve for determining methylation on the Crick strand according to embodiments of the present invention.
Shows the determination of 6mA on the Watson strand according to embodiments of the present invention.
Shows the determination of 6mA on the Crick strand according to embodiments of the present invention.
Shows the determined probabilities of being methylated for sequenced A bases of the Watson strand, compared between the uA and mA datasets, using a measurement-window-based convolutional neural network (CNN) model according to embodiments of the present invention.
Same as above.
Shows an ROC curve for detecting 6mA using the measurement-window-based CNN model for sequenced A bases of the Watson strand according to embodiments of the present invention.
Shows a performance comparison between IPD-metric-based 6mA detection and measurement-window-based 6mA detection according to embodiments of the present invention.
Shows the determined probabilities of being methylated for sequenced A bases of the Crick strand, compared between the uA and mA datasets, using the measurement-window-based CNN model according to embodiments of the present invention.
Same as above.
Shows the performance of 6mA detection using the measurement-window-based CNN model for sequenced A bases of the Crick strand according to embodiments of the present invention.
Shows examples of methylation states across the A bases of a molecule including the Watson and Crick strands according to embodiments of the present invention.
Shows an example of enhanced training by selectively using A bases in the mA dataset whose IPD values are above the 10th percentile according to embodiments of the present invention.
Is a graph of the percentage of unmethylated adenines in the mA dataset versus the number of subreads in each well according to embodiments of the present invention.
Shows methyladenine patterns between the Watson and Crick strands of double-stranded DNA molecules in the test dataset according to embodiments of the present invention.
Is a table showing the percentages of fully unmethylated molecules, hemimethylated molecules, fully methylated molecules, and molecules with interlaced methyladenine patterns in the training and test datasets according to embodiments of the present invention.
Shows representative examples of molecules that are fully unmethylated with respect to adenine sites, hemimethylated molecules, fully methylated molecules, and molecules with interlaced methyladenine patterns according to embodiments of the present invention.
Shows an example of a long read (6,265 bp) containing a CpG island (shaded in yellow) according to embodiments of the present invention.
Is a table showing nine DNA molecules that were sequenced by Pacific Biosciences SMRT sequencing and that overlap with imprinted regions according to embodiments of the present invention.
Shows an example of genomic imprinting according to embodiments of the present invention.
Shows an example of determining methylation patterns in an imprinted region according to embodiments of the present invention.
Shows a comparison of the methylation levels estimated by the new approach and by conventional bisulfite sequencing according to embodiments of the present invention.
Shows the performance of detecting methylation in plasma DNA according to embodiments of the present invention. (A) Relationship between the predicted probability of methylation and ranges of methylation levels quantified by bisulfite sequencing. (B) Correlation between methylation levels determined by Pacific Biosciences (PacBio) sequencing according to embodiments of the present disclosure (y-axis) and methylation levels quantified by bisulfite sequencing (x-axis) at 10-Mb resolution.
Same as above.
Shows the correlation of the genomic representation (GR) of the Y chromosome between Pacific Biosciences SMRT sequencing and BS-seq according to embodiments of the present invention.
Shows an example of CpG-block-based detection of methylation using CpG blocks that each contain a series of CpG sites according to embodiments of the present invention. 5mC: methylated; C: unmethylated.
Shows the training and testing of methylation calling for human DNA molecules using the CpG-block-based approach according to embodiments of the present invention. (A) Performance on the training dataset. (B) Performance on an independent test dataset.
Same as above.
Shows copy number changes in tumor tissue according to embodiments of the present invention.
Same as above.
Shows copy number changes in tumor tissue according to embodiments of the present invention.
Same as above.
Shows a schematic of plasma DNA tissue mapping from the plasma of a pregnant woman using estimated methylation levels according to embodiments of the present invention.
Shows the correlation between the estimated placental contribution to maternal plasma DNA and the fetal DNA fraction estimated from Y-chromosome reads according to embodiments of the present invention.
Shows a table summarizing sequencing data from DNA samples of different human tissues according to embodiments of the present invention.
Shows a diagram of various ways of analyzing methylation patterns according to embodiments of the present invention.
Shows a comparison of methylation densities at the whole-genome level quantified by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention.
Same as above.
Shows different correlations of overall methylation levels quantified by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention.
Same as above.
Same as above.
Shows methylation patterns at 1-Mnt resolution for a hepatocellular carcinoma (HCC) cell line and a buffy coat sample from a healthy control subject, with methylation levels determined by bisulfite sequencing and by single-molecule real-time sequencing, according to embodiments of the present invention.
Same as above.
Shows scatter plots of methylation levels at 1-Mnt resolution determined by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention for an HCC cell line (HepG2) and a buffy coat sample from a healthy control subject.
Same as above.
Shows scatter plots of methylation levels at 100-knt resolution determined by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention for an HCC cell line (HepG2) and a buffy coat sample from a healthy control subject.
Same as above.
Shows methylation patterns at 1-Mnt resolution for HCC tumor tissue and adjacent normal tissue, with methylation levels determined by bisulfite sequencing and by single-molecule real-time sequencing, according to embodiments of the present invention.
Same as above.
Shows scatter plots of methylation levels at 1-Mnt resolution determined by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention for HCC tumor tissue and adjacent normal tissue.
Same as above.
Shows scatter plots of methylation levels at 100-knt resolution determined by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention for HCC tumor tissue and adjacent normal tissue.
Same as above.
Shows methylation patterns at 1-Mnt resolution for HCC tumor tissue and adjacent normal tissue, with methylation levels determined by bisulfite sequencing and by single-molecule real-time sequencing, according to embodiments of the present invention.
Same as above.
Shows scatter plots of methylation levels at 1-Mnt resolution determined by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention for HCC tumor tissue and adjacent normal tissue.
Same as above.
Shows scatter plots of methylation levels at 100-knt resolution determined by bisulfite sequencing and by single-molecule real-time sequencing according to embodiments of the present invention for HCC tumor tissue and adjacent normal tissue.
Same as above.
Shows an example of an aberrant methylation pattern near the tumor suppressor gene CDKN2A according to embodiments of the present invention.
Shows differentially methylated regions detected by single-molecule real-time sequencing according to embodiments of the present invention.
Same as above.
Shows methylation patterns of hepatitis B virus DNA in HCC tissue and adjacent non-tumor tissue using single-molecule real-time sequencing according to embodiments of the present invention.
Shows methylation levels of hepatitis B virus DNA in liver tissue from a patient with cirrhosis but without HCC using bisulfite sequencing according to embodiments of the present invention.
Shows methylation levels of hepatitis B virus DNA in HCC tissue using bisulfite sequencing according to embodiments of the present invention.
Shows methylation haplotype analysis according to embodiments of the present invention.
Shows the size distribution of sequenced molecules determined from consensus sequences according to embodiments of the present invention.
Shows examples of allelic methylation patterns in imprinted regions according to embodiments of the present invention.
Same as above.
Same as above.
Same as above.
Shows examples of allelic methylation patterns in non-imprinted regions according to embodiments of the present invention.
Same as above.
Same as above.
Same as above.
Shows a table of methylation levels of allele-specific fragments according to embodiments of the present invention.
Shows an example of using methylation profiles to determine the placental origin of plasma DNA in pregnancy according to embodiments of the present invention.
Shows an analysis of fetal-specific DNA methylation according to embodiments of the present invention.
Shows the performance of different measurement window sizes across different reagent kits for SMRT-seq according to embodiments of the present invention.
Same as above.
Same as above.
Shows the performance of different measurement window sizes across different reagent kits for SMRT-seq, according to embodiments of the present invention.
Same as above.
Same as above.
Shows the correlation of overall methylation levels quantified by bisulfite sequencing and by SMRT-seq (Sequel II Sequencing Kit 2.0), according to embodiments of the present invention.
Same as above.
Same as above.
Shows a comparison of overall methylation levels between various tumor tissues and their paired adjacent non-tumor tissues, according to embodiments of the present invention.
Same as above.
Shows the determination of methylation status using sequence context determined from a circular consensus sequence (CCS), according to embodiments of the present invention.
Shows an ROC curve for the detection of methylated CpG sites using sequence context determined from CCS, according to embodiments of the present invention.
Shows an ROC curve for the detection of methylated CpG sites without CCS information and without prior alignment to a reference genome, according to embodiments of the present invention.
Shows an example of preparing molecules for single-molecule real-time sequencing, according to embodiments of the present invention.
Shows a diagram of a CRISPR/Cas9 system, according to embodiments of the present invention.
Shows an example of Cas9 complexes used to introduce two cuts spanning an end-blocked molecule of interest, according to embodiments of the present invention.
Shows the methylation distribution of Alu regions determined by bisulfite sequencing and by single-molecule real-time sequencing, according to embodiments of the present invention.
Shows the distribution of methylation levels of Alu regions determined by a model using single-molecule real-time sequencing results, according to embodiments of the present invention.
Shows a table of tissues and the methylation levels of Alu regions in those tissues, according to embodiments of the present invention.
Shows a cluster analysis of different cancer types using methylation signals associated with Alu repeats, according to embodiments of the present invention.
Shows the effect of read depth on the quantification of overall methylation levels in test datasets involving whole-genome amplification and M.SssI treatment, according to embodiments of the present invention.
Same as above.
Shows a comparison between overall methylation levels determined by SMRT-seq (Sequel II Sequencing Kit 2.0) and by BS-seq using different subread depth cutoffs, according to embodiments of the present invention.
Is a table showing the effect of subread depth on the correlation of methylation levels between the two measurements, by SMRT-seq (Sequel II Sequencing Kit 2.0) and by BS-seq, according to embodiments of the present invention.
Shows the subread depth distribution with respect to fragment size in data generated by Sequel II Sequencing Kit 2.0, according to embodiments of the present invention.
Shows a method for detecting a modification of a nucleotide in a nucleic acid molecule, according to embodiments of the present invention.
Shows a method for detecting a modification of a nucleotide in a nucleic acid molecule, according to embodiments of the present invention.
Shows a relative haplotype-based methylation imbalance analysis, according to embodiments of the present invention.
Is a table of haplotype blocks showing differential methylation levels between haplotype I (Hap I) and haplotype II (Hap II) in tumor DNA, compared with the adjacent non-tumor tissue DNA, for case TBR3033, according to embodiments of the present invention.
Same as above.
Is a table of haplotype blocks showing differential methylation levels between Hap I and Hap II in tumor DNA, compared with the adjacent normal tissue DNA, for case TBR3032, according to embodiments of the present invention.
Is a table summarizing the number of haplotype blocks showing methylation imbalance between the two haplotypes in tumor tissue and in adjacent non-tumor tissue, based on data generated by Sequel II Sequencing Kit 2.0, according to embodiments of the present invention.
Is a table summarizing the number of haplotype blocks showing methylation imbalance between the two haplotypes in tumor tissues of different tumor stages, based on data generated by Sequel II Sequencing Kit 2.0, according to embodiments of the present invention.
Shows a relative haplotype-based methylation imbalance analysis, according to embodiments of the present invention.
Shows a method of classifying a disorder in an organism having a first haplotype and a second haplotype, according to embodiments of the present invention.
Shows the creation of a human-mouse hybrid fragment in which the human portion is methylated and the mouse portion is unmethylated, according to embodiments of the present invention.
Shows the creation of a human-mouse hybrid fragment in which the human portion is unmethylated and the mouse portion is methylated, according to embodiments of the present invention.
Shows the length distribution of DNA molecules in the DNA mixture after ligation (sample MIX01), according to embodiments of the present invention.
Shows a junction region where a first DNA (A) and a second DNA (B) are joined together, according to embodiments of the present invention.
Shows a methylation analysis of a DNA mixture, according to embodiments of the present invention.
Shows a boxplot of the probability of being methylated for CpG sites in sample MIX01, according to embodiments of the present invention.
Shows the length distribution of DNA molecules in the DNA mixture after cross-ligation for sample MIX02, according to embodiments of the present invention.
Shows a boxplot of the probability of being methylated for CpG sites in sample MIX02, according to embodiments of the present invention.
Is a table comparing methylation determined by bisulfite sequencing and by Pacific Biosciences sequencing for MIX01, according to embodiments of the present invention.
Is a table comparing methylation determined by bisulfite sequencing and by Pacific Biosciences sequencing for MIX02, according to embodiments of the present invention.
Shows methylation levels in 5 Mb bins for human-only DNA and mouse-only DNA for MIX01 and MIX02, according to embodiments of the present invention.
Same as above.
Shows methylation levels in 5 Mb bins for the human portion and the mouse portion of human-mouse hybrid DNA fragments for MIX01 and MIX02, according to embodiments of the present invention.
Same as above.
Is a representative graph showing methylation states in a single human-mouse hybrid molecule, according to embodiments of the present invention.
Same as above.
Shows a method of detecting chimeric molecules in a biological sample, according to embodiments of the present invention.
Shows a measurement system, according to embodiments of the present invention.
Shows a block diagram of an example computer system usable with the systems and methods, according to embodiments of the present invention.
Shows MspI-based targeted single-molecule real-time sequencing using DNA end repair and A-tailing, according to embodiments of the present invention.
Shows the size distribution of MspI-digested fragments, according to embodiments of the present invention.
Same as above.
Shows a table of the number of DNA molecules for certain selected size ranges, according to embodiments of the present invention.
Is a graph of the percent coverage of CpG sites within CpG islands versus DNA fragment size after restriction enzyme digestion, according to embodiments of the present invention.
Shows MspI-based targeted single-molecule real-time sequencing without DNA end repair and A-tailing, according to embodiments of the present invention.
Shows MspI-based targeted single-molecule real-time sequencing with a reduced probability of adapter self-ligation, according to embodiments of the present invention.
Is a graph of overall methylation levels for placenta and buffy coat DNA samples determined by MspI-based targeted single-molecule real-time sequencing, according to embodiments of the present invention.
Shows a cluster analysis of placenta and buffy coat samples using DNA methylation profiles determined by MspI-based targeted single-molecule real-time sequencing, according to embodiments of the present invention.

TERMS
A "tissue" corresponds to a group of cells that group together as a functional unit. More than one type of cell can be found in a single tissue. Different types of tissue may consist of different types of cells (e.g., hepatocytes, alveolar cells, or blood cells), but may also correspond to tissue from different organisms (mother versus fetus; tissue in a subject who has received a transplant; tissue of an organism infected by a microorganism or a virus) or to healthy cells versus tumor cells. A "reference tissue" corresponds to a tissue used to determine tissue-specific methylation levels. Multiple samples of the same tissue type from different individuals may be used to determine a tissue-specific methylation level for that tissue type.

A "biological sample" refers to any sample taken from a human subject. The biological sample can be a tissue biopsy, a fine needle aspirate, or blood cells. The sample can also be, for example, plasma, serum, or urine from a pregnant woman. Stool samples can also be used. In various embodiments, the majority of the DNA in a biological sample from a pregnant woman that has been enriched for cell-free DNA (e.g., a plasma sample obtained via a centrifugation protocol) can be cell-free, e.g., more than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA can be cell-free. The centrifugation protocol can include, for example, centrifuging at 3,000 g for 10 minutes, obtaining the fluid part, and re-centrifuging at, for example, 30,000 g for another 10 minutes to remove residual cells. In certain embodiments, the 3,000 g centrifugation step can be followed by filtration of the fluid part (e.g., using a filter with a pore size of 5 μm or smaller in diameter).

A "sequence read" refers to a string of nucleotides sequenced from any part or all of a nucleic acid molecule. For example, a sequence read may be a short string of nucleotides (e.g., about 20 to 150 nucleotides) sequenced from a nucleic acid fragment, a short string of nucleotides at one or both ends of a nucleic acid fragment, or the sequence of an entire nucleic acid fragment present in the biological sample. A sequence read may be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or with capture probes, or using amplification techniques such as the polymerase chain reaction (PCR), linear amplification using a single primer, or isothermal amplification.

A "subread" is a sequence generated from all the bases in one strand of a circularized DNA template that has been copied into one contiguous strand by a DNA polymerase. For example, a subread can correspond to one strand of the DNA in a circularized DNA template. In such an example, after circularization, one double-stranded DNA molecule yields two subreads for each sequencing pass, one per strand. In some embodiments, the generated sequence may include only a subset of the bases in one strand, e.g., because of sequencing errors.

A "site" (also called a "genomic site") corresponds to a single site, which may be a single base position or a group of correlated base positions, e.g., a CpG site or a larger group of correlated base positions. A "locus" may correspond to a region that includes multiple sites. A locus can also include just one site, which would make the locus equivalent to a site in that context.

A "methylation status" refers to the state of methylation at a given site. For example, a site may be methylated, unmethylated, or, in some cases, undetermined.

The "methylation index" for each genomic site (e.g., a CpG site) can refer to the proportion of DNA fragments (e.g., as determined from sequence reads or probes) showing methylation at the site over the total number of reads covering that site. A "read" can correspond to information (e.g., the methylation status at a site) obtained from a DNA fragment. A read can be obtained using reagents (e.g., primers or probes) that preferentially hybridize to DNA fragments of a particular methylation status at one or more sites. Typically, such reagents are applied after treatment with a process that differentially modifies or differentially recognizes DNA molecules depending on their methylation status, e.g., bisulfite conversion, methylation-sensitive restriction enzymes, methylation-binding proteins, anti-methylcytosine antibodies, or single-molecule sequencing techniques that recognize methylcytosines and hydroxymethylcytosines (e.g., single-molecule real-time sequencing and nanopore sequencing (e.g., from Oxford Nanopore Technologies)).
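The per-site proportion described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the embodiments: the input format (one `(site, is_methylated)` pair per read covering a site) and the function name `methylation_index` are assumptions made for the example.

```python
from collections import defaultdict

def methylation_index(calls):
    """Per-site methylation index.

    calls: iterable of (site, is_methylated) pairs, one pair per read
    covering a site (hypothetical input format for this sketch).
    Returns {site: reads showing methylation / reads covering the site}.
    """
    methylated = defaultdict(int)
    total = defaultdict(int)
    for site, is_methylated in calls:
        total[site] += 1
        if is_methylated:
            methylated[site] += 1
    return {site: methylated[site] / total[site] for site in total}

# Four reads cover one CpG site (three methylated); one read covers another.
example_calls = [
    ("chr1:100", True), ("chr1:100", False),
    ("chr1:100", True), ("chr1:100", True),
    ("chr1:250", False),
]
index_by_site = methylation_index(example_calls)
```

Sites with no covering reads never appear in the result, so no division by zero can occur.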

The "methylation density" of a region can refer to the number of reads at sites within the region showing methylation divided by the total number of reads covering the sites in the region. The sites may have specific characteristics, e.g., being CpG sites. Thus, the "CpG methylation density" of a region can refer to the number of reads showing CpG methylation divided by the total number of reads covering CpG sites in the region (e.g., a particular CpG site, CpG sites within a CpG island, or CpG sites within a larger region). For example, the methylation density of each 100 kb bin in the human genome can be determined from the total number of cytosines not converted after bisulfite treatment (which correspond to methylated cytosines) at CpG sites, as a proportion of all CpG sites covered by sequence reads mapped to the 100 kb region. This analysis can also be performed for other bin sizes, e.g., 500 bp, 5 kb, 10 kb, 50 kb, or 1 Mb. A region can be the entire genome, a chromosome, or part of a chromosome (e.g., a chromosomal arm). The methylation index of a CpG site is the same as the methylation density of a region when the region includes only that CpG site. The "proportion of methylated cytosines" can refer to the number of cytosine sites ("C's") shown to be methylated (e.g., unconverted after bisulfite conversion) over the total number of analyzed cytosine residues in the region, i.e., including cytosines outside the CpG context. The methylation index, the methylation density, the count of molecules methylated at one or more sites, and the proportion of molecules (e.g., cytosines) methylated at one or more sites are examples of "methylation levels." Apart from bisulfite conversion, other processes known to those skilled in the art can be used to interrogate the methylation status of DNA molecules, including, but not limited to, enzymes sensitive to the methylation status (e.g., methylation-sensitive restriction enzymes), methylation-binding proteins, and single-molecule sequencing using a platform sensitive to the methylation status (e.g., nanopore sequencing (Schreiber et al. Proc Natl Acad Sci 2013; 110: 18910-18915) and single-molecule real-time sequencing (e.g., from Pacific Biosciences) (Flusberg et al. Nat Methods 2010; 7: 461-465)).
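As a rough illustration of the binned CpG methylation density defined above, the following Python sketch pools reads over all CpG sites that fall in each fixed-size bin. The input tuples `(chrom, position, is_methylated)`, the function name, and the default 100 kb bin size are assumptions for the example; how each read's call is obtained (bisulfite conversion or otherwise) is outside the sketch.

```python
from collections import defaultdict

def methylation_density(calls, bin_size=100_000):
    """CpG methylation density per fixed-size bin.

    calls: iterable of (chrom, position, is_methylated), one entry per
    read per CpG site (hypothetical input format for this sketch).
    Returns {(chrom, bin_start): reads showing methylation / reads
    covering CpG sites in the bin}.
    """
    methylated = defaultdict(int)
    total = defaultdict(int)
    for chrom, position, is_methylated in calls:
        key = (chrom, (position // bin_size) * bin_size)
        total[key] += 1
        methylated[key] += int(is_methylated)
    return {key: methylated[key] / total[key] for key in total}

example_calls = [
    ("chr1", 10, True), ("chr1", 99_999, False),   # first 100 kb bin
    ("chr1", 100_000, True), ("chr1", 150_000, True),  # second bin
]
density_by_bin = methylation_density(example_calls)
```

Setting `bin_size=1` reduces each bin to a single position, in which case the result coincides with the per-site methylation index, consistent with the definition above.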

A "methylome" provides a measure of the amount of DNA methylation at a plurality of sites or loci in a genome. The methylome may correspond to all of the genome, a substantial part of the genome, or a relatively small portion or portions of the genome.

A "pregnant plasma methylome" is the methylome determined from the plasma or serum of a pregnant animal (e.g., a human). The pregnant plasma methylome is an example of a cell-free methylome, since plasma and serum contain cell-free DNA. The pregnant plasma methylome is also an example of a mixed methylome, since it is a mixture of DNA from different organs, tissues, or cells within the body. In one embodiment, such cells are hematopoietic cells, including, but not limited to, cells of the erythroid (i.e., red cell) lineage, the myeloid lineage (e.g., neutrophils and their precursors), and the megakaryocytic lineage. In pregnancy, the plasma methylome may contain methylomic information from both the fetus and the mother. A "cellular methylome" corresponds to the methylome determined from cells (e.g., blood cells) of the patient. The methylome of blood cells is called the blood cell methylome (or blood methylome).

A "methylation profile" includes information related to DNA or RNA methylation at multiple sites or regions. Information related to DNA methylation can include, but is not limited to, the methylation index of a CpG site, the methylation density (abbreviated MD) of CpG sites in a region, the distribution of CpG sites over a contiguous region, the pattern or level of methylation of each individual CpG site within a region that contains more than one CpG site, and non-CpG methylation. In one embodiment, the methylation profile can include the pattern of methylation or non-methylation of more than one type of base (e.g., cytosine or adenine). A methylation profile of a substantial part of the genome can be considered equivalent to the methylome. "DNA methylation" in mammalian genomes typically refers to the addition of a methyl group to the 5' carbon of cytosine residues (i.e., 5-methylcytosines) among CpG dinucleotides. DNA methylation may also occur at cytosines in other contexts, for example CHG and CHH, where H is adenine, cytosine, or thymine. Cytosine methylation may also take the form of 5-hydroxymethylcytosine. Non-cytosine methylation, such as N⁶-methyladenine, has also been reported.

A "methylation pattern" refers to the order of methylated and unmethylated bases. For example, the methylation pattern can be the order of methylated bases on a single DNA strand, a single double-stranded DNA molecule, or another type of nucleic acid molecule. As an example, three consecutive CpG sites may have any of the following methylation patterns: UUU, MMM, UMM, UMU, UUM, MUM, MUU, or MMU, where "U" indicates an unmethylated site and "M" indicates a methylated site. When this concept is extended to base modifications that include, but are not limited to, methylation, the term "modification pattern" is used to refer to the order of modified and unmodified bases. For example, the modification pattern can be the order of modified bases on a single DNA strand, a single double-stranded DNA molecule, or another type of nucleic acid molecule. As an example, three consecutive potentially modifiable sites may have any of the following modification patterns: UUU, MMM, UMM, UMU, UUM, MUM, MUU, or MMU, where "U" indicates an unmodified site and "M" indicates a modified site. One example of a base modification that is not based on methylation is an oxidative change, such as in 8-oxoguanine.
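The eight patterns listed above are simply all strings of length three over the two-letter alphabet {U, M}. A short Python sketch, using the standard-library `itertools.product`, enumerates them for any number of sites; the function names are illustrative and not taken from the source.

```python
from itertools import product

def modification_patterns(n_sites):
    """All possible U/M patterns over n_sites consecutive sites (2**n_sites)."""
    return ["".join(states) for states in product("UM", repeat=n_sites)]

def pattern_from_calls(calls):
    """Encode per-site calls (True = modified) along one molecule as a string."""
    return "".join("M" if modified else "U" for modified in calls)

three_site_patterns = modification_patterns(3)
observed = pattern_from_calls([False, True, True])
```

For three sites this yields the same eight patterns as in the text, and the number of possible patterns doubles with each additional site.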

The terms "hypermethylated" and "hypomethylated" may refer to the methylation density of a single DNA molecule as measured by its single-molecule methylation level, e.g., the number of methylated bases or nucleotides within the molecule divided by the total number of methylatable bases or nucleotides within that molecule. A hypermethylated molecule is one whose single-molecule methylation level is at or above a threshold, which may be defined from application to application. The threshold may be 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95%. A hypomethylated molecule is one whose single-molecule methylation level is at or below a threshold, which may be defined from application to application and may change from application to application. The threshold may be 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95%.
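The single-molecule methylation level and the threshold-based labels can be sketched as follows. This is an illustration under stated assumptions: the molecule is represented as a string of "M" and "U" over its methylatable sites, the 80% and 20% thresholds are picked from the listed options purely for the example, and the "intermediate" label for molecules between the two thresholds is an assumption not found in the source.

```python
def single_molecule_methylation_level(pattern):
    """Methylated sites divided by methylatable sites for one molecule.

    pattern: string of 'M' (methylated) and 'U' (unmethylated), one
    character per methylatable site (hypothetical representation).
    """
    if not pattern:
        raise ValueError("molecule has no methylatable sites")
    return pattern.count("M") / len(pattern)

def classify_molecule(pattern, hyper_threshold=0.8, hypo_threshold=0.2):
    """Label a molecule using application-defined thresholds (assumed values)."""
    level = single_molecule_methylation_level(pattern)
    if level >= hyper_threshold:
        return "hypermethylated"
    if level <= hypo_threshold:
        return "hypomethylated"
    return "intermediate"  # label assumed for this sketch

labels = [classify_molecule(p) for p in ("MMMMU", "UUUUM", "MMUU")]
```

Because the thresholds are application-specific, they are exposed as parameters rather than fixed constants.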

The terms "hypermethylated" and "hypomethylated" may also refer to the methylation level of a population of DNA molecules as measured by the multiple-molecule methylation level of those molecules. A hypermethylated population of molecules is one whose multiple-molecule methylation level is at or above a threshold, which may be defined from application to application and may change from application to application. The threshold may be 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95%. A hypomethylated population of molecules is one whose multiple-molecule methylation level is at or below a threshold, which may be defined from application to application and may change from application to application. The threshold may be 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95%. In one embodiment, the population of molecules may be aligned to one or more selected genomic regions. In one embodiment, the selected genomic region(s) may be related to a disease such as cancer, a genetic disorder, an imprinting disorder, a metabolic disorder, or a neurological disorder. The selected genomic region(s) can have a length of 50 nucleotides (nt), 100 nt, 200 nt, 300 nt, 500 nt, 1000 nt, 2 knt, 5 knt, 10 knt, 20 knt, 30 knt, 40 knt, 50 knt, 60 knt, 70 knt, 80 knt, 90 knt, 100 knt, 200 knt, 300 knt, 400 knt, 500 knt, or 1 Mnt.

The term "sequencing depth" refers to the number of times a locus is covered by sequence reads aligned to that locus. The locus can be as small as a nucleotide, as large as a chromosome arm, or as large as the entire genome. Sequencing depth can be expressed as 50x, 100x, etc., where "x" refers to the number of times a locus is covered by sequence reads. Sequencing depth can also be applied to multiple loci or to the whole genome, in which case "x" can refer to the mean number of times the loci, the haploid genome, or the whole genome, respectively, is sequenced. Ultra-deep sequencing can refer to a sequencing depth of at least 100x.
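The two senses of sequencing depth above (coverage at a single position, and mean coverage over a region) can be sketched in Python. The representation of aligned reads as half-open `(start, end)` intervals on one reference sequence and the function names are assumptions made for this example.

```python
def depth_at(reads, position):
    """Number of aligned reads covering one position.

    reads: iterable of (start, end) half-open intervals on a single
    reference sequence (hypothetical representation).
    """
    return sum(1 for start, end in reads if start <= position < end)

def mean_depth(reads, region_start, region_end):
    """Mean depth over the half-open region [region_start, region_end)."""
    if region_end <= region_start:
        raise ValueError("region must have positive length")
    covered_bases = sum(
        max(0, min(end, region_end) - max(start, region_start))
        for start, end in reads
    )
    return covered_bases / (region_end - region_start)

example_reads = [(0, 100), (50, 150), (90, 200)]
depth_at_95 = depth_at(example_reads, 95)
mean_over_region = mean_depth(example_reads, 0, 200)
```

Here three reads overlap position 95, and the reads contribute 310 aligned bases over a 200-base region, i.e., a mean depth of 1.55x.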

As used herein, the term "classification" refers to any number(s) or other character(s) associated with a particular property of a sample. For example, a "+" symbol (or the word "positive") can signify that a sample is classified as having deletions or amplifications. The classification can be binary (e.g., positive or negative) or have more levels of classification (e.g., a scale from 1 to 10 or from 0 to 1).

The terms "cutoff" and "threshold" refer to predetermined numbers used in an operation. For example, a cutoff size can refer to a size above which fragments are excluded. A threshold can be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts. A cutoff or threshold may be a "reference value", or may be derived from a reference value that is representative of a particular classification or that discriminates between two or more classifications. Such a reference value can be determined in various ways, as will be appreciated by those skilled in the art. For example, a metric can be determined for two different cohorts of subjects with different known classifications, and a reference value can be selected as representative of one classification (e.g., a mean) or as a value between the two clusters of the metric (e.g., chosen to obtain a desired sensitivity and specificity). As another example, a reference value can be determined based on statistical analysis or simulation of samples.
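One of the options above, a reference value lying between two cohorts, can be sketched as follows. The midpoint-of-means rule is only one simple choice among many, and the sketch assumes that a higher metric value indicates the positive classification; the function names are hypothetical.

```python
from statistics import mean


def midpoint_cutoff(cohort_a, cohort_b):
    """Reference value halfway between the mean metric of two cohorts."""
    return (mean(cohort_a) + mean(cohort_b)) / 2


def sensitivity_specificity(cases, controls, cutoff):
    """Sensitivity and specificity when values >= cutoff are called positive."""
    true_positives = sum(1 for value in cases if value >= cutoff)
    true_negatives = sum(1 for value in controls if value < cutoff)
    return true_positives / len(cases), true_negatives / len(controls)


if __name__ == "__main__":
    controls = [1.0, 2.0, 3.0]   # metric values, known negative cohort
    cases = [5.0, 6.0, 7.0]      # metric values, known positive cohort
    cutoff = midpoint_cutoff(controls, cases)
    print(cutoff, sensitivity_specificity(cases, controls, cutoff))
```

In practice the cutoff could instead be scanned over candidate values until the desired sensitivity and specificity are reached.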

The term "level of cancer" can refer to whether cancer exists (i.e., presence or absence), a stage of a cancer, a size of a tumor, whether there is metastasis, the total tumor burden of the body, the cancer's response to treatment, and/or another measure of the severity of a cancer (e.g., recurrence of cancer). The level of cancer may be a number or other indicia, such as symbols, alphabet letters, and colors. The level may be zero. The level of cancer can also include premalignant or precancerous conditions (states). The level of cancer can be used in various ways. For example, screening can check whether cancer is present in a person who was not previously known to have cancer. Assessment can investigate a person who has been diagnosed with cancer to monitor the progress of the cancer over time, study the effectiveness of therapies, or determine the prognosis. In one embodiment, the prognosis can be expressed as the chance of a patient dying of cancer, the chance of the cancer progressing after a specific duration or time, or the chance or extent of the cancer metastasizing. Detection can mean "screening", or it can mean checking whether a person who has suggestive features of cancer (e.g., symptoms or other positive tests) has cancer.

A "level of pathology" (or level of disorder) can refer to the amount, degree, or severity of pathology associated with an organism, where the level can be as described above for cancer. Another example of pathology is rejection of a transplanted organ. Other example pathologies include gene imprinting disorders, autoimmune attack (e.g., lupus nephritis damaging the kidney, or multiple sclerosis), inflammatory diseases (e.g., hepatitis), fibrotic processes (e.g., cirrhosis), fatty infiltration (e.g., fatty liver disease), degenerative processes (e.g., Alzheimer's disease), and ischemic tissue damage (e.g., myocardial infarction or stroke). A healthy state of a subject can be considered a classification of no pathology.

A "pregnancy-associated disorder" includes any disorder characterized by abnormal relative expression levels of genes in maternal and/or fetal tissue. These disorders include, but are not limited to, preeclampsia, intrauterine growth restriction, invasive placentation, preterm birth, hemolytic disease of the newborn, placental insufficiency, hydrops fetalis, fetal malformation, HELLP syndrome, systemic lupus erythematosus, and other immunological diseases of the mother.

The abbreviation "bp" refers to base pairs. In some instances, "bp" may be used to denote the length of a DNA fragment even though the DNA fragment is single-stranded and does not include a base pair. In the context of single-stranded DNA, "bp" may be interpreted as giving the length in nucleotides.

The abbreviation "nt" refers to nucleotides. In some instances, "nt" may be used to denote the length of a single-stranded DNA in bases. "nt" may also be used to denote relative positions, such as upstream or downstream of the locus being analyzed. In some contexts concerning technological conceptualization, data presentation, processing, and analysis, "nt" and "bp" may be used interchangeably.

The term "sequence context" can refer to the base composition (A, C, G, or T) and the base order in a stretch of DNA. Such a stretch of DNA may surround a base that is subjected to, or is the target of, base modification analysis. For example, the sequence context can refer to bases upstream and/or downstream of a base that is subjected to base modification analysis.

The term "kinetic features" can refer to features derived from sequencing, including single-molecule real-time sequencing. Such features can be used for base modification analysis. Examples of kinetic features include upstream and downstream sequence context, strand information, interpulse duration, pulse width, and pulse strength. Single-molecule real-time sequencing continuously monitors the effects of polymerase activity on a DNA template. Measurements generated in such sequencing, e.g., the nucleotide sequence, can therefore be regarded as kinetic features.

The term "machine learning model" may include models based on using sample data (e.g., training data) to make predictions on test data, and thus may include supervised learning. Machine learning models are often developed using a computer or a processor. Machine learning models may include statistical models.

The term "data analysis framework" may include algorithms and/or models that can take data as input and then output a predicted result. Examples of data analysis frameworks include statistical models, mathematical models, machine learning models, other artificial intelligence models, and combinations thereof.

The term "real-time sequencing" may refer to a technique that involves data collection or monitoring during the progress of a reaction involved in sequencing. For example, real-time sequencing may involve optical monitoring or filming of a DNA polymerase as it incorporates a new base.

The terms "about" or "approximately" can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, "about" can mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, "about" can mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term "about" or "approximately" can mean within an order of magnitude, within 5-fold, and more preferably within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated, the term "about" meaning within an acceptable error range for the particular value should be assumed. The term "about" can have the meaning commonly understood by one of ordinary skill in the art. The term "about" can refer to ±10%. The term "about" can refer to ±5%.

Achieving bisulfite-free determination of base modifications, including methylated bases, has been the subject of various research efforts, but none has been shown to be commercially viable. Recently, a bisulfite-free method for detecting 5mC and 5hmC was published that uses mild conditions for base conversion of 5mC and 5hmC (Y. Liu et al., 2019). This method involves multiple steps of enzymatic and chemical reactions, including ten-eleven translocation (TET) oxidation, pyridine borane reduction, and PCR. The efficiency of each step of the conversion reaction, as well as PCR bias, adversely affects the final accuracy of the 5mC analysis. For example, the conversion rate for 5mC is reported to be about 96%, with a false-negative rate of about 3%. Such performance may limit the ability to detect certain subtle changes in methylation in a genome. In addition, enzymatic conversion may not perform equally well across the whole genome. For example, the conversion rate for 5hmC was 8.2% lower than that for 5mC, and the conversion rate for non-CpG contexts was 11.4% lower than that for CpG contexts (Y. Liu et al., 2019). The ideal situation would therefore be to develop an approach for measuring base modifications of native DNA molecules without any prior conversion step (chemical, enzymatic, or a combination thereof), and even without an amplification step.

Several proof-of-concept studies (Q. Liu et al., 2019; Ni et al., 2019) have shown that the electrical signals generated by long-read nanopore sequencing approaches (e.g., using a system developed by Oxford Nanopore Technologies) can be used to detect methylation states with deep learning methods. In addition to Oxford Nanopore, there are other single-molecule sequencing approaches that enable long reads. One example is single-molecule real-time sequencing, an example of which has been commercialized as the Pacific Biosciences SMRT system. Because the principle of single-molecule real-time sequencing (e.g., the Pacific Biosciences SMRT system) differs from that of non-optical nanopore systems (e.g., Oxford Nanopore Technologies), the base modification detection approaches developed for such non-optical nanopore systems cannot be used for single-molecule real-time sequencing. For example, non-optical nanopore systems are not designed to capture the patterns of fluorescent signals generated by DNA synthesis based on an immobilized DNA polymerase (as employed in single-molecule real-time sequencing such as the Pacific Biosciences SMRT system). As a further example, on the Oxford Nanopore sequencing platform, each measured electrical event is associated with a k-mer (e.g., a 5-mer) (Q. Liu et al., 2019). On the Pacific Biosciences SMRT sequencing platform, however, each fluorescent event is generally associated with a single incorporated base. Furthermore, in Pacific Biosciences SMRT sequencing a single DNA molecule, including its Watson and Crick strands, is sequenced multiple times. In contrast, in the Oxford Nanopore long-read sequencing approach, the sequence is read out once for each of the Watson and Crick strands.
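Because a single molecule is read multiple times in SMRT sequencing, each base can have several kinetic measurements. The following minimal sketch averages a per-base value (for example, an interpulse duration) across passes. It assumes the passes have already been aligned to a common coordinate system for one strand and have equal length; that pre-processing and the function name are hypothetical simplifications.

```python
def mean_per_position(passes):
    """Average a per-base measurement across repeated passes of one strand.

    passes: list of equal-length lists, one list of per-base values per pass,
    assumed to be already aligned to the same coordinates.
    """
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must cover the same positions")
    # zip(*passes) groups the values observed at each position across passes.
    return [sum(column) / len(passes) for column in zip(*passes)]


if __name__ == "__main__":
    print(mean_per_position([[1.0, 2.0], [3.0, 4.0]]))
```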

Polymerase kinetics have been reported to be affected by the methylation state of sequences in E. coli (Flusberg et al., 2010). Previous studies showed that using the polymerase kinetics of single-molecule real-time sequencing to deduce the methylation state (5mC versus C) of a particular CpG in a single molecule is more challenging than detecting 6mA, 4mC, 5hmC, and 8-oxoguanine. The reason is that the methyl group is small and oriented toward the major groove, is not involved in base pairing, and causes only very subtle perturbations in the kinetics attributable to 5mC (Clark et al., 2013). Approaches for determining the methylation state of cytosines at the single-molecule level are therefore lacking.

Suzuki et al. developed an algorithm that attempts to combine the interpulse duration (IPD) ratios of neighboring CpG sites to increase confidence in identifying the methylation states of those sites (Suzuki et al., 2016). However, this algorithm could only predict genomic regions that were either fully methylated or fully unmethylated, and lacked the ability to determine intermediate methylation patterns.

For single-molecule real-time sequencing, current approaches use only one or two parameters individually and achieve very limited accuracy in detecting 5mC from the differences in measurements between 5-methylcytosine and cytosine. For example, Flusberg et al. demonstrated that the IPD is altered at base modifications including N6-methyladenosine, 5-methylcytosine, and 5-hydroxymethylcytosine. However, no significant effect on the pulse width (PW) of the sequencing kinetics was found. Accordingly, the method they used to predict base modifications, taking the detection of N6-methyladenosine as an example, used only the IPD and not the PW.

In follow-up publications by the same group (Clark et al., 2012; Clark et al., 2013), the IPD, but not the PW, was incorporated into the algorithm for detecting 5-methylcytosine. In Clark et al. (2012), the detection rate of 5-methylcytosine without prior conversion ranged from 1.9% to 4.3%. Furthermore, in Clark et al. (2013), the authors again confirmed the subtlety of the kinetic signature of 5-methylcytosine. To overcome the low sensitivity of 5-methylcytosine detection, Clark et al. further developed a method that improves sensitivity by using ten-eleven translocation (Tet) proteins to convert 5-methylcytosine to 5-carboxylcytosine (Clark et al., 2013). This was because the change in IPD caused by 5-carboxylcytosine is much greater than that caused by 5-methylcytosine.

A recent report by Blow et al. used the IPD ratio-based method previously described by Flusberg et al. to detect base modifications in 217 bacterial species and 13 archaeal species at 130-fold read coverage per organism (Blow et al., 2016). Of all the base modifications they identified, only 5% involved 5-methylcytosine. They attributed this low detection rate to the low sensitivity of single-molecule real-time sequencing for detecting 5-methylcytosine. In most bacteria, a set of sequence motifs is targeted for methylation by DNA methyltransferases (MTases) at nearly all occurrences of those motifs in the genome (e.g., 5'-GmATC-3' by Dam or 5'-CmCWGG-3' by Dcm in E. coli), and only a small fraction of these motif sites remain unmethylated (Beaulaurier et al., 2019). Furthermore, when an IPD-based method was used to classify the methylation state of the second C of the 5'-CCWGG-3' motif, the detection rate of 5-methylcytosine was 95.2% with Tet protein treatment and 1.9% without it (Clark et al., 2013). Overall, IPD methods without prior base conversion (e.g., using Tet proteins) missed the majority of 5-methylcytosines.

In the studies mentioned above (Clark et al., 2012; Clark et al., 2013; Blow et al., 2016), IPD-based algorithms were used without considering the sequence context in which a candidate base modification is located. Other groups have attempted to take the sequence context of nucleotides into account when detecting base modifications. For example, Feng et al. used a hierarchical model to analyze IPDs for detecting 4-methylcytosine and 6-methyladenosine in their respective sequence contexts (Feng et al., 2013). However, their method considered only the IPD at the base of interest and the sequence context adjacent to that base, and did not use the IPD information of all neighboring bases flanking the base of interest. Furthermore, the PW was not considered in the algorithm, and no data on the detection of 5-methylcytosine were presented.

In another study, Schadt et al. developed a statistical method, termed a conditional random field, to analyze the IPD information of a base of interest and of neighboring bases in order to determine whether the base of interest was 5-methylcytosine (Schadt et al., 2012). In that study, the IPD interactions between those bases were taken into account by entering them into an equation. However, the nucleotide sequence itself, i.e., A, T, G, or C, was not entered into their equation. When they applied this method to determine the methylation status of an M.Sau3AI plasmid, the area under the ROC curve was close to 0.5, even at 800-fold sequence coverage of the plasmid sequence. Furthermore, their method did not consider the PW in the analysis.

In yet another study, Beckman et al. compared the IPDs of all sequences sharing the same 4-nt or 6-nt motif in a genome between the target bacterial genome and a fully unmethylated genome (e.g., obtained through whole-genome amplification) (Beckman et al., 2014). The purpose of such an analysis was only to identify the motifs that were more frequently affected by base modifications. In that study, they considered only the IPD of the potentially modified base, and not the IPDs of neighboring bases or the PW. Their method was not informative about the methylation status of individual nucleotides.
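The kind of motif-level comparison described above can be sketched as follows. The sketch groups IPD observations by motif and reports, for each motif, the ratio of the mean IPD in the native sample to the mean IPD in an unmethylated control. Using a ratio of means as the comparison statistic is an assumption made for illustration, not the statistic reported by Beckman et al., and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def motif_ipd_ratios(native, control):
    """Per-motif ratio of mean native IPD to mean control IPD.

    native, control: iterables of (motif, ipd) observations. Motifs that
    appear in only one of the two data sets are skipped.
    """
    def group_by_motif(observations):
        grouped = defaultdict(list)
        for motif, ipd in observations:
            grouped[motif].append(ipd)
        return grouped

    native_by_motif = group_by_motif(native)
    control_by_motif = group_by_motif(control)
    return {
        motif: mean(values) / mean(control_by_motif[motif])
        for motif, values in native_by_motif.items()
        if motif in control_by_motif
    }


if __name__ == "__main__":
    native = [("GATC", 3.0), ("GATC", 5.0), ("CCAGG", 1.0)]
    control = [("GATC", 1.0), ("GATC", 1.0), ("CCAGG", 1.0)]
    print(motif_ipd_ratios(native, control))
```

A ratio well above 1 for a motif would suggest that the motif is frequently modified, but, as noted above, such a summary says nothing about any individual nucleotide.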

In summary, these previous attempts, which either used the IPD alone or combined the IPD with the sequence information of neighboring nucleotides in order to group the data, were unable to determine the 5-methylcytosine base modification with meaningful or practical accuracy. In a recent review, Gouil et al. concluded that detection of 5-methylcytosine in single molecules using single-molecule real-time sequencing is inaccurate because of the low signal-to-noise ratio (Gouil et al., 2019). From these previous studies, it remains unclear whether it is feasible to use kinetic features for genome-wide methylomic analysis, particularly for complex genomes such as the human genome, cancer genomes, or fetal genomes.

In contrast to previous studies, some embodiments of the methods described in this disclosure are based on measuring and using the IPD, the PW, and the sequence context for every base within a measurement window. The inventors reasoned that if multiple metrics could be used in combination, for example by simultaneously using features including upstream and downstream sequence context, strand information, IPD, pulse width, and pulse strength, then accurate measurement of base modifications (e.g., mC detection) at single-base resolution might be achieved. Sequence context refers to the base composition (A, C, G, or T) and the base order in a stretch of DNA. Such a stretch of DNA may surround a base that is subjected to, or is the target of, base modification analysis. In one embodiment, the stretch of DNA may be proximal to the base subjected to base modification analysis. In another embodiment, the stretch of DNA may be far from the base subjected to base modification analysis. The stretch of DNA may be upstream and/or downstream of the base subjected to base modification analysis.

In one embodiment, the features used for base modification analysis, namely upstream and downstream sequence context, strand information, IPD, pulse width, and pulse strength, are referred to as kinetic features.
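To make the idea of a measurement window concrete, the following minimal sketch collects, for each base in a window centered on a target base, a one-hot encoding of the base identity together with its IPD and PW. The row layout, the symmetric flank size, and the function name `window_features` are illustrative assumptions rather than the specific data structure of any embodiment; strand information and pulse strength are omitted for brevity.

```python
BASES = "ACGT"


def window_features(sequence, ipd, pw, target, flank=3):
    """Build one feature row per base in a window around a target base.

    sequence: read sequence (string over A, C, G, T)
    ipd, pw:  per-base interpulse durations and pulse widths (same length)
    target:   0-based index of the base under analysis
    flank:    number of bases taken upstream and downstream of the target
    Each row is [is_A, is_C, is_G, is_T, ipd, pw].
    """
    if not (len(sequence) == len(ipd) == len(pw)):
        raise ValueError("sequence, ipd and pw must have the same length")
    if target - flank < 0 or target + flank >= len(sequence):
        raise ValueError("measurement window extends past the read")
    rows = []
    for i in range(target - flank, target + flank + 1):
        one_hot = [1.0 if sequence[i] == base else 0.0 for base in BASES]
        rows.append(one_hot + [float(ipd[i]), float(pw[i])])
    return rows


if __name__ == "__main__":
    seq = "ACGTACG"
    ipd = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
    pw = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
    for row in window_features(seq, ipd, pw, target=3, flank=2):
        print(row)
```

A matrix of this kind could then serve as the input to a statistical or machine learning model.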

Embodiments presented in this disclosure can be used for DNA obtained from sources including, but not limited to, cell lines, samples from an organism (e.g., solid organs, solid tissues, samples obtained via endoscopy, blood, plasma or serum or urine from a pregnant woman, chorionic villus biopsies, etc.), samples obtained from the environment (e.g., bacteria, cellular contaminants), and food (e.g., meat). In some embodiments, the methods presented in this disclosure can also be applied after a step in which a portion of the genome is first enriched, for example using hybridization probes (Albert et al., 2007; Okou et al., 2007; Lee et al., 2011), approaches based on physical separation (e.g., by size) or following restriction enzyme digestion (e.g., MspI), or Cas9-based enrichment (Watson et al., 2019). Although enzymatic or chemical conversion is not required for the invention to work, in certain embodiments such a conversion step may be included to further enhance the performance of the invention.

Embodiments of the present disclosure allow improved accuracy, practicality, or convenience in detecting base modifications or measuring modification levels. The modifications can be detected directly. Embodiments can avoid enzymatic or chemical conversion, which may not preserve all of the modification information for detection. In addition, certain enzymatic or chemical conversions may not be compatible with certain types of modifications. Embodiments of the present disclosure can also avoid amplification by PCR, which may not transfer base modification information to the PCR products. Furthermore, both strands of DNA can be sequenced together, thereby enabling the sequence from one strand to be paired with its complementary sequence from the other strand. In contrast, because PCR amplification separates the two strands of double-stranded DNA, such pairing of sequences is difficult.

Methylation profiles, whether determined with or without enzymatic or chemical conversion, can be used to analyze biological samples. In one embodiment, the methylation profile can be used to detect the origin of cellular DNA (e.g., maternal or fetal, tissue, viral, or tumor). Detecting aberrant methylation profiles in tissues helps to identify developmental disorders in individuals and to identify and prognosticate tumors or malignancies. An imbalance of methylation levels between haplotypes can be used to detect disorders, including cancer. Methylation patterns in single molecules can identify chimeric DNA (e.g., between a virus and a human) and hybrid DNA (e.g., between two genes that are not normally fused in a natural genome, or between two species, such as through genetic or genomic manipulation).

Methylation analysis may be improved by enhanced training, which includes narrowing down the data used in the training set. Specific regions may be targeted for analysis. In embodiments, such targeting may involve an enzyme that, alone or in combination with other reagent(s), can cut a DNA sequence or a genome on the basis of its sequence. In some embodiments, the enzyme is a restriction enzyme that recognizes and cuts specific DNA sequence(s). In other embodiments, two or more restriction enzymes with different recognition sequences can be used in combination. In some embodiments, the restriction enzyme may or may not cut depending on the methylation status of the recognition sequence. In some embodiments, the enzyme is one within the CRISPR/Cas family. For example, a genomic region of interest can be targeted using the CRISPR/Cas9 system or another system based on guide RNAs (i.e., short RNA sequences that bind to complementary target DNA sequences and, in the process, guide an enzyme to act at a target genomic location). In some cases, methylation analysis may be possible without alignment to a reference genome.

I. Methylation Detection by Single-Molecule Real-Time Sequencing
Embodiments of the present disclosure allow base modifications to be detected directly, without enzymatic or chemical conversion. Kinetic features (e.g., sequence context, IPD, and PW) obtained through single-molecule real-time sequencing can be analyzed with machine learning to develop a model that detects a modification or the absence of a modification. The modification levels can be used to determine the origin of a DNA molecule or the presence or level of a disorder.

For illustration, in Pacific Biosciences SMRT sequencing, used here as an example of single-molecule real-time sequencing, a DNA polymerase molecule is positioned at the bottom of a well that functions as a zero-mode waveguide (ZMW). A ZMW is a nanophotonic device for confining light to a small observation volume. It is a hole with a very small diameter that does not permit propagation of light in the wavelength range used for detection, so that only the light emitted by dye-labeled nucleotides incorporated by the immobilized polymerase is detectable against a low and constant background signal (Eid et al., 2009). The DNA polymerase catalyzes the incorporation of fluorescently labeled nucleotides into the complementary nucleic acid strand.

FIG. 1 shows examples of molecules with base modifications sequenced by single-molecule circular consensus sequencing. Molecules 102, 104, and 106 carry base modifications. A DNA molecule (e.g., molecule 106) can be ligated with hairpin adapters to form ligated molecule 108. Ligated molecule 108 can then form circularized molecule 110. The circularized molecule can bind to an immobilized DNA polymerase and initiate DNA synthesis. Molecules without base modifications can also be sequenced.

FIG. 2 shows an example of a molecule with methylated and/or unmethylated CpG sites sequenced by single-molecule real-time sequencing. First, a DNA molecule is ligated with hairpin adapters to form a circularized molecule, which binds to the immobilized DNA polymerase and initiates DNA synthesis. In FIG. 2, DNA molecule 202 is ligated with hairpin adapters to form ligated molecule 204. Ligated molecule 204 then forms circularized molecule 206. Molecules without CpG sites can also be sequenced. Circular molecule 206 contains unmethylated CpG site 208, which can still be sequenced.

Once DNA synthesis starts, fluorescent dye-labeled nucleotides are incorporated into the newly synthesized strand by the immobilized polymerase on the basis of the circular DNA template, leading to the emission of light signals. Because the DNA template is circularized, the entire circular DNA template passes through the polymerase multiple times (i.e., a given nucleotide of the DNA template is sequenced multiple times). A sequence generated from one pass in which all bases of the circularized DNA template go completely through the DNA polymerase is called a subread. Because the polymerase can continue around the entire circular DNA template multiple times, one molecule in a ZMW generates multiple subreads. In one embodiment, a subread may contain only a subset of the sequence, base modifications, or other molecular information of the circular DNA template because of sequencing errors.

As shown in FIG. 3, the arrival times and durations of the resulting fluorescence pulses allow polymerase kinetics to be measured. The interpulse duration (IPD) is a metric for the length of the period between two emission pulses, each of which indicates the incorporation of a fluorescently labeled nucleotide into the nascent strand (FIG. 3). As also shown in FIG. 3, the pulse width (PW) is another metric reflecting polymerase kinetics; it relates to the duration of the pulse associated with a base call. The PW may be the duration of the pulse at 0% of the height of the signal peak (i.e., the fluorescence intensity of the incorporated dye-labeled nucleotide). In one embodiment, the PW may be defined by the duration of the pulse at, for example but not limited to, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90% of the height of the signal peak. In some embodiments, the PW may be the area under the peak divided by the height of the signal peak.
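
The two metrics can be illustrated with a minimal sketch. It assumes that each pulse is already summarized by a start and end time, and that the IPD is the gap between the end of one pulse and the start of the next; the function names and the example times are hypothetical and are not taken from the disclosure or from any sequencer software.

```python
from typing import List, Sequence, Tuple

Pulse = Tuple[float, float]  # (start_time, end_time) of one fluorescence pulse


def interpulse_durations(pulses: Sequence[Pulse]) -> List[float]:
    """Gap between the end of one pulse and the start of the next (assumed IPD definition)."""
    return [pulses[i][0] - pulses[i - 1][1] for i in range(1, len(pulses))]


def pulse_widths(pulses: Sequence[Pulse]) -> List[float]:
    """Duration of each pulse, measured from its start to its end."""
    return [end - start for start, end in pulses]


def area_over_height(intensities: Sequence[float], dt: float) -> float:
    """Alternative PW: area under the peak (rectangle rule) divided by the peak height."""
    peak = max(intensities)
    if peak <= 0:
        raise ValueError("pulse has no positive peak")
    return sum(intensities) * dt / peak


if __name__ == "__main__":
    pulses = [(0.0, 0.5), (1.5, 1.75), (2.0, 3.0)]  # made-up times in seconds
    print(interpulse_durations(pulses))
    print(pulse_widths(pulses))
    print(area_over_height([0.0, 2.0, 4.0, 2.0, 0.0], dt=0.5))
```

A PW defined at a given fraction of the peak height would additionally need the sampled intensity trace of each pulse, which this sketch does not model.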

Polymerase kinetics such as IPD have been shown to be affected by base modifications such as N6-methyladenine (6mA), 5-methylcytosine (5mC), and 5-hydroxymethylcytosine (5hmC) in synthetic and microbial sequences (e.g., E. coli) (Flusberg et al., 2010). Flusberg et al. (2010) did not use sequence context and IPD as independent inputs for detecting modifications, which resulted in a model lacking practically meaningful detection accuracy. Flusberg et al. used sequence context only to confirm that 6mA occurred in GATC. Flusberg et al. do not mention using sequence context in combination with IPD as inputs for detecting methylation status.

The weak perturbation that 5-methylcytosine imposes on the incorporation of a new base into the complementary strand makes methylation calling very difficult when only the IPD signal is used, even in relatively simple microbial genomes: detection of the methylation motif C^mCWGG has been reported to range from only 1.9% to 4.3% (Clark et al., 2013). For example, the analysis software package provided by Pacific Biosciences (SMRT Link v6.0.0) is not able to perform 5mC analysis. Moreover, an earlier version, SMRT Link v5.1.0, required 5mC to be converted to 5-carboxylcytosine (5caC) with the Tet1 enzyme before methylation analysis, because the IPD signal associated with 5caC is enhanced (Clark et al., 2013). It is therefore not surprising that no study has demonstrated the feasibility of using single-molecule real-time sequencing to analyze native DNA of the human genome in a genome-wide manner.

II. Measurement Window Patterns and Machine Learning Models
Techniques that detect modifications of bases without enzymatic or chemical conversion of the modifications and/or bases are desirable. As described herein, a modification of a target base can be detected using kinetic feature data obtained from single-molecule real-time sequencing of the bases surrounding the target base. The kinetic features may include interpulse duration, pulse width, and sequence context. These kinetic features can be obtained for a measurement window spanning a certain number of nucleotides upstream and downstream of the target base. These features (e.g., at particular locations in the measurement window) can be used to train a machine learning model. As an example of sample preparation, the two strands of a DNA molecule can be joined by hairpin adapters, thereby forming a circular DNA molecule. The circular DNA molecule allows kinetic features to be obtained for either or both of the Watson and Crick strands. A data analysis framework can be developed on the basis of the kinetic features in the measurement window. This data analysis framework can then be used to detect modifications, including methylation. This section describes various techniques for detecting modifications.

A. Use of a Single Strand
As shown in FIG. 4, as an example, subreads of the Watson strand were obtained from Pacific Biosciences SMRT sequencing to analyze one particular base with respect to its base modification status. In FIG. 4, the three bases on each side of the base subjected to base modification analysis, together with that base, are defined as measurement window 400. In one embodiment, the sequence context, IPD, and PW for these seven bases (i.e., the sequences 3 nucleotides (nt) upstream and downstream plus the 1 nucleotide under base modification analysis) were compiled into a two-dimensional (2D) matrix as the measurement window. In the example shown, measurement window 400 is for one subread of the Watson strand. Other variations are described herein.

The first row 402 of the matrix shows the sequence under study. In the second row 404 of the matrix, position 0 represents the base subjected to base modification analysis. Relative positions -1, -2, and -3 indicate the positions 1 nt, 2 nt, and 3 nt upstream of that base, respectively. Relative positions +1, +2, and +3 indicate the positions 1 nt, 2 nt, and 3 nt downstream of that base, respectively. Each position includes two columns containing the corresponding IPD and PW values. The next four rows (rows 408, 412, 416, and 420) correspond to the four types of nucleotides (A, C, G, and T) of the strand (e.g., the Watson strand), respectively. Where the IPD and PW values appear in the matrix depends on which nucleotide type was sequenced at a given position. As shown in FIG. 4, at relative position 0 the IPD and PW values appear in the row representing "G" of the Watson strand, indicating that guanine was called at that position in the sequencing result. The other cells of a column, which do not correspond to the sequenced base, are coded as "0". As an example, the sequence information corresponding to the 2D digital matrix (FIG. 4) is 5'-GATGACT-3' for the Watson strand.
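
The layout described above can be sketched as follows. This is a minimal illustration that assumes the matrix is held as four rows (A, C, G, T) with an IPD column and a PW column per window position; the function name `window_matrix` and the kinetic values are hypothetical.

```python
from typing import List, Sequence

BASES = "ACGT"  # row order of the matrix: A, C, G, T


def window_matrix(seq: str, ipd: Sequence[float], pw: Sequence[float]) -> List[List[float]]:
    """Build a 4 x (2 * len(seq)) matrix; cells for non-sequenced bases stay 0."""
    if not (len(seq) == len(ipd) == len(pw)):
        raise ValueError("seq, ipd and pw must have the same length")
    matrix = [[0.0] * (2 * len(seq)) for _ in BASES]
    for j, base in enumerate(seq):
        row = BASES.index(base)        # raises ValueError for non-ACGT symbols
        matrix[row][2 * j] = ipd[j]    # IPD column of window position j
        matrix[row][2 * j + 1] = pw[j] # PW column of window position j
    return matrix


if __name__ == "__main__":
    # Window from FIG. 4; the kinetic values below are made up for illustration.
    seq = "GATGACT"
    ipd = [0.8, 1.1, 0.4, 2.3, 0.9, 0.7, 1.0]
    pw = [0.2, 0.3, 0.2, 0.5, 0.3, 0.2, 0.4]
    for base, row in zip(BASES, window_matrix(seq, ipd, pw)):
        print(base, row)
```

With a 3-nt flank on each side, relative position 0 is index 3 of the window, so its IPD and PW land in columns 6 and 7 of the "G" row.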

As shown in one embodiment illustrated in FIG. 5, the measurement window can be applied to data from the Crick strand. Subreads of the Crick strand were obtained from single-molecule real-time sequencing to analyze one particular base with respect to its base modification status. In FIG. 5, the three bases on each side of the base subjected to base modification analysis, together with that base, are defined as the measurement window. In one embodiment, the sequence context, IPD, and PW for these seven bases (i.e., the sequences 3 nucleotides (nt) upstream and downstream plus the 1 nucleotide under base modification analysis) were compiled into a two-dimensional (2D) matrix as the measurement window. The first row of the matrix shows the sequence under study. In the second row of the matrix, position 0 represents the base subjected to base modification analysis. Relative positions -1, -2, and -3 indicate the positions 1 nt, 2 nt, and 3 nt upstream of that base, respectively. Relative positions +1, +2, and +3 indicate the positions 1 nt, 2 nt, and 3 nt downstream of that base, respectively. Each position includes two columns containing the corresponding IPD and PW values. The next four rows correspond to the four types of nucleotides (A, C, G, and T) of this strand (e.g., the Crick strand). Where the IPD and PW values appear in the matrix depends on which nucleotide type was sequenced at a given position. As shown in FIG. 5, at relative position 0 the IPD and PW values appear in the row representing "T" of the Crick strand, indicating that thymine was called at that position in the sequencing result. The other cells of a column, which do not correspond to the sequenced base, are coded as "0". As an example, the sequence information corresponding to the 2D digital matrix (FIG. 5) is 5'-ACTTAGC-3' for the Crick strand.

B. Use of Both the Watson and Crick Strands
FIG. 6 shows an embodiment in which the measurement window is implemented so that data from the Watson strand and its complementary Crick strand can be combined. As shown in FIG. 6, subreads of the Watson and Crick strands were obtained from single-molecule real-time sequencing and analyzed for a modification of one particular base. In one embodiment, the measurement window from the Crick strand of the circular DNA template was complementary to the measurement window from the Watson strand subjected to base modification analysis. In FIG. 6, a first base of the Watson strand subjected to base modification analysis, together with the three bases on each side of it, is defined as the first measurement window. A second base of the Crick strand, together with the three bases on each side of it, is defined as the second measurement window. The second base is complementary to the first base. In one embodiment, the sequence context, IPD, and PW for these seven bases from each of the Watson and Crick strands (i.e., the sequences 3 nucleotides (nt) upstream and downstream plus the 1 nucleotide under base modification analysis) were compiled into two-dimensional (2D) matrices. The measurement windows from the Watson and Crick strands were regarded as the first and second measurement windows, respectively.

The first row of the matrices for the Watson and Crick strands shows the sequences under study. In the second row of the Watson strand matrix, position 0 represents the first base, which is subjected to base modification analysis. Position 0 in the second row of the Crick strand matrix represents the second base, which is complementary to the first base. Relative positions -1, -2, and -3 indicate the positions 1 nt, 2 nt, and 3 nt upstream of the first and second bases, respectively. Relative positions +1, +2, and +3 indicate the positions 1 nt, 2 nt, and 3 nt downstream of the first and second bases, respectively. Each position from the Watson and Crick strands corresponds to two columns containing the corresponding IPD and PW values. The next four rows of the Watson and Crick strand matrices correspond to the four types of nucleotides (A, C, G, and T) of the respective strand (e.g., the Crick strand). Where the IPD and PW values appear in a matrix depends on which nucleotide type was sequenced at a given position.

As shown in FIG. 6, at relative position 0 the IPD and PW values appear in the rows representing "A" of the Watson strand and "T" of the Crick strand, indicating that adenine and thymine were called at that position in the sequencing results of the Watson and Crick strands, respectively. The other cells of a column, which do not correspond to the sequenced base, are coded as "0". As an example, the sequence information corresponding to the 2D digital matrix of the Watson strand (FIG. 6) would be 5'-ATAAGTT-3'. The sequence information corresponding to the 2D digital matrix of the Crick strand (FIG. 6) would be 5'-AACTTAT-3'.

As shown in this example, data from the Watson and Crick strands can be combined to form a new matrix, which can itself be regarded as a measurement window. This new matrix can be used as a single sample for training a machine learning model. Thus, all values of the new matrix can be treated as separate features, although their particular placement in the 2D matrix can matter, for example when a convolutional neural network (CNN) is used. The sequence context at various positions of the different strands can be conveyed through the non-zero entries of the matrix.
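
As a rough sketch of how the two windows relate and could be combined: the Crick window in FIG. 6 is the reverse complement of the Watson window, and the two per-strand matrices can be stacked into one sample. Stacking the Watson rows above the Crick rows is an assumption made here for illustration, and the helper names are hypothetical.

```python
from typing import List

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
Matrix = List[List[float]]


def reverse_complement(seq: str) -> str:
    """Return the complementary strand read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def combine_windows(watson: Matrix, crick: Matrix) -> Matrix:
    """Stack the Watson rows above the Crick rows (assumed layout)."""
    width = len(watson[0])
    if any(len(row) != width for row in watson + crick):
        raise ValueError("all rows must have the same number of columns")
    return [list(row) for row in watson] + [list(row) for row in crick]


def flatten(matrix: Matrix) -> List[float]:
    """Treat every cell of the combined matrix as a separate feature."""
    return [value for row in matrix for value in row]


if __name__ == "__main__":
    print(reverse_complement("ATAAGTT"))  # Watson window of FIG. 6
    watson = [[0.0, 0.0], [1.2, 0.3], [0.0, 0.0], [0.0, 0.0]]  # toy 1-position windows
    crick = [[0.0, 0.0], [0.0, 0.0], [0.9, 0.4], [0.0, 0.0]]
    print(flatten(combine_windows(watson, crick)))
```

A flattened vector suits models that ignore spatial arrangement, while a CNN would consume the stacked 2D form directly.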

FIG. 7 shows that the measurement window can be implemented such that the data from the Watson and Crick strands are not at exactly complementary positions. As shown in FIG. 7, the first measurement window was 5'-ATAAGTT-3' and the second measurement window was 5'-GTAACGC-3'. In some embodiments, the Watson and Crick windows may be shifted relative to each other so that their positions are not complementary.

FIG. 8 shows that measurement windows can be used to analyze the methylation status of a CpG site. Position 0 corresponds to the cytosine of the CpG site on each strand, so the windows are shifted by one position between the two strands and C is at position 0 for both strands. Therefore, only part of the sequences in the measurement windows from the Watson and Crick strands are complementary to each other. In other embodiments, all of the sequences in the measurement windows from the Watson and Crick strands may be complementary to each other. In still other embodiments, none of the sequences in the measurement windows from the Watson and Crick strands are complementary to each other.
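
The one-position shift can be made concrete with a small sketch. It assumes the Watson strand is given 5' to 3' as a string and that `i` is the index of the cytosine of a CpG on that strand; the function names and the example sequence are hypothetical.

```python
from typing import Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def cpg_windows(watson: str, i: int, flank: int = 3) -> Tuple[str, str]:
    """Return (Watson window, Crick window), each centered on its own CpG cytosine."""
    if watson[i:i + 2] != "CG":
        raise ValueError("index i does not point at the C of a CpG")
    j = i + 1  # Watson G; it pairs with the CpG cytosine on the Crick strand
    if i - flank < 0 or j + flank + 1 > len(watson):
        raise ValueError("window runs past the end of the sequence")
    watson_window = watson[i - flank:i + flank + 1]
    crick_window = reverse_complement(watson[j - flank:j + flank + 1])
    return watson_window, crick_window


if __name__ == "__main__":
    w, c = cpg_windows("GATACGTTCA", 4)
    print(w, c)  # both windows have C at their center (index 3)
```

The two windows cover overlapping but not identical stretches of the duplex, which is why only part of their sequences are complementary.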

In one embodiment, for a measurement window, the lengths of the DNA stretches flanking the base subjected to base modification analysis may be asymmetric. For example, X nt upstream and Y nt downstream of that base can be used for base modification analysis. X may include, but is not limited to, each integer from 0 to 50, as well as 100, 150, 200, 300, 400, 500, 1000, 2000, 4000, 5000, and 10000. Y may include, but is not limited to, each integer from 0 to 50, as well as 100, 150, 200, 300, 400, 500, 1000, 2000, 4000, 5000, and 10000.
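
An asymmetric window amounts to a slice with different upstream and downstream lengths. The helper below is a hypothetical sketch that works on any per-base sequence (bases, IPD values, or PW values).

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def asymmetric_window(values: Sequence[T], center: int, x: int, y: int) -> Sequence[T]:
    """Return X items upstream of `center`, the center item, and Y items downstream."""
    start, end = center - x, center + y + 1
    if x < 0 or y < 0 or start < 0 or end > len(values):
        raise IndexError("window does not fit inside the sequence")
    return values[start:end]


if __name__ == "__main__":
    print(asymmetric_window("GATACGTTCA", center=4, x=2, y=4))
```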

C. Model Training and Detection of Modifications
FIG. 9 shows the general procedure for determining any base modification using measurement windows. DNA samples known to be unmodified or modified were subjected to single-molecule real-time sequencing. Modified DNA (e.g., modified molecule 902) means that a base (e.g., base 904) carries a modification (e.g., methylation) at that site. Unmodified DNA (e.g., unmodified molecule 906) means that a base (e.g., base 908) carries no modification at that site. Both sets of DNA can be artificially created or processed to form the modified/unmodified DNA.

At stage 910, the samples can then undergo single-molecule real-time sequencing. As part of SMRT sequencing, a circular molecule can be sequenced multiple times by repeatedly passing through the immobilized DNA polymerase. The sequence information obtained on each pass is regarded as a subread. Thus, one circular DNA template generates multiple subreads. The sequencing subreads can be aligned to a reference genome using, for example but not limited to, BLASR (Chaisson MJ et al, BMC Bioinformatics. 2012;13:238). In various other embodiments, BLAST (Altschul SF et al, J Mol Biol. 1990;215(3):403-410), BLAT (Kent WJ, Genome Res. 2002;12(4):656-664), BWA (Li H et al, Bioinformatics. 2010;26(5):589-595), NGMLR (Sedlazeck FJ et al, Nat Methods. 2018;15(6):461-468), LAST (Kielbasa SM et al, Genome Res. 2011;21(3):487-493), and Minimap2 (Li H, Bioinformatics. 2018;34(18):3094-3100) can be used to align subreads to a reference genome. Alignment identifies the data of each subread at the same position, so that data from multiple subreads can be combined (e.g., averaged).
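
Combining subreads by averaging can be sketched as below. The sketch assumes each aligned subread has already been reduced to a mapping from reference position to an (IPD, PW) pair; that representation and the function name are assumptions for illustration, not the output format of any aligner named above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Kinetics = Tuple[float, float]  # (IPD, PW) observed at one reference position


def mean_kinetics(subreads: Iterable[Dict[int, Kinetics]]) -> Dict[int, Kinetics]:
    """Average IPD and PW per reference position over all subreads covering it."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # position -> [sum IPD, sum PW, count]
    for subread in subreads:
        for position, (ipd, pw) in subread.items():
            entry = totals[position]
            entry[0] += ipd
            entry[1] += pw
            entry[2] += 1
    return {pos: (s_ipd / n, s_pw / n) for pos, (s_ipd, s_pw, n) in totals.items()}


if __name__ == "__main__":
    subreads = [
        {100: (1.0, 0.5), 101: (2.0, 0.25)},  # first pass (made-up values)
        {100: (3.0, 1.5)},                    # second pass covers position 100 only
    ]
    print(mean_kinetics(subreads))
```

Other summaries, such as a median, could be substituted where outlier pulses are a concern.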

At stage 912, the IPD, PW, and sequence context surrounding the base subjected to base modification analysis were obtained from the alignment results. At stage 914, the IPD, PW, and sequence context were recorded in a particular structure, for example but not limited to a 2D matrix as shown in FIG. 9.

At stage 916, a number of 2D matrices containing reference kinetic patterns derived from molecules with known base modifications were used to train analytical, computational, mathematical, or statistical model(s). At stage 918, the statistical model resulting from the training is developed. For simplicity, FIG. 9 shows only a statistical model developed by training, but any model or data analysis framework can be developed. Examples of data analysis frameworks include machine learning models, statistical models, and mathematical models. Statistical models include, but are not limited to, linear regression, logistic regression, deep recurrent neural networks (e.g., long short-term memory, LSTM), Bayesian classifiers, hidden Markov models (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithms, and support vector machines (SVM). The DNA stretch surrounding the base subjected to base modification analysis may be X nt upstream and Y nt downstream of that base, i.e., the "measurement window".

Because the correct output (i.e., the modification status) is known, the data structures can be used in a training process. For example, the IPD, PW, and sequence context corresponding to the 3 nt upstream and downstream of a base from the Watson and/or Crick strand(s) can be used to construct the 2D matrices used for training statistical model(s) that classify base modifications. In this manner, the training can provide a model that can classify a base modification at a position of a nucleic acid having a previously known status.
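
As one hedged illustration of such training, the sketch below fits a logistic regression (one of the model types listed above) by batch gradient descent on flattened window features. The two-feature toy data, the learning rate, and the function names are invented for illustration and do not come from the disclosure.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that the window described by features x is modified."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                   lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of the loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    # Toy features (IPD, PW at position 0); label 1 = modified, 0 = unmodified.
    X = [[2.0, 1.0], [2.2, 1.1], [1.8, 0.9], [0.5, 1.0], [0.6, 1.1], [0.4, 0.9]]
    y = [1, 1, 1, 0, 0, 0]
    w, b = train_logistic(X, y)
    print([round(predict_proba(w, b, x), 3) for x in X])
```

In practice the feature vector would be the full flattened measurement window rather than two values, and a held-out set would be needed to judge accuracy.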

FIG. 10 shows the general procedure by which statistical model(s) learned from DNA samples with known base modification status can detect base modifications. A sample with unknown base modification status was subjected to SMRT sequencing. The sequencing subreads were aligned to a reference genome, for example using the techniques described above. Additionally or alternatively, the subreads can be aligned to one another. Still other embodiments can use just one subread, or analyze the subreads independently, so that no alignment is performed.

For a base subjected to base modification analysis, the IPD, PW, and sequence context were obtained from the Watson and/or Crick strand(s) in the alignment results, using a measurement window comparable to the one used in the training step (FIG. 9), and were associated with that base. In another embodiment, the measurement windows used in the training and testing procedures may differ. For example, the size of the measurement window may differ between the training and testing procedures. These IPDs, PWs, and sequence contexts are converted into a 2D matrix. Such a 2D matrix of the test sample is compared with reference kinetic features to determine the base modification. For example, the 2D matrix of the test sample can be compared with the reference kinetic features through the statistical model(s) learned from the training samples, so that the base modification at a site of a nucleic acid molecule in the test sample can be determined. Statistical models include, but are not limited to, linear regression, logistic regression, deep recurrent neural networks (e.g., long short-term memory, LSTM), Bayesian classifiers, hidden Markov models (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithms, and support vector machines (SVM).

FIG. 11 shows the general procedure by which a method for classifying the methylation status at CpG sites can be developed. DNA samples known to be unmethylated or methylated at CpG sites were subjected to single-molecule real-time sequencing. The sequencing subreads were aligned to a reference genome. Data from the Watson strand were used.

From the alignment results, the IPD, PW, and sequence context surrounding the cytosine at each CpG site subjected to methylation analysis were obtained and recorded in a specific structure, for example, but not limited to, a 2D matrix as shown in FIG. 11. A number of 2D matrices containing reference kinetic patterns derived from molecules with known methylation status were used to train the statistical model(s). The stretch of DNA surrounding the base under investigation, the "measurement window," can span X nt upstream and Y nt downstream of that base. X can include, but is not limited to, any integer from 0 to 50, as well as 100, 150, 200, 300, 400, 500, 1000, 2000, 4000, 5000, and 10000. Y can include, but is not limited to, the same values. In one embodiment, the IPD, PW, and sequence context corresponding to the 3 nt upstream and downstream of the base on the Watson strand can be used to construct the 2D matrices that are used to train the statistical model(s) for classifying base modifications.
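The measurement window described above can be sketched as a slice over per-base values. This is a minimal illustration, not the implementation used in the disclosure; the function name `extract_window`, the toy IPD values, and the choice to return `None` when the window runs off the read are assumptions.

```python
def extract_window(values, center, upstream, downstream):
    """Return the values from `upstream` nt before to `downstream` nt after `center`.

    Returns None when the window would extend past either end of the read
    (an assumed policy; padding would be another option).
    """
    start = center - upstream
    end = center + downstream + 1
    if start < 0 or end > len(values):
        return None
    return values[start:end]


# Toy per-base IPD values for a 10-nt read; index 4 is the CpG cytosine.
ipd = [0.8, 1.1, 0.9, 1.4, 2.3, 1.0, 0.7, 1.2, 0.6, 0.9]

# Symmetric window: X = Y = 3, i.e. 7 nucleotides in total.
window = extract_window(ipd, center=4, upstream=3, downstream=3)

# Asymmetric window: X = 2, Y = 5.
asymmetric = extract_window(ipd, center=4, upstream=2, downstream=5)
```

The same slice would be applied to the PW values and the base calls so that all three features cover identical positions.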

図１２は、未知の試料のメチル化状態を分類する一般的な手順を示す。メチル化状態が未知の試料を、単一分子リアルタイム配列決定にかけた。配列決定サブリードを、参照ゲノムに整列した。 Figure 12 shows the general procedure for classifying the methylation status of unknown samples. Samples with unknown methylation status were subjected to single-molecule real-time sequencing. Sequencing subreads were aligned to the reference genome.

For the cytosines of CG sites in the alignment results, the IPD, PW, and sequence context were obtained from the Watson strand, using a measurement window equivalent to the one applied in the training step (FIG. 11) and associated with the base whose modification is under investigation. These IPD, PW, and sequence context values can be converted into a 2D matrix. Such a 2D matrix of the test sample is then compared with the reference kinetic patterns shown in FIG. 11 to determine the methylation status.

FIGS. 13 and 14 show that, as with the Watson-strand procedure, kinetic features from the Crick strand can be used for the training and testing procedures detailed above. The statistical model(s) for the two strands can be the same model or different models. With different models, independent classifications can be obtained and compared: for example, if they agree, the modification status is identified, and if they disagree, an unclassified status can be reported. With the same model, the data can be combined into a single data structure, e.g., the matrix of FIG. 6.
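The concordance rule for two independent strand-specific models can be written in a few lines. The sketch below only illustrates the logic stated above; the function name and the string labels are assumptions.

```python
def combine_strand_calls(watson_call, crick_call):
    """Report a modification status only when both strand models agree."""
    if watson_call == crick_call:
        return watson_call
    return "unclassified"


concordant = combine_strand_calls("methylated", "methylated")
discordant = combine_strand_calls("methylated", "unmethylated")
```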

図１５および図１６は、ワトソン鎖およびクリック鎖の両方からの動態特徴が、上で詳述したように、訓練手順および試験手順のために使用され得ることを示す。ＣｐＧ部位で非メチル化およびメチル化が既知のＤＮＡ試料を、単一分子リアルタイム配列決定にかけた。配列決定のサブリードを、参照ゲノムに整列したが、サブリードを相互に整列することも可能であり、本明細書に記載の他の方法で行うことができる。 Figures 15 and 16 show that kinetic features from both Watson and Crick strands can be used for training and testing procedures, as detailed above. DNA samples with known unmethylation and methylation at CpG sites were subjected to single-molecule real-time sequencing. Sequencing subreads were aligned to the reference genome, but it is also possible to align subreads to each other and can be done in other ways as described herein.

For the subreads in the alignment results, the IPD, PW, and sequence context surrounding the cytosine of each CpG site subjected to methylation analysis were obtained. Because the DNA molecule is circularized through the use of two hairpin adapters (e.g., following the SMRTbell template preparation protocol), the circular molecule can be sequenced more than once, which generates multiple subreads of the molecule. The subreads can be used to generate a circular consensus sequence (CCS) read. In general, for all methods described herein, one ZMW can generate multiple subreads but corresponds to only one CCS read.

In some embodiments, a fully unmethylated dataset can be created by PCR of human DNA fragments. A fully methylated dataset can be generated, for example, from human DNA fragments treated with the CpG methyltransferase M.SssI, after which all CpG sites are assumed to be methylated. In other examples, another CpG methyltransferase, such as M.MpeI, can be used. In other embodiments, synthetic sequences with known methylation status, existing DNA samples with different methylation levels, or hybrid methylation states created by restriction enzyme digestion of methylated and unmethylated DNA molecules followed by ligation (which yields a proportion of chimeric methylated/unmethylated DNA molecules) can be used to train a methylation prediction model or classifier.

The kinetic patterns, including sequence context, IPD, and pulse width (PW), can be transformed into a 2D matrix containing features from both the Watson and Crick strands for analyzing the methylation status of CG sites, as shown in FIG. 15. This approach made it possible to accurately capture the subtle kinetic changes caused by a methylated cytosine and its nearby sequence context. As with any of the various methods described herein, for each CpG present in a subread, a measurement window (e.g., 3 bases upstream and downstream of the cytosine of the CpG site) can be used for subsequent analysis, so that a total of 7 nucleotides (including the cytosine of the CpG site) are analyzed together. The IPD and PW can be calculated for each base among those 7 nucleotides. To capture the kinetic changes attributable to sequence context, the IPD and PW signals can be compiled by base call, relative sequencing position, and strand, as shown in FIG. 15. Such a data structure is referred to, for simplicity, as a 2D digital matrix of kinetics.

このような２Ｄデジタルマトリックスは、「２Ｄデジタル画像」に類似している。例えば、２Ｄデジタルマトリックスの最初の行には、メチル化分析にかけられたＣｐＧ遺伝子座のシトシンを取り巻く相対位置とともにそのシトシン部位の３ｎｔ上流および下流が含有された。０の位置は、メチル化が決定されるシトシン部位を表している。－１および－２の相対位置は、問題のシトシンの１ｎｔ上流および２ｎｔ上流を示していた。＋１および＋２の相対位置は、使用されるシトシンの１ｎｔ下流および２ｎｔ下流を示している。各位置は、対応するＩＰＤ値およびＰＷ値を含有する２つの列に対応するであろう。各行は、ワトソン鎖およびクリック鎖の４種類のヌクレオチド（Ａ、Ｃ、Ｇ、およびＴ）に対応していた。マトリックス内のＩＰＤ値およびＰＷ値の入力は、特定の位置で配列結果（すなわち、サブリード）に事前設定された対応するヌクレオチドの種類によって異なる。 Such a 2D digital matrix is analogous to a "2D digital image". For example, the first row of the 2D digital matrix contained 3nt upstream and downstream of the cytosine site along with the relative positions surrounding the cytosine of the CpG locus subjected to methylation analysis. Position 0 represents the cytosine site at which methylation is to be determined. Relative positions of -1 and -2 indicated 1nt upstream and 2nt upstream of the cytosine in question. The +1 and +2 relative positions indicate 1 nt downstream and 2 nt downstream of the cytosine used. Each position will correspond to two columns containing the corresponding IPD and PW values. Each row corresponded to four nucleotides (A, C, G, and T) of the Watson and Crick strands. The entry of IPD and PW values in the matrix depends on the type of corresponding nucleotide preset to the sequence result (ie, subread) at a particular position.

As shown in FIG. 15, at relative position 0 the IPD and PW values appear in the "C" row of the Watson strand, indicating that a cytosine was called at that position. The other cells in those columns, which do not correspond to the sequenced base, are coded as "0". As an example, the sequence information corresponding to the 2D digital matrix (FIG. 15) is 5'-ATACGTT-3' for the Watson strand and 5'-TAACGTA-3' for the Crick strand. In this context, the sequences flanking the cytosine of the CpG site upstream and downstream differ between the Watson and Crick strands. Because methylation at CpG sites is symmetric between the Watson and Crick strands (Lister et al., 2009), in one preferred embodiment the kinetics of both strands were used to train the methylation prediction model. In another embodiment, the Watson and Crick strands can be used separately to train methylation prediction models.
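The encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions: the row order (Watson A, C, G, T, then Crick A, C, G, T), the use of exactly two columns (IPD, PW) per position, the function name `build_kinetic_matrix`, and all IPD and PW numbers are assumptions for illustration; only the two 7-nt sequences come from the text.

```python
BASES = "ACGT"


def build_kinetic_matrix(watson, crick):
    """Encode per-position (base, ipd, pw) tuples as an 8-row matrix.

    Rows 0-3 hold Watson A/C/G/T and rows 4-7 hold Crick A/C/G/T.
    Each window position owns two adjacent columns: IPD, then PW.
    Cells that do not match the called base stay 0.
    """
    n_pos = len(watson)
    matrix = [[0.0] * (2 * n_pos) for _ in range(2 * len(BASES))]
    for strand_offset, strand in ((0, watson), (len(BASES), crick)):
        for pos, (base, ipd, pw) in enumerate(strand):
            row = strand_offset + BASES.index(base)
            matrix[row][2 * pos] = ipd
            matrix[row][2 * pos + 1] = pw
    return matrix


# Sequences from the text; the kinetic values are made up.
watson = list(zip("ATACGTT",
                  [0.9, 1.2, 0.8, 2.5, 1.1, 0.7, 1.0],
                  [0.3, 0.4, 0.2, 0.6, 0.5, 0.3, 0.4]))
crick = list(zip("TAACGTA",
                 [1.0, 0.8, 1.1, 2.2, 0.9, 1.3, 0.7],
                 [0.4, 0.3, 0.5, 0.7, 0.2, 0.4, 0.3]))
matrix = build_kinetic_matrix(watson, crick)
```

With a 3-nt window on each side, relative position 0 is the fourth position, so its IPD and PW land in columns 6 and 7 of the "C" row of each strand.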

Given the high data throughput of single-molecule real-time sequencing, in one embodiment a deep learning algorithm, the convolutional neural network (CNN) (LeCun et al., 1989), can be suitable for distinguishing methylated CpGs from unmethylated CpGs. Other algorithms can be used additionally or alternatively, including, but not limited to, linear regression, logistic regression, deep recurrent neural networks (e.g., long short-term memory, LSTM), Bayes classifiers, hidden Markov models (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithms, and support vector machines (SVM). As described for FIGS. 6-8, training can use the Watson and Crick strands separately or combined in a new matrix.

動態パターンの別の変換は、Ｎ次元マトリックスであり得る。Ｎは、例えば、１、３、４、５、６、および７であり得る。例えば、３Ｄマトリックスは、分析対象のＤＮＡストレッチのタンデムＣＧ部位の数に従って階層化された２Ｄマトリックスの積み重ねであり、第３の次元は、そのＤＮＡストレッチのタンデムＣＧ部位の数になる。一部の実施形態では、パルス強度またはパルスの大きさ（例えば、パルスのピークの高さによって、またはパルス信号下面積によって測定される）も、マトリックスに組み込まれることがある。パルス強度（パルスピークの振幅のメトリック、図３）は、元の２Ｄマトリックスの上のＰＷ値およびＩＰＤ値に関連する列に隣接する追加の列に加えられるか、または第３の次元に加えられるかのいずれかで、３Ｄマトリックスを形成することができる。 Another transform of kinetic patterns can be an N-dimensional matrix. N can be 1, 3, 4, 5, 6, and 7, for example. For example, a 3D matrix is a stack of 2D matrices layered according to the number of tandem CG sites in the DNA stretch being analyzed, with the third dimension being the number of tandem CG sites in that DNA stretch. In some embodiments, the pulse intensity or pulse magnitude (eg, measured by the peak height of the pulse or by the area under the pulse signal) may also be incorporated into the matrix. The pulse intensity (pulse peak amplitude metric, FIG. 3) is added to an additional column adjacent to the columns associated with the PW and IPD values above the original 2D matrix, or added to the third dimension. A 3D matrix can be formed by either:

さらなる例として、８（行）ｘ２１（列）の２Ｄマトリックスは、１６８個の要素を含む１Ｄマトリックス（すなわち、ベクトル）に変換することができる。また、この１Ｄマトリックスをスキャンして、例えば、ＣＮＮおよびその他のモデリングを実施することができる。別の例として、方法は、８ｘ２１の２Ｄマトリックスを、複数の小さなマトリックス、例えば、２つの４ｘ２１の２Ｄマトリックスに分割することできる。これらの２つの小さなマトリックスを垂直方向に組み合わせると、３Ｄマトリックス（すなわち、ｘ＝２１、ｙ＝４、ｚ＝２）が得られる。方法は、第１の２Ｄマトリックスをスキャンし、次いで第２の２Ｄマトリックスをスキャンして、機械学習のためのデータ表示を形成することができる。データをさらに分割して、より高次元のマトリックスを形成することができる。さらに、二次構造情報を、データ構造に追加することができ、例えば、２Ｄマトリックスの上に追加のマトリックス（１Ｄマトリックス）を加えることができる。このような追加のマトリックスは、測定ウィンドウ内の各塩基が二次構造（例えば、ステム・ループ構造）に関与するかどうかをコード化することができる。例えば、「ステム」に関与する塩基は、０としてコード化され、「ループ」に関与する塩基は、１としてコード化される。 As a further example, a 2D matrix of 8 (rows) by 21 (columns) can be transformed into a 1D matrix (ie, vector) containing 168 elements. Also, this 1D matrix can be scanned to perform, for example, CNN and other modeling. As another example, the method can divide an 8x21 2D matrix into multiple smaller matrices, eg, two 4x21 2D matrices. Combining these two smaller matrices vertically yields a 3D matrix (ie, x=21, y=4, z=2). The method can scan a first 2D matrix and then a second 2D matrix to form a data representation for machine learning. The data can be further partitioned to form higher dimensional matrices. Additionally, secondary structure information can be added to the data structure, for example, additional matrices (1D matrices) can be added on top of the 2D matrices. Such additional matrices can encode whether each base within the measurement window participates in secondary structure (eg, stem-loop structure). For example, bases involved in a "stem" are coded as 0's and bases involved in a "loop" are coded as 1's.
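The two reshaping examples above (8 x 21 to a 168-element vector, and 8 x 21 to two stacked 4 x 21 matrices) can be sketched with plain lists. The helper names and the placeholder cell values are assumptions for illustration.

```python
def flatten(matrix):
    """Row-major flattening of a 2D matrix into a 1D vector."""
    return [value for row in matrix for value in row]


def split_rows(matrix, n_blocks):
    """Split a 2D matrix into n_blocks equal row blocks (a 3D stack)."""
    rows_per_block = len(matrix) // n_blocks
    return [matrix[i * rows_per_block:(i + 1) * rows_per_block]
            for i in range(n_blocks)]


# Placeholder 8 x 21 matrix whose cells record their own row-major index.
matrix = [[r * 21 + c for c in range(21)] for r in range(8)]

vector = flatten(matrix)       # 168 elements
stack = split_rows(matrix, 2)  # z = 2 blocks, each y = 4 rows by x = 21 columns
```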

In one embodiment, the methylation status of a CpG site within a single DNA molecule can be expressed as a probability of being methylated, based on the statistical model, rather than as a qualitative "methylated" or "unmethylated" result. A probability of 1 indicates that, based on the statistical model, the CpG site can be regarded as methylated. A probability of 0 indicates that, based on the statistical model, the CpG site can be regarded as unmethylated. In subsequent downstream analysis, a cutoff value can be used to classify a particular CpG site as "methylated" or "unmethylated" based on its probability. Possible cutoff values include 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95%. A CpG site whose probability of being methylated is greater than the predetermined cutoff is classified as "methylated", and a CpG site whose probability is not greater than the predetermined cutoff is classified as "unmethylated". The desired cutoff can be obtained from the training dataset using, for example, receiver operating characteristic (ROC) curve analysis.
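The cutoff rule, and one possible way of picking a cutoff from labeled training data, can be sketched as below. The text only says that ROC curve analysis can be used; maximizing Youden's J (sensitivity + specificity - 1) is an assumed criterion chosen for illustration, and the probabilities and labels are made up.

```python
def classify(prob, cutoff=0.5):
    """'methylated' only when the probability is strictly greater than the cutoff."""
    return "methylated" if prob > cutoff else "unmethylated"


def youden_cutoff(probs, labels, candidates):
    """Return the first candidate cutoff that maximizes sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in candidates:
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > cutoff)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p <= cutoff)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff


# Toy training data: 1 = methylated library, 0 = unmethylated library.
probs = [0.05, 0.20, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 1, 1]
candidates = [i / 20 for i in range(1, 20)]  # 5%, 10%, ..., 95%
cutoff = youden_cutoff(probs, labels, candidates)
calls = [classify(p, cutoff) for p in probs]
```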

図１６は、ワトソン鎖およびクリック鎖からの未知の試料のメチル化状態を分類する一般的な手順を示している。メチル化状態が未知の試料は、単一分子リアルタイム配列決定にかけられた。配列決定サブリードは、他の方法と同様に、参照ゲノムまたは互いに整列して、所与の位置のコンセンサス値（平均値、中央値、モード、またはその他の統計値）を決定することができる。示されるように、２本の鎖についての測定値を、単一の２Ｄマトリックスに組み合わせることができる。 Figure 16 shows the general procedure for classifying the methylation status of unknown samples from Watson and Crick strands. Samples with unknown methylation status were subjected to single-molecule real-time sequencing. Sequencing subreads can be aligned to the reference genome or to each other, as well as other methods, to determine the consensus value (mean, median, mode, or other statistic) for a given position. As shown, the measurements for the two strands can be combined into a single 2D matrix.
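A per-position consensus value across subreads can be sketched with the standard library. The sketch assumes the subreads have already been aligned so that index i refers to the same position in every subread; the function name and the IPD values are illustrative.

```python
from statistics import mean, median


def consensus_kinetics(subread_values, statistic=mean):
    """Collapse aligned per-position values from several subreads into one value per position."""
    return [statistic(column) for column in zip(*subread_values)]


# Three aligned subreads of the same molecule, three positions each (toy IPDs).
subread_ipds = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 5.0],
    [3.0, 5.0, 4.0],
]
mean_ipd = consensus_kinetics(subread_ipds)
median_ipd = consensus_kinetics(subread_ipds, statistic=median)
```

The median is less sensitive than the mean to a single unusually long IPD, as the second position shows.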

For the cytosines of CG sites in the alignment results, the IPD, PW, and sequence context can be obtained from the Watson strand using a measurement window equivalent to the one applied in the training step (FIG. 16) and associated with the base whose modification is under investigation (3 nt upstream and downstream of the cytosine of the CpG site), although windows of different sizes can be used. Such a 2D matrix of the test sample can be compared with the reference kinetic patterns shown in FIG. 16 to determine the methylation status.

III. Exemplary Model Training for Detecting Methylation
To test the feasibility and validity of the proposed approach, placental DNA libraries were prepared using M.SssI treatment (methylated library) and PCR amplification (unmethylated library) before single-molecule real-time sequencing. We obtained 44,799,736 and 43,580,452 subreads for the methylated and unmethylated libraries, corresponding to 421,614 and 446,285 circular consensus sequences (CCSs), respectively. Consequently, each molecule was sequenced at a median depth of 34-fold and 32-fold in the methylated and unmethylated libraries, respectively. The dataset was generated from DNA prepared with the Pacific Biosciences Sequel Sequencing Kit 3.0. This kit was developed for use with the original Pacific Biosciences Sequel sequencer. To distinguish the original Sequel from its successor, the Sequel II, the original Sequel is referred to herein as the Sequel I. Accordingly, the Sequel Sequencing Kit 3.0 is referred to herein as the Sequel I Sequencing Kit 3.0. Sequencing kits designed for the Sequel II sequencer include the Sequel II Sequencing Kit 1.0 and the Sequel II Sequencing Kit 2.0, which are also described in this disclosure.

メチル化ライブラリおよび非メチル化ライブラリから生成された配列決定分子の５０％を使用して、統計モデルを訓練した（残りの５０％は検証用に使用した）。この場合、畳み込みニューラルネットワーク（ＣＮＮ）モデルである。一例として、ＣＮＮモデルは、１つ以上の畳み込み層（例えば、１Ｄまたは２Ｄ層）を有し得る。畳み込み層は、１つ以上の異なるフィルターを使用することができ、各フィルターは、特定のマトリックス要素に対してローカルな（例えば、近傍のまたは周囲の）マトリックス値を操作するカーネルを使用し、それによって、特定のマトリックス要素に新しい値を提供する。１つの実装では、２つの１Ｄ畳み込み層を使用した（それぞれ、カーネルサイズが４の１００個のフィルターがある）。フィルターは、個別に適用してから組み合わせることができる（例えば、加重平均で）。得られたマトリックスは、入力マトリックスよりも小さくすることができる。 50% of the sequenced molecules generated from the methylated and unmethylated libraries were used to train the statistical model (the remaining 50% were used for validation). In this case, it is a convolutional neural network (CNN) model. As an example, a CNN model may have one or more convolutional layers (eg, 1D or 2D layers). A convolutional layer can use one or more different filters, each using a kernel that manipulates the local (e.g., nearby or surrounding) matrix values for a particular matrix element, which provides a new value for a particular matrix element. In one implementation, we used two 1D convolutional layers (each with 100 filters with a kernel size of 4). Filters can be applied individually and then combined (eg, in a weighted average). The resulting matrix can be smaller than the input matrix.

The convolutional layers are followed by a ReLU (rectified linear unit) layer and then by a dropout layer with a dropout rate of 0.5. ReLU is an example of an activation function that operates on individual values to obtain a new matrix (image) from the convolutional layer(s). Other activation functions (e.g., sigmoid, softmax, etc.) can also be used. One or more such layers can be used. The dropout layer can be used with the ReLU layer or the max pooling layer and acts as regularization to prevent overfitting. The dropout layer can be used during the training process to ignore different (e.g., random) values during the various iterations of the optimization process (e.g., to reduce a cost/loss function) performed as part of training.

ＲｅＬＵ層の後に、最大プーリング層（例えば、プールサイズ２）を使用することができる。最大プーリング層は、畳み込み層と同様に機能するが、入力とカーネルとの間の内積を得る代わりに、カーネルと重なる入力からの領域の最大値を得ることができる。さらなる畳み込み層（複数可）を使用することができる。例えば、プーリング層からのデータは、別の２つの１Ｄ畳み込み層（例えば、各々、カーネルサイズが２の１２８個のフィルターとそれに続くＲｅＬＵ層を有する）に入力することができ、さらに、ドロップアウト率が０．５のドロップアウト層を使用することができる。プールサイズが２の最大プーリング層を使用した。最後に、全結合層（例えば、１０個のニューロンとそれに続くＲｅＬＵ層を有する）を使用することができる。１つのニューロンを有する出力層の後にシグモイド層を続けることができるため、メチル化の確率が得られる。層、フィルター、カーネルサイズの様々な設定を調整することができる。この訓練データセットでは、メチル化ライブラリおよび非メチル化ライブラリの４６８，５９６および４３２，７６１個のＣｐＧ部位を使用した。 After the ReLU layer, a max pooling layer (eg pool size 2) can be used. A max pooling layer works similarly to a convolutional layer, but instead of taking the dot product between the input and the kernel, we can take the maximum of the regions from the input that overlap the kernel. Additional convolutional layer(s) can be used. For example, the data from the pooling layer can be input to another two 1D convolutional layers (eg, each with 128 filters with a kernel size of 2 followed by a ReLU layer), and the dropout rate A dropout layer of 0.5 can be used. A maximum pooling layer with a pool size of 2 was used. Finally, a fully connected layer (eg, with 10 neurons followed by a ReLU layer) can be used. An output layer with one neuron can be followed by a sigmoid layer, thus obtaining the probability of methylation. Various settings for layers, filters and kernel sizes can be adjusted. 468,596 and 432,761 CpG sites from methylated and unmethylated libraries were used in this training dataset.
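The contrast drawn above between a convolutional layer (inner product with the kernel) and a max pooling layer (maximum over the overlapped region) can be made concrete with a single-channel, pure-Python sketch. It is not the trained model: the absence of padding, a pooling stride equal to the pool size, and the toy signal and kernel are all assumptions.

```python
import math


def conv1d(signal, kernel):
    """Slide the kernel over the signal and take the inner product at each offset (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size=2):
    """Take the maximum of each non-overlapping block of pool_size values."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


def sigmoid(x):
    """Map a real-valued output to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


signal = [1.0, -2.0, 3.0, 0.0, 2.0, -1.0]
kernel = [1.0, -1.0]

convolved = conv1d(signal, kernel)  # output is shorter than the input
activated = relu(convolved)
pooled = max_pool1d(activated, pool_size=2)
```

Both the convolution and the pooling step shorten the sequence, which is why the resulting matrix can be smaller than the input matrix.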

A. Results for Training and Test Datasets
FIG. 17A shows the probability of being methylated for each CpG site of each single DNA molecule in the training dataset. The probability of methylation was much higher in the methylated library than in the unmethylated library. With a cutoff of 0.5 for the probability of being methylated, 94.7% of unmethylated CpG sites were correctly predicted to be unmethylated, and 84.7% of methylated CpG sites were correctly predicted to be methylated.

FIG. 17B shows the performance on the test dataset. The model trained on the training dataset was used to predict the methylation status of 469,729 and 432,024 CpG sites from the methylated and unmethylated libraries, respectively, in an independent test dataset. With a cutoff of 0.5 for the probability of being methylated, 94.0% of unmethylated CpG sites were correctly predicted to be unmethylated, and 84.1% of methylated CpG sites were correctly predicted to be methylated. These results suggested that the new transformation of kinetics combined with sequence context can enable the determination of the methylation status of DNA (e.g., from a human subject).

The ability of each feature (sequence context, IPD, and PW) to predict the methylation status of CpGs was evaluated by including subsets of the features in the model. In the training dataset, models using (i) sequence context only, (ii) IPD only, and (iii) PW only gave area under the curve (AUC) values of 0.5, 0.74, and 0.86, respectively. Combining IPD with sequence context improved performance, giving an AUC of 0.86. The combined analysis of sequence context ("Seq"), IPD, and PW improved performance substantially, giving an AUC of 0.94 (FIG. 18A). Performance on the independent test dataset was comparable to that on the training dataset (FIG. 18B).
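For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen methylated site receives a higher score than a randomly chosen unmethylated site, with ties counted as one half. The pairwise sketch below computes it directly; the function name and the toy scores are illustrative, and the quadratic loop is only practical for small inputs.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


informative = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1])
```

A model that assigns every site the same score gets an AUC of 0.5, which is the value reported above for sequence context alone.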

ＣｐＧ部位のサブリード深度を、その部位とその周囲の１０ｂｐをカバーするサブリードの平均数として定義した。図１９Ａおよび図１９Ｂに示されるように、ＣｐＧ部位のサブリード深度が高いほど、達成されるメチル化の検出の精度が高くなる。例えば、試験データセット（図１９Ｂ）に示されるように、各ＣｐＧ部位の深度が少なくとも１０の場合、メチル化状態を予測するＡＵＣは０．９３になる。しかしながら、各ＣｐＧ部位のサブリード深度が少なくとも３００の場合、メチル化状態を予測するＡＵＣは０．９８である。一方、深度が１の場合でさえ、ＡＵＣが０．９を達成した。これは、本発明者らのアプローチが、低い配列決定深度の使用で、メチル化の予測が達成されることを示している。 The subread depth of a CpG site was defined as the average number of subreads covering the site and its surrounding 10 bp. As shown in FIGS. 19A and 19B, the higher the sub-read depth of the CpG sites, the higher the precision of methylation detection achieved. For example, as shown in the test data set (FIG. 19B), the AUC predictive of methylation status is 0.93 when each CpG site is at least 10 deep. However, with a sub-read depth of at least 300 for each CpG site, the AUC predicting methylation status is 0.98. On the other hand, even with a depth of 1, an AUC of 0.9 was achieved. This indicates that our approach achieves methylation prediction using low sequencing depth.
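The subread depth definition can be sketched as an average of per-position coverage. The text does not say whether "surrounding 10 bp" means 10 bp on each side or 10 bp in total; the sketch assumes 10 bp on each side via a `flank` parameter, and it represents each subread as a half-open (start, end) alignment interval. Both choices, and the toy coordinates, are assumptions.

```python
def subread_depth(subreads, site, flank=10):
    """Average number of subreads covering each position in [site - flank, site + flank]."""
    positions = range(site - flank, site + flank + 1)
    covered = sum(1 for pos in positions
                  for start, end in subreads
                  if start <= pos < end)
    return covered / len(positions)


# Two subreads span the whole window; a third starts exactly at the CpG site.
subreads = [(0, 100), (0, 100), (50, 100)]
depth = subread_depth(subreads, site=50)
```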

To test the effect of strand information on the performance of methylation analysis, the sequence context, IPD, and PW derived from the Watson strand and from the Crick strand were each used for training according to the embodiments of this disclosure. FIGS. 20A and 20B show that using a single strand, i.e., either the Watson or the Crick strand, for training and testing is feasible, since AUCs of up to 0.91 and 0.87 could be achieved in the training and test datasets, respectively. Using both the Watson and Crick strands (e.g., as described for FIGS. 6-8) gave the best performance (AUC: 0.94 and 0.90 in the training and test datasets, respectively), suggesting that strand information is important for achieving optimal performance.

Different numbers of nucleotides upstream and downstream of the CpG site were further tested to study how this parameter affects the performance of the embodiments of this disclosure. FIGS. 21A and 21B show that the number of nucleotides upstream and downstream of the cytosine in the CpG context affects the accuracy of methylation prediction. For illustrative purposes, windows of, but not limited to, 2 nucleotides (nt), 3 nt, 4 nt, 6 nt, 8 nt, 10 nt, 15 nt, and 20 nt upstream and downstream of the cytosine under investigation were considered. The AUC of the method using 2 nt upstream and downstream was only 0.50 in both the training and test datasets, whereas the AUC of the method using 15 nt upstream and downstream increased to 0.95 and 0.92, respectively. These results suggested that varying the length of the upstream and downstream regions flanking the analyzed cytosine makes it possible to find the optimal performance. In one embodiment, as shown in FIG. 21B, 3 nt upstream and downstream of the cytosine can be used to determine the methylation status, achieving an AUC of 0.89.

In one embodiment, asymmetric flanking sequences around the cytosine under investigation can be used to perform the analysis according to the embodiments of this disclosure. For example, 2 nt upstream of the cytosine can be used in combination with 1 nt, any of 3 nt to 20 nt, 25 nt, 30 nt, 35 nt, or 40 nt downstream; 3 nt upstream can be used in combination with 1 nt, 2 nt, any of 4 nt to 20 nt, 25 nt, 30 nt, 35 nt, or 40 nt downstream; and 4 nt upstream can be used in combination with 1 nt, 2 nt, 3 nt, any of 5 nt to 20 nt, 25 nt, 30 nt, 35 nt, or 40 nt downstream. As another example, 2 nt downstream of the cytosine can be used in combination with 1 nt, any of 3 nt to 20 nt, 25 nt, 30 nt, 35 nt, or 40 nt upstream; 3 nt downstream can be used in combination with 1 nt, 2 nt, any of 4 nt to 20 nt, 25 nt, 30 nt, 35 nt, or 40 nt upstream; and 4 nt downstream can be used in combination with 1 nt, 2 nt, 3 nt, any of 5 nt to 20 nt, 25 nt, 30 nt, 35 nt, or 40 nt upstream. Using the IPD, PW, strand information, and sequence context associated with n nt upstream and m nt downstream of the cytosine can provide improved accuracy in determining methylation status in certain embodiments. Such different measurement windows can be applied to the analysis of other types of base modifications, such as 5hmC, 6mA, 4mC, and oxoG, or any modification disclosed herein. Such different measurement windows can also incorporate DNA secondary structure analysis, such as of guanine quadruplexes and stem-loop structures. Examples of this are described above. Such secondary structure information can also be added as another column of the matrix.

FIGS. 22A and 22B show that it is feasible to determine methylation status using kinetic patterns associated with only the downstream bases, provided at least 3 downstream bases are used. According to the embodiments of this disclosure, using the features associated with the cytosine and its 3, 4, 6, 8, and 10 downstream bases, the AUCs for determining methylation status were 0.91, 0.92, 0.94, 0.94, and 0.94, respectively, in the training dataset and 0.87, 0.88, 0.90, 0.90, and 0.90, respectively, in the test dataset.

しかしながら、図２３Ａおよび図２３Ｂは、上流塩基に関連する特徴のみを使用する場合、メチル化状態を識別する能力が減少しているように見えることを示す。訓練データセットおよび試験データセットにおいて、ＡＵＣは、２～１０上流塩基についてすべて０．５０であった。 However, Figures 23A and 23B show that the ability to discriminate methylation status appears to be diminished when using only features associated with upstream bases. In the training and test datasets, AUCs were all 0.50 for 2-10 upstream bases.

FIGS. 24 and 25 show that different combinations of upstream and downstream bases can be used to achieve optimal classification when determining methylation status. For example, features associated with the 8 bases upstream and 8 bases downstream of the cytosine achieved the best performance in this dataset, with AUCs of 0.94 and 0.91 for the training and test datasets, respectively.

FIG. 26 shows the relative importance of features for classifying methylation status at CpG sites. The "W" and "C" in parentheses indicate strand information: "W" denotes the Watson strand and "C" denotes the Crick strand. The importance of each feature, including sequence context, IPD, and PW, was determined using a random forest. The random forest analysis showed that the importance of the IPD and PW features peaked downstream of the cytosine under investigation, revealing that the main contribution to classification power came from the IPD and PW downstream of that cytosine.

The random forest consisted of multiple decision trees. During construction of each decision tree, Gini impurity was used to determine which decision logic to use at each decision node. Important features that have a greater impact on the final classification result are likely to be at nodes closer to the root of a decision tree, whereas unimportant features that have less impact on the final classification result are likely to be at nodes farther from the root. Feature importance can therefore be estimated by computing the average distance to the root across all decision trees in the random forest.
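The depth-based importance estimate described above can be sketched as follows. This is a minimal illustration, assuming each tree is stored as nested dictionaries with a `feature` name and `left`/`right` children; that representation and the feature names (`IPD_down1`, `PW_down1`, `base_up2`) are hypothetical and not part of the disclosure.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# A split node is {"feature": name, "left": node, "right": node}; a leaf is None.
# This representation is an illustrative assumption.
Node = Optional[dict]


def feature_depths(node: Node, depth: int, out: Dict[str, List[int]]) -> None:
    """Record the depth (distance from the root) of every split on each feature."""
    if node is None:
        return
    out[node["feature"]].append(depth)
    feature_depths(node["left"], depth + 1, out)
    feature_depths(node["right"], depth + 1, out)


def mean_depth_per_feature(forest: List[Node]) -> Dict[str, float]:
    """Average distance to the root over all trees; smaller suggests more important."""
    depths: Dict[str, List[int]] = defaultdict(list)
    for tree in forest:
        feature_depths(tree, 0, depths)
    return {name: sum(d) / len(d) for name, d in depths.items()}


def leaf_split(feature: str) -> dict:
    """Helper: a split node whose two children are leaves."""
    return {"feature": feature, "left": None, "right": None}


# Two toy trees with hypothetical feature names.
tree_1 = {"feature": "IPD_down1", "left": leaf_split("PW_down1"),
          "right": leaf_split("base_up2")}
tree_2 = {"feature": "IPD_down1",
          "left": {"feature": "base_up2", "left": leaf_split("PW_down1"), "right": None},
          "right": None}
scores = mean_depth_per_feature([tree_1, tree_2])
```

In this toy forest the downstream IPD feature always splits at the root (mean depth 0), so it would rank as the most important feature.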

In some embodiments, the consensus of methylation calls at a CpG site between the Watson and Crick strands can further be used to improve specificity. For example, a site may be required to show methylation on both strands to be called methylated, and to show no methylation on both strands to be called unmethylated. Because methylation at CpG sites is known to be typically symmetric, confirmation from each strand can improve specificity.
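A minimal sketch of such a two-strand consensus rule is given below. The 0.5 probability cutoff and the choice to leave discordant sites uncalled are assumptions made for illustration; the text above does not specify how disagreement between strands is handled.

```python
from typing import Optional


def consensus_call(p_watson: float, p_crick: float, cutoff: float = 0.5) -> Optional[str]:
    """Call a CpG site only when both strands agree; otherwise return None."""
    watson_methylated = p_watson >= cutoff
    crick_methylated = p_crick >= cutoff
    if watson_methylated and crick_methylated:
        return "methylated"
    if not watson_methylated and not crick_methylated:
        return "unmethylated"
    return None  # strands disagree, so the site is left uncalled (an assumption)


# Made-up per-strand methylation probabilities for three CpG sites.
calls = [consensus_call(0.97, 0.91), consensus_call(0.08, 0.12), consensus_call(0.93, 0.20)]
```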

In various embodiments, global kinetic features from the whole molecule can be used to determine methylation status. For example, methylation across a whole molecule affects the kinetics of the whole molecule during single-molecule real-time sequencing. Modeling the sequencing kinetics of the entire template DNA molecule, including IPD, PW, fragment size, strand information, and sequence context, can improve the accuracy of classifying whether a molecule is methylated. As an example, the measurement window can be the entire template molecule. Statistics of IPD, PW, or other kinetic features (e.g., mean, median, mode, percentiles) can be used to determine the methylation of the whole molecule.
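As an illustration of molecule-level summary statistics, the following sketch computes the mean, median, mode, and selected percentiles of the IPD values of one molecule. The function name, the nearest-rank percentile definition, and the toy values are assumptions.

```python
import math
import statistics
from typing import Dict, Sequence


def molecule_summary(values: Sequence[float], percentiles=(25, 75)) -> Dict[str, float]:
    """Summarize one kinetic feature (e.g., IPD) over a whole template molecule."""
    ordered = sorted(values)
    summary = {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
    }
    # Nearest-rank percentile: the smallest value with at least p% of data at or below it.
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        summary[f"p{p}"] = ordered[rank - 1]
    return summary


# Made-up IPD values for the bases of one molecule.
ipds = [0.2, 0.4, 0.4, 0.9, 1.6, 2.5, 0.4, 1.1]
summary = molecule_summary(ipds)
```

The resulting dictionary could serve as a compact feature vector for a molecule-level classifier, alongside fragment size and strand information.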

B. Limitations of Other Analytical Techniques
The sensitivity of IPD-based detection of methylation at particular Cs in particular sequence motifs has been reported to be very low, for example only 1.9% (Clark et al., 2013). We also attempted to reproduce such an analysis by combining different sequence motifs with IPD, using only IPD cutoffs, without the PW metric and without the data structures described herein. For example, the 3 nt upstream and 3 nt downstream flanking each CpG under investigation were extracted. The IPDs of that CpG were stratified into different groups (4096 groups for the 6 positions) according to the context of the 6 nt of flanking sequence centered on that CpG (i.e., 3 nt upstream and 3 nt downstream). The IPDs of methylated and unmethylated CpGs within the same sequence motif were compared using ROC analysis. For example, comparing the IPDs of CpGs in the unmethylated "AATCGGAC" motif and the methylated "AAT^mCGGAC" motif gave an AUC of 0.48. Thus, using cutoffs within specific sequence groups did not work as well as embodiments that use a variety of features.
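The motif-stratified comparison described above can be sketched as follows: group CpG sites by their 6 nt of flanking context, then compute a per-group AUC from the IPDs of methylated versus unmethylated sites. The helper names and toy IPD values are assumptions; the AUC is computed by direct pairwise comparison, which is adequate for a small example but would be slow on millions of sites.

```python
from collections import defaultdict
from typing import Dict, List


def flank_key(seq: str, cpg_index: int, flank: int = 3) -> str:
    """Return `flank` nt upstream of the C plus `flank` nt downstream of the G."""
    upstream = seq[cpg_index - flank:cpg_index]
    downstream = seq[cpg_index + 2:cpg_index + 2 + flank]
    return upstream + downstream


def auc(positives: List[float], negatives: List[float]) -> float:
    """Probability that a random positive exceeds a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


n_groups = 4 ** 6  # four possible bases at each of the six flanking positions

# Toy observations: (sequence, index of the C in the CpG, IPD, is_methylated).
observations = [("AATCGGAC", 3, 1.0, True), ("AATCGGAC", 3, 0.4, True),
                ("AATCGGAC", 3, 0.8, False), ("AATCGGAC", 3, 0.6, False)]

groups: Dict[str, Dict[bool, List[float]]] = defaultdict(lambda: {True: [], False: []})
for seq, idx, ipd, methylated in observations:
    groups[flank_key(seq, idx)][methylated].append(ipd)

per_motif_auc = {key: auc(g[True], g[False]) for key, g in groups.items()}
```

An AUC near 0.5 within a motif group, as in this toy case, means the IPD at the CpG alone does not separate the two states for that context.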

FIG. 27 shows the performance of the motif-based IPD analysis described above for detecting methylation without the use of the pulse width signal (Beckmann et al. BMC Bioinformatics. 2014). The vertical bars represent the average AUC across different k-mer motifs flanking the CpG site under investigation, where k is the number of bases surrounding that site. FIG. 27 shows that the average AUC for IPD-based discrimination between methylated and unmethylated cytosines across different k-mer motifs (e.g., 2-mer, 3-mer, 4-mer, 6-mer, 8-mer, 10-mer, 15-mer, and 20-mer surrounding the CpG site in question) was found to be less than 60%. These results suggest that considering the IPD of a candidate nucleotide in a given motif context, without considering the IPDs of neighboring nucleotides (Flusberg et al., 2010), is inferior to the methods disclosed herein for determining CpG methylation.

We also tested the method described in the study by Flusberg et al. (Flusberg et al., 2010). A total of 5,948,348 DNA segments were analyzed, each spanning 2 nt upstream and 6 nt downstream of the cytosine subjected to methylation analysis. Of these, 2,828,848 segments were methylated and 3,119,500 segments were unmethylated. As shown in FIG. 28, the signals derived from principal component analysis using IPD and PW were found to overlap substantially between fragments with methylated cytosines (mC) and fragments with unmethylated cytosines (C), suggesting that the method described by Flusberg et al. lacks practically meaningful accuracy. These results suggest that principal component analysis based on a linear combination of PW and IPD values at a base and its neighboring bases, as used in the study by Flusberg et al. (Flusberg et al., 2010), cannot reliably or meaningfully distinguish 5-methylcytosine from unmethylated cytosine.

FIG. 29 shows that the method based on principal component analysis, in which two principal components were used as in the study by Flusberg et al. (Flusberg et al., 2010) with IPD and PW (AUC: 0.55), is far less accurate than the convolutional neural network-based approach using IPD, PW, and sequence context presented in this disclosure (AUC: 0.94).

C. Other Mathematical/Statistical Models
In another embodiment, other mathematical/statistical models, including but not limited to random forest and logistic regression, can be trained by adapting the features developed above. As for the CNN model, the training and test datasets were constructed from DNA subjected to M.SssI treatment (methylated) and PCR amplification (unmethylated), and these were used to train a random forest (Breiman, 2001). In this random forest analysis, each nucleotide was described by six features: IPD, PW, and a 4-component binary vector encoding base identity. In such a binary vector, A, C, G, and T are encoded as [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively. For each CpG site analyzed, we incorporated the information from 10 nt upstream and 10 nt downstream on both strands to form a 252-dimensional (252D) vector in which each feature represents one dimension. The training dataset described above, represented as 252D vectors, was used to train a random forest model as well as a logistic regression model. The trained models were used to predict the methylation status in an independent test dataset. The random forest consisted of 100 decision trees. Bootstrap samples were used during tree construction. When splitting a node of a decision tree, Gini impurity was used to determine the optimal split, and at most 15 features were considered at each split. In addition, each leaf of a decision tree was required to contain at least 60 samples.
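The 252 dimensions follow from 21 positions (10 upstream, the site itself, 10 downstream) times 6 features per position times 2 strands. The sketch below assembles such a vector. The ordering of features within the vector and the toy values are assumptions, since the text does not specify the layout.

```python
from typing import List, Sequence, Tuple

# One-hot base encoding as given in the text.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

Position = Tuple[str, float, float]  # (base, IPD, PW); the layout is an assumption


def encode_position(base: str, ipd: float, pw: float) -> List[float]:
    """Six features per nucleotide: IPD, PW, and the 4-component base vector."""
    return [ipd, pw] + ONE_HOT[base]


def encode_site(watson: Sequence[Position], crick: Sequence[Position]) -> List[float]:
    """Concatenate per-position features over both strands (ordering is assumed)."""
    vector: List[float] = []
    for strand in (watson, crick):
        for base, ipd, pw in strand:
            vector.extend(encode_position(base, ipd, pw))
    return vector


# Toy 21-position windows (10 upstream, the site, 10 downstream) with made-up values.
watson = [("A", 0.5, 0.2)] * 10 + [("C", 1.2, 0.4)] + [("G", 0.8, 0.3)] * 10
crick = [("T", 0.6, 0.2)] * 10 + [("C", 1.1, 0.5)] + [("G", 0.7, 0.3)] * 10
vector = encode_site(watson, crick)
```

If scikit-learn were used (the source does not name a library), the described forest would correspond roughly to `RandomForestClassifier(n_estimators=100, criterion="gini", max_features=15, min_samples_leaf=60, bootstrap=True)`.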

FIGS. 30A and 30B show the performance of methods using random forest and logistic regression for methylation prediction. FIG. 30A shows the AUC values of CNN, random forest, and logistic regression for the training dataset. FIG. 30B shows the AUC values of CNN, random forest, and logistic regression for the test dataset. The method using random forest achieved AUCs of 0.93 and 0.86 in the training and test datasets, respectively.

A logistic regression model was trained using the training dataset described above with the same 252D vectors. The trained model was used to predict the methylation status in an independent test dataset. A logistic regression model with L2 regularization (Ng and Y., 2004) was fitted to the training dataset. As shown in FIGS. 30A and 30B, the method using logistic regression achieved AUCs of 0.87 and 0.83 in the training and test datasets, respectively.

Thus, these results suggest that certain models other than CNN (for example, but not limited to, random forest and logistic regression) can be used for methylation analysis with the features and analysis protocols developed in this disclosure. These results also suggest that the CNN implemented according to embodiments of the present disclosure, with an AUC of 0.90 in the test dataset (FIG. 30B), outperformed both random forest (AUC: 0.86) and logistic regression (AUC: 0.83).

D. Determination of 6mA Modification of Nucleic Acids
In addition to methylated CpGs, the methods described herein can also detect other DNA base modifications. For example, methylated adenine, including in the form of 6mA, can be detected.

1. Detection of 6mA Using Kinetic Features and Sequence Context
To evaluate the performance and utility of the disclosed embodiments for determining base modifications of nucleic acids, we further analyzed N6-adenine methylation (6mA). In one embodiment, about 1 ng of human DNA (e.g., extracted from placental tissue) was amplified to obtain 100 ng of DNA product through whole-genome amplification using unmethylated adenine (uA), unmethylated cytosine (C), unmethylated guanine (G), and unmethylated thymine (T).

FIG. 31A shows an example of one approach for generating molecules with unmethylated adenines by whole-genome amplification. In the figure, "uA" denotes unmethylated adenine and "mA" denotes methylated adenine. Whole-genome amplification was performed using exonuclease-resistant thiophosphate-modified random hexamers as primers, which bind randomly across the genome and allow a polymerase (e.g., Phi29 DNA polymerase) to amplify the DNA (e.g., by isothermal linear amplification). At stage 3102, the double-stranded DNA is denatured. At stage 3106, the amplification reaction is initiated when a number of random hexamers (e.g., 3110) anneal to the denatured template DNA (i.e., single-stranded DNA). As shown at 3114, when hexamer-mediated DNA synthesis of strand 3118 proceeds in the 5' to 3' direction and reaches the next hexamer-mediated DNA synthesis site, the polymerase displaces the newly synthesized DNA strand (3122) and continues strand extension. The displaced strand becomes a single-stranded DNA template to which random hexamers can bind again and initiate new DNA synthesis. Repeated hexamer annealing and strand displacement in an isothermal process produce a high yield of amplified DNA product. The amplification described here may fall under the technique of multiple displacement amplification (MDA).

The amplified DNA product was further fragmented into fragments with sizes of, for example, but not limited to, 100 bp, 200 bp, 300 bp, 400 bp, 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1 kb, 5 kb, 10 kb, 20 kb, 30 kb, 40 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, or other desired size ranges. The fragmentation process may include enzymatic digestion, nebulization, hydrodynamic shearing, sonication, and the like. As a result, original base modifications such as 6mA can be nearly eliminated by whole-genome amplification with unmethylated A (uA). FIG. 31A shows possible fragments (3126, 3130, and 3134) of the DNA product, in which both strands carry unmethylated A. Such whole-genome amplified DNA product without mA was subjected to single-molecule real-time sequencing to generate the uA dataset.

FIG. 31B shows an example of one approach for generating molecules with methylated adenines by whole-genome amplification. In the figure, "uA" denotes unmethylated adenine and "mA" denotes methylated adenine. About 1 ng of human DNA was amplified to obtain 10 ng of DNA product through whole-genome amplification using 6mA and unmethylated C, G, and T. Methylated adenine can be produced through a series of chemical reactions (JD Engel et al. J Biol Chem. 1978;253:927-34). As shown in FIG. 31B, whole-genome amplification was performed using exonuclease-resistant thiophosphate-modified random hexamers as primers, which, as in FIG. 31A, bind randomly across the genome and allow a polymerase (e.g., Phi29 DNA polymerase) to amplify the DNA (e.g., by isothermal linear amplification). Exonuclease-resistant thiophosphate-modified random hexamers are resistant to the 3' to 5' exonuclease activity of proofreading DNA polymerases. The random hexamers are therefore protected from degradation during amplification.

The amplification reaction was initiated when a number of random hexamers annealed to the denatured template DNA (i.e., single-stranded DNA). When hexamer-mediated DNA synthesis proceeds in the 5' to 3' direction and reaches the next hexamer-mediated DNA synthesis site, the polymerase displaces the newly synthesized DNA strand and continues strand extension. The displaced strand becomes a single-stranded DNA template to which random hexamers bind again and initiate new DNA synthesis. Repeated hexamer annealing and strand displacement in an isothermal process produce a high yield of amplified DNA product.

The amplified DNA product was further fragmented into lengths of, for example, but not limited to, 100 bp, 200 bp, 300 bp, 400 bp, 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1 kb, 5 kb, 10 kb, 20 kb, 30 kb, 40 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, or other combinations. As shown in FIG. 31B, the amplified DNA product would contain different forms of methylation patterns across the adenine sites of each strand. For example, both strands of a double-stranded molecule may be methylated with respect to adenine (molecule I), which arises when both strands are derived from DNA synthesis during whole-genome amplification.

As another example, one strand of a double-stranded molecule may contain an interlaced methylation pattern across its adenine sites (molecule II). An interlaced methylation pattern is defined as a pattern comprising a mixture of methylated and unmethylated bases on a DNA strand. The following examples use an interlaced adenine methylation pattern, comprising a mixture of methylated and unmethylated adenines on a DNA strand. This type of double-stranded molecule (molecule II) may be generated because an unmethylated hexamer containing unmethylated adenines bound to a DNA strand and initiated DNA extension. Such an amplified DNA product, containing a hexamer with unmethylated adenines, would then be sequenced. Alternatively, this type of double-stranded molecule (molecule II) may be initiated by fragmented DNA from the original template DNA containing unmethylated adenines, because such fragmented DNA can bind to a DNA strand as a primer. Such an amplified DNA product, containing a portion of the original DNA with unmethylated adenines in the strand, would then be sequenced. Because the unmethylated hexamer primer makes up only a very small portion of the resulting DNA strand, most of the fragment still contains 6mA.

As another example, one strand of a double-stranded DNA molecule may be methylated across its adenine sites while the other strand is unmethylated (molecule III). This type of double-stranded molecule may be generated when an original DNA strand without methylated adenines serves as the template DNA molecule for producing a new strand with methylated adenines.

Both strands may also be unmethylated (molecule IV). This type of double-stranded molecule may result from the re-annealing of two original DNA strands that do not contain methylated adenines.

The fragmentation process can include enzymatic digestion, nebulization, hydrodynamic shearing, sonication, and the like. Such whole-genome amplified DNA product can be predominantly methylated at A sites. This DNA with mA was subjected to single-molecule real-time sequencing to generate the mA dataset.

For the uA dataset, 262,608 molecules with a median length of 964 bp were sequenced using single-molecule real-time sequencing. The median subread depth was 103x. Of the subreads, 48% could be aligned to the human reference genome using the BWA aligner (Li H et al. Bioinformatics. 2009;25:1754-60). As an example, the Sequel II System (Pacific Biosciences) can be used to perform single-molecule real-time sequencing. The fragmented DNA molecules were subjected to single-molecule real-time (SMRT) sequencing template construction using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences). The conditions for sequencing primer annealing and polymerase binding were calculated using SMRT Link v8.0 software (Pacific Biosciences). Briefly, sequencing primer v2 was annealed to the sequencing template, and the polymerase was then bound to the template using the Sequel II Binding and Internal Control Kit 2.0 (Pacific Biosciences). Sequencing was performed on a Sequel II SMRT Cell 8M. Sequencing movies were collected for 30 hours on the Sequel II System using the Sequel II Sequencing Kit 2.0 (Pacific Biosciences).

For the mA dataset, 804,469 molecules with a median length of 826 bp were sequenced using single-molecule real-time sequencing. The median subread depth was 34x. Of the subreads, 27% could be aligned to the human reference genome using the BWA aligner (Li H et al. Bioinformatics. 2009;25:1754-60).

In one embodiment, kinetic features including but not limited to IPD and PW were analyzed in a strand-specific manner. For the sequencing results derived from the Watson strand, 644,318 A sites without methylation, randomly selected from the uA dataset, and 718,586 A sites with methylation, randomly selected from the mA dataset, were used to form the training dataset. This training dataset was used to establish classification models and/or thresholds for distinguishing between methylated and unmethylated adenines. The test dataset consisted of 639,702 A sites without methylation and 723,320 A sites with methylation. This test dataset was used to validate the performance of the models/thresholds derived from the training dataset.

The sequencing results derived from the Watson strand were analyzed. FIG. 32A shows the interpulse duration (IPD) values across the training datasets of the uA and mA datasets. For the training dataset, the IPD values across the sequenced A sites were observed to be higher in the mA dataset (median: 1.09; range: 0 to 9.52) than in the uA dataset (median: 0.20; range: 0 to 9.52) (P value < 0.0001, Mann-Whitney U test).

FIG. 32B shows the IPD values for the test datasets of the uA and mA datasets. When the IPD values across the sequenced A sites in the test dataset were examined, the IPD values in the mA dataset were observed to be higher than those in the uA dataset (median 1.10 versus 0.19; P value < 0.0001, Mann-Whitney U test).

FIG. 32C shows the area under the receiver operating characteristic (ROC) curve using IPD cutoffs. The true positive rate is on the y-axis and the false positive rate is on the x-axis. The area under the ROC curve (AUC) for distinguishing sequenced A bases of template DNA molecules with and without methylation using the corresponding IPD values was 0.86 for both the training and test datasets.

In addition to the results from the Watson strand, the sequencing results derived from the Crick strand were analyzed. FIG. 33A shows the IPD values across the training datasets of the uA and mA datasets. For the training dataset, the IPD values across the sequenced A sites were observed to be higher in the mA dataset (median: 1.10; range: 0 to 9.52) than in the uA dataset (median: 0.19; range: 0 to 9.52) (P value < 0.0001, Mann-Whitney U test).

FIG. 33B shows the IPD values for the test datasets of the uA and mA datasets. Higher IPD values across the sequenced A sites were likewise observed in the mA dataset than in the uA dataset for the test dataset (median 1.10 versus 0.19; P value < 0.0001, Mann-Whitney U test).

FIG. 33C shows the area under the ROC curve. The true positive rate is on the y-axis and the false positive rate is on the x-axis. The area under the ROC curve (AUC) for distinguishing sequenced A bases of template DNA molecules with and without methylation using the corresponding IPD values was 0.86 and 0.87 for the training and test datasets, respectively.

FIG. 34 shows an illustration of 6mA determination for the Watson strand using a measurement window, according to embodiments of the present invention. Such a measurement window can include kinetic features, such as IPD and PW, and the nearby sequence context. The determination of 6mA can be performed in a manner similar to the determination of methylated CpG.

FIG. 35 shows an illustration of 6mA determination for the Crick strand using a measurement window, according to embodiments of the present invention. Such a measurement window can include kinetic features, such as IPD and PW, and the nearby sequence context.

As an example, a measurement window was constructed using the 10 bases on each side of the sequenced A base under investigation in the template DNA. Feature values including IPD, PW, and sequence context were used to train a model using a convolutional neural network (CNN) according to the methods disclosed herein. In other embodiments, the statistical models can include, but are not limited to, linear regression, logistic regression, deep recurrent neural networks (e.g., long short-term memory, LSTM), Bayesian classifiers, hidden Markov models (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithms, support vector machines (SVM), and the like.

FIGS. 36A and 36B show the probabilities of being methylated determined for sequenced A bases of the Watson strand in the uA and mA datasets using a measurement window-based CNN model. FIG. 36A shows the results for the training dataset, from which the CNN model was learned. As an example, the CNN model used two 1D convolutional layers (each with 64 filters of kernel size 4, followed by a ReLU (rectified linear unit) layer), followed by a dropout layer with a dropout rate of 0.5. A max pooling layer with a pool size of 2 was then used. The output then flowed into two further 1D convolutional layers (each with 128 filters of kernel size 2, followed by a ReLU layer), followed by another dropout layer with a dropout rate of 0.5. A max pooling layer with a pool size of 2 was again used. Finally, a fully connected layer of 10 neurons with a ReLU layer, followed by an output layer of one neuron with a sigmoid layer, yielded the probability of methylation. Other settings for the layers, filters, and kernel sizes can be adapted, for example, as described herein for other methylation (e.g., CpG). In this training dataset for the Watson strand sequencing results, 644,318 and 718,586 A bases from the unmethylated and methylated libraries, respectively, were used.
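To see how the layer settings above fit together, the following sketch traces the sequence length through the described layers. It assumes an input window of 21 positions (the A base plus 10 bases on each side), unpadded convolutions with stride 1, and non-overlapping pooling; none of these details are stated in the text, so the resulting sizes are illustrative only. Dropout, ReLU, and the dense layers do not change the sequence length and are omitted from the trace.

```python
from typing import List, Tuple


def conv1d_len(length: int, kernel: int) -> int:
    """Output length of an unpadded, stride-1 1D convolution (assumed settings)."""
    return length - kernel + 1


def pool_len(length: int, pool: int) -> int:
    """Output length of non-overlapping max pooling (assumed settings)."""
    return length // pool


def trace_shapes(window_len: int = 21) -> Tuple[List[Tuple[str, int, int]], int]:
    """Return (layer name, sequence length, channels) per layer and the flattened size."""
    shapes: List[Tuple[str, int, int]] = []
    length = window_len
    for _ in range(2):  # two conv layers, 64 filters, kernel size 4
        length = conv1d_len(length, 4)
        shapes.append(("conv1d_64", length, 64))
    length = pool_len(length, 2)  # max pooling after the first dropout
    shapes.append(("maxpool", length, 64))
    for _ in range(2):  # two conv layers, 128 filters, kernel size 2
        length = conv1d_len(length, 2)
        shapes.append(("conv1d_128", length, 128))
    length = pool_len(length, 2)  # max pooling after the second dropout
    shapes.append(("maxpool", length, 128))
    return shapes, length * 128  # flattened input to the 10-neuron dense layer


shapes, flat_size = trace_shapes()
```

Under these assumptions the sequence length shrinks from 21 to 2 before flattening, which shows that a 21-position window is long enough for the stated kernel and pool sizes.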

Based on the CNN model, for the Watson strand-related data, the sequenced A bases of template DNA molecules from the mA dataset had a much higher probability of methylation than the A bases in the uA dataset, in both the training and test datasets (P value < 0.0001, Mann-Whitney U test). For the training dataset, the median probability of methylation at A sites in the uA dataset was 0.13 (interquartile range, IQR: 0.09-0.15), whereas the median in the mA dataset was 1.000 (IQR: 0.998-1.000).

FIG. 36B shows the probability of methylation determined for the test dataset. For the test dataset, the median probability of methylation at A sites in the uA dataset was 0.13 (IQR: 0.10-0.15), whereas the median in the mA dataset was 1.000 (IQR: 0.997-1.000). FIGS. 36A and 36B show that a measurement window-based CNN model can be trained to detect methylation in the test dataset.

FIG. 37 is an ROC curve for the detection of 6mA using the measurement window-based CNN model for sequenced A bases of the Watson strand. The true positive rate is on the y-axis and the false positive rate is on the x-axis. The figure shows that the AUC values for differentiating sequenced A sites with and without methylation using the CNN model were 0.94 and 0.93 for the training and test datasets, respectively, each composed of Watson strand sequencing results. This suggests that it is feasible to use the present disclosure to determine the methylation status of A sites using Watson strand data. Using a determined methylation probability of 0.5 as the cutoff, a specificity of 99.3% and a sensitivity of 82.6% can be achieved for the detection of 6mA. FIG. 37 thus shows that the measurement window-based CNN model can detect 6mA with high specificity and sensitivity. The accuracy of the model can be compared with that of a technique using only the IPD metric.
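The sensitivity and specificity at a probability cutoff can be computed as in the following sketch. The function name and the use of a strict "greater than" comparison at the cutoff are assumptions; the input values in any usage would be model outputs and known labels, which are not reproduced here.

```python
def sensitivity_specificity(probs, labels, cutoff=0.5):
    """Return (sensitivity, specificity) for labels 1 (6mA) and 0 (uA)."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        called_methylated = p > cutoff  # assumption: strict inequality
        if y == 1 and called_methylated:
            tp += 1
        elif y == 1:
            fn += 1
        elif called_methylated:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Sweeping `cutoff` from 0 to 1 and plotting sensitivity against 1 - specificity gives an ROC curve of the kind shown in the figure.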

FIG. 38 shows a performance comparison between IPD metric-based 6mA detection and measurement window-based 6mA detection. Sensitivity is plotted on the y-axis and specificity on the x-axis. FIG. 38 shows that the performance of measurement window-based 6mA classification according to the present disclosure (AUC: 0.94) was superior to that of the conventional method using only the IPD metric (AUC: 0.87) (P value < 0.0001, DeLong's test). The measurement window-based CNN model thus outperformed IPD metric-based detection.
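For readers who want to reproduce an AUC comparison of this kind, the sketch below computes AUC as the probability that a methylated site scores higher than an unmethylated site, with ties counted as one half. It is a simple quadratic-time illustration with an assumed function name; it does not implement DeLong's test, which would be needed to compare two AUC values statistically.

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

The same function can be applied to CNN probabilities and to raw IPD values for the same sites to obtain the two AUC values being compared.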

FIGS. 39A and 39B show the probabilities of being methylated determined for sequenced A bases of the Crick strand in the uA and mA datasets, using the measurement window-based CNN model. FIG. 39A shows the training dataset and FIG. 39B shows the test dataset. Both figures plot the probability of methylation on the y-axis. FIGS. 39A and 39B show that, based on the CNN model, for the Crick strand-related data, the sequenced A bases of template DNA molecules from the mA dataset had a much higher probability of methylation than the A bases in the uA dataset, in both the training and test datasets (P value < 0.0001, Mann-Whitney U test).

FIG. 40 shows the performance of 6mA detection using the measurement window-based CNN model for sequenced A bases of the Crick strand. The true positive rate is on the y-axis and the false positive rate is on the x-axis. FIG. 40 shows that the AUC values for differentiating sequenced A sites with and without methylation using the CNN model were 0.95 and 0.94 for the training and test datasets, respectively, each composed of Crick strand sequencing results. The performance of the CNN approach disclosed herein (AUC: 0.94) was also shown to be superior to that of the IPD metric alone (AUC: 0.87) (P value < 0.0001). This result suggests that it is feasible to use the present disclosure to determine the methylation status of A sites using Crick strand data. Using a determined methylation probability of 0.5 as the cutoff, a specificity of 99.3% and a sensitivity of 83.0% can be achieved for the detection of 6mA. FIG. 40 shows that the measurement window-based CNN model can detect 6mA with high specificity and sensitivity.

FIG. 41 shows examples of the methylation states across the A bases of molecules comprising a Watson strand and a Crick strand. White dots represent unmethylated adenines. Black dots represent methylated adenines. A horizontal line with dots represents one strand of a double-stranded DNA molecule. Molecule 1 shows that both the Watson and Crick strands were determined to be methylated across the A bases. Molecule 2 shows that the Watson strand was almost entirely unmethylated, whereas the Crick strand was almost entirely methylated. Molecule 3 shows that both the Watson and Crick strands were determined to be almost entirely methylated across the A bases.

2. Enhanced Training Using Selective Datasets
As shown in FIGS. 36A, 36B, 39A, and 39B, there was a bimodal distribution of methylation probabilities across the sequenced A bases of template DNA molecules in the mA dataset. In other words, some molecules in the mA dataset carried uA signals. This was further evidenced by the presence of fully unmethylated and hemimethylated molecules in the mA dataset (FIG. 41). One possible reason is that 6mA-containing molecules reduce the efficiency of DNA amplification during the whole genome amplification step, so that uA-containing molecules from the original DNA template still make up a considerable portion of the mA dataset after whole genome amplification. This explanation is supported by the fact that 1 ng of genomic DNA amplified with 6mA yielded only 10 ng of DNA product, whereas 1 ng of genomic DNA amplified with unmethylated A yielded 100 ng of DNA product under the same amplification conditions. Thus, for the mA dataset, the original template DNA molecules, in which adenines are mostly unmethylated (e.g., 0.051% methylated) (Xiao CL et al. Mol Cell. 2018;71:306-318), would account for approximately 10% of the total adenines.

In one embodiment, when training a CNN model to differentiate between mA and uA, A bases with relatively high IPD values in the mA dataset are used selectively, so as to reduce the influence of uA data on the training of the model for mA detection. Only A bases with IPD values above a certain cutoff value may be used. The cutoff value may correspond to a percentile. In one embodiment, those A bases in the mA dataset with IPD values greater than the value at the 10th percentile would be used. In some embodiments, those A bases with IPD values greater than the value at the 1st, 5th, 15th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th, or 95th percentile would be used. The percentile may be based on data from all nucleic acid molecules in a reference sample or in multiple reference samples.
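The percentile-based selection described above might look like the following sketch. The function names, the `(site_id, ipd)` tuple layout, and the use of linear interpolation between ranks to define the percentile are assumptions made for illustration; the disclosure does not specify how the percentile is computed.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_high_ipd(sites, reference_ipds, q=10):
    """Keep (site_id, ipd) pairs whose IPD exceeds the q-th percentile.

    reference_ipds is the pool of IPD values that defines the percentile,
    e.g. all A bases of a reference sample (hypothetical input).
    """
    cutoff = percentile(reference_ipds, q)
    return [site for site in sites if site[1] > cutoff]
```

Only the mA training sites returned by `select_high_ipd` would then be passed to model training, while the uA dataset is left unchanged.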

FIG. 42 shows the performance of enhanced training by selectively using A bases in the mA dataset with IPD values greater than the 10th percentile. FIG. 42 shows the true positive rate on the y-axis and the false positive rate on the x-axis. The figure shows that when A bases in the mA dataset with IPD values greater than the 10th percentile were used to train the CNN model, the AUC for differentiating between mA and uA bases increased to 0.98, which was superior to that of the model trained on data without selection by IPD value prior to training (AUC: 0.94). This suggests that using IPD values to select mA sites when constructing the training dataset helps improve discriminative power.

To further confirm the presence of molecules with uA bases in the mA dataset, we hypothesized that the percentage of uA in the mA dataset would increase in wells with more subreads, because 6mA present in a molecule slows polymerase elongation during generation of the new strand, compared with molecules without 6mA.

FIG. 43 shows a graph of the percentage of unmethylated adenines in the mA dataset against the number of subreads in each well. The y-axis shows the percentage of uA in the mA dataset. The x-axis shows the number of subreads in each well. The test dataset was reanalyzed using the enhanced model, which was trained on mA sites after removing A sites with IPD values below the 10th percentile. A gradual increase in uA (from 14.6% to 55.05%) was observed as the number of subreads per well increased (including 1 to 10, 10 to 20, 40 to 50, 60 to 70, and more than 70 subreads per sequencing well). Thus, wells with a high number of subreads tend to have a lower proportion of mA. Methylation of A can slow the progress of the sequencing reaction. Therefore, sequencing wells with a greater subread depth are more likely to contain molecules that are unmethylated at A. This behavior can be exploited to detect unmethylated molecules using a cutoff on the number of subreads associated with a molecule; for example, molecules with more than 70 subreads may mostly be identified as unmethylated.
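A per-bin summary of this kind could be computed as sketched below. The bin boundaries (upper-inclusive bins of width 10, with a final ">70" bin), the function names, and the input tuple layout are assumptions chosen to mirror the ranges mentioned above; the actual binning used for the figure is not specified.

```python
from collections import defaultdict


def bin_label(n_subreads: int) -> str:
    """Map a subread count (>= 1) to a bin label such as '1-10' or '>70'."""
    if n_subreads > 70:
        return ">70"
    lo = ((n_subreads - 1) // 10) * 10
    return f"{max(lo, 1)}-{lo + 10}"


def ua_percentage_by_bin(wells):
    """wells: iterable of (n_subreads, n_unmethylated_A, n_total_A)."""
    totals = defaultdict(lambda: [0, 0])
    for n_subreads, n_ua, n_a in wells:
        acc = totals[bin_label(n_subreads)]
        acc[0] += n_ua
        acc[1] += n_a
    return {label: 100.0 * ua / a
            for label, (ua, a) in totals.items() if a > 0}
```

A rising percentage across bins would correspond to the trend described for the figure; a simple subread-count cutoff (for example, more than 70) could then be applied per molecule.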

FIG. 44 shows the pattern of methyladenine between the Watson and Crick strands of double-stranded DNA molecules in the test dataset. Because methylation of A is asymmetric, it behaves differently between the two strands. Most molecules were methylated through incorporation of mA, with unmethylated A remaining in some. The y-axis shows the methyladenine level of the Crick strand. The x-axis shows the methyladenine level of the Watson strand. Each dot represents a double-stranded molecule. Using the enhanced model trained on selected mA sites, double-stranded molecules can be classified into different groups according to the methylation level of each strand, as follows:
(a) For a double-stranded DNA molecule, the methyladenine levels of the Watson and Crick strands were both greater than 0.8. Such a double-stranded molecule was defined as a fully methylated molecule with respect to adenine sites (FIG. 44, region A). The methyladenine level of a strand was defined as the percentage of A sites determined to be methylated among all A sites of that strand.
(b) For a double-stranded DNA molecule, the methyladenine level of one strand was greater than 0.8, whereas that of the other strand was less than 0.2. Such a molecule was defined as a hemimethylated molecule with respect to adenine sites (FIG. 44, regions B1 and B2).
(c) For a double-stranded DNA molecule, the methyladenine levels of the Watson and Crick strands were both less than 0.2. Such a double-stranded molecule was defined as a fully unmethylated molecule with respect to adenine sites (FIG. 44, region C).
(d) For a double-stranded DNA molecule, the methyladenine levels of the Watson and Crick strands did not fall into groups a, b, or c. Such a double-stranded molecule was defined as a molecule with an interlaced methylation pattern with respect to adenine sites (FIG. 44, region D). An interlaced methylation pattern was defined as a mixture of methylated and unmethylated adenines present on a DNA strand.
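The four groups (a) to (d) above can be expressed as a small decision function. This is a minimal sketch: the function names and the representation of per-site calls as booleans are assumptions, while the 0.8 and 0.2 thresholds are those given in the text.

```python
def methyladenine_level(calls):
    """Fraction of A sites on one strand called methylated (True)."""
    if not calls:
        raise ValueError("strand has no A sites")
    return sum(1 for c in calls if c) / len(calls)


def classify_molecule(watson_level, crick_level, high=0.8, low=0.2):
    """Assign a double-stranded molecule to one of the groups (a)-(d)."""
    if watson_level > high and crick_level > high:
        return "fully methylated"        # group (a), region A
    if watson_level < low and crick_level < low:
        return "fully unmethylated"      # group (c), region C
    if (watson_level > high and crick_level < low) or \
       (crick_level > high and watson_level < low):
        return "hemimethylated"          # group (b), regions B1 and B2
    return "interlaced"                  # group (d), region D
```

The `high` and `low` parameters can be changed to apply alternative cutoffs for defining methylated and unmethylated strands.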

In some other embodiments, the cutoff for the methyladenine level used to define an unmethylated strand may be, but is not limited to, less than 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, or 0.5. The cutoff for the methyladenine level used to define a methylated strand may be, but is not limited to, greater than 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, or 0.99.

FIG. 45 is a table showing the percentages of fully unmethylated molecules, hemimethylated molecules, fully methylated molecules, and molecules with interlaced methyladenine patterns in the training and test datasets. With respect to adenine sites, the molecules in the test dataset could be classified into fully unmethylated molecules (7.0%), hemimethylated molecules (9.8%), fully methylated molecules (79.4%), and molecules with interlaced methyladenine patterns (3.7%). These results were comparable to those for the training dataset, which, with respect to adenine sites, contained fully unmethylated molecules (7.0%), hemimethylated molecules (10.0%), fully methylated molecules (79.4%), and molecules with interlaced methyladenine patterns (3.6%).

FIG. 46 shows representative examples of fully unmethylated molecules, hemimethylated molecules, fully methylated molecules, and molecules with interlaced methyladenine patterns with respect to adenine sites. White dots represent unmethylated adenines. Black dots represent methylated adenines. A horizontal line with dots represents one strand of a double-stranded DNA molecule.

In embodiments, the performance in differentiating between methylated and unmethylated adenines can be improved by increasing the purity of the 6mA bases used to train the CNN model. To this end, the duration of the DNA amplification reaction can be lengthened to increase the amount of newly generated DNA product, thereby diluting the effect of unmethylated adenines contributed by the original DNA template. In other embodiments, biotinylated bases can be incorporated during DNA amplification with 6mA. The newly generated 6mA-containing DNA products can then be pulled down and enriched using streptavidin-coated magnetic beads.

3. Use of 6mA Methylation Profiles
6mA modification of DNA is present in the genomes of bacteria, archaea, protists, and fungi (Didier W et al. Nat Rev Microbiol. 2009;4:183-192). It has also been reported that 6mA is present in the human genome, accounting for 0.051% of all adenines (Xiao CL et al. Mol Cell. 2018;71:306-318). Given the low 6mA content of the human genome, in one embodiment, a training dataset can be created by adjusting the proportion of 6mA in the dNTP mix (where N represents unmodified A, C, G, and T) during the whole genome amplification step. For example, a ratio of 6mA to dNTPs of 1:10, 1:100, 1:1000, 1:10000, 1:100000, or 1:1000000 can be used. In another embodiment, the adenine DNA methyltransferase M.EcoGII can be used to create a 6mA training dataset.

The amount of 6mA was lower in gastric and liver cancer tissues, and this downregulation of 6mA correlated with increased tumorigenesis (Xiao CL et al. Mol Cell. 2018;71:306-318). On the other hand, high levels of 6mA have been reported in glioblastoma (Xie et al. Cell. 2018;175:1228-1243). Thus, the approach to 6mA disclosed herein would be useful for studying cancer genomics (Xiao CL et al. Mol Cell. 2018;71:306-318; Xie et al. Cell. 2018;175:1228-1243). In addition, 6mA was found to be more prevalent and abundant in mammalian mitochondrial DNA and was shown to be associated with hypoxia (Hao Z et al. Mol Cell. 2020; doi:10.1016/j.molcel.2020.02.018). Thus, the approach to 6mA detection in the present disclosure would be useful for studying mitochondrial stress responses under different clinical conditions such as pregnancy, cancer, and autoimmune diseases.

IV. Results and Applications
A. Detection of Methylation
Detection of methylation at CpG sites using the methods described above was performed on a variety of biological samples and genomic regions. As an example, methylation determination using single-molecule real-time sequencing of cell-free DNA in the plasma of pregnant women was validated against methylation determination using bisulfite sequencing. The methylation results can be used for different applications, including determining copy number and diagnosing disorders. The methods described below are not limited to CpG sites and can be applied to any of the modifications described herein.

1. Detection of Methylation of Long DNA Molecules in Placental Tissue
Single-molecule real-time sequencing can sequence DNA molecules that are kilobases in length (Nattestad et al., 2018). By synergistically exploiting the long-read information of single-molecule real-time sequencing, deciphering the methylation states of CpG sites using the invention described herein makes it possible to infer haplotype information of methylation states. To demonstrate the feasibility of inferring the methylation states of long reads together with their haplotype information, placental tissue DNA was sequenced, yielding 478,739 molecules covered by 28,913,838 subreads. There were seven molecules larger than 5 kb in size. Each was covered, on average, by three subreads.

FIG. 47 shows the methylation states along a long DNA molecule (i.e., a haplotype block) 6,265 bp in size. It was sequenced in the ZMW with ZMW hole number m54276_180626_162240/40763503 and mapped to the genomic location chr1:113246546-113252811 in the human genome. "-" represents a non-CpG nucleotide. "U" represents the unmethylated state of a CpG site. "M" represents the methylated state of a CpG site. Region 4710, highlighted in yellow, indicates a CpG island region, which is generally known to be unmethylated (FIG. 47). The majority of CpG sites in that CpG island were deduced to be unmethylated (96%). In contrast, 75% of the CpG sites outside the CpG island were deduced to be unmethylated. These results suggest that the methylation level outside the CpG island (e.g., at the CpG island shores and shelves) is higher than that within the CpG island. The mixture of methylated and unmethylated states in the haplotype in the regions outside the CpG island indicates variability in methylation patterns. These observations were generally consistent with current understanding (Zhang et al., 2015; Feinberg and Irizarry, 2010). Thus, this disclosure allows different methylation states, including methylated and unmethylated states, to be called along a long molecule, meaning that the haplotype information of methylation states can be phased. Haplotype information refers to the association of the methylation states of CpG sites along a contiguous stretch of DNA.

In one embodiment, this approach can be used to analyze methylation states along haplotypes in order to detect and analyze imprinted regions. Imprinted regions are subject to epigenetic regulation that gives rise to methylation states in a parent-of-origin-dependent manner. For example, one important imprinted region is located on human chromosome 11p15.5 and contains the imprinted genes IGF2, H19, and CDKN1C (P57^kip2), which are strong regulators of fetal growth (Brioude et al, Nat Rev Endocrinol. 2018;14:229-249). Genetic and epigenetic abnormalities in imprinted regions may be associated with disease. Beckwith-Wiedemann syndrome (BWS) is an overgrowth syndrome in which patients often present with macroglossia, abdominal wall defects, hemihyperplasia, enlarged abdominal organs, and an increased risk of embryonal tumors in early childhood. BWS is thought to result from genetic or epigenetic defects within the 11p15.5 region (Brioude et al, Nat Rev Endocrinol. 2018;14:229-249). A region called ICR1 (imprinting control region 1), located between H19 and IGF2, is differentially methylated on the paternal allele. ICR1 directs the parent-of-origin-specific expression of IGF2. Genetic and epigenetic abnormalities in ICR1 therefore lead to aberrant expression of IGF2, which is one of the possible causes of BWS. Detection of methylation states along imprinted regions is thus clinically important.

We downloaded data for 92 imprinted genes from a public database that curates currently reported imprinted genes (http://www.geneimprint.org/). The regions 5 kb upstream and downstream of these imprinted genes were used for further analysis. Within these regions, 160 CpG islands are associated with the imprinted genes. We obtained 324,248 circular consensus sequences from a placental sample. After removing low-quality circular consensus sequences and those with only a short overlap with a CpG island (e.g., covering less than 50% of the length of the relevant CpG island), we obtained nine circular consensus sequences overlapping nine CpG islands corresponding to eight imprinted genes.
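The overlap filter mentioned above (discarding sequences that cover less than 50% of the relevant CpG island) can be sketched as follows. The function names and the use of half-open `[start, end)` coordinates on a single chromosome are assumptions for illustration.

```python
def overlap_fraction(read_start, read_end, island_start, island_end):
    """Fraction of the CpG island covered by the read (half-open intervals)."""
    if island_end <= island_start:
        raise ValueError("island must have positive length")
    overlap = max(0, min(read_end, island_end) - max(read_start, island_start))
    return overlap / (island_end - island_start)


def keep_read(read_start, read_end, island_start, island_end,
              min_fraction=0.5):
    """Keep a read only if it covers at least min_fraction of the island."""
    return overlap_fraction(read_start, read_end,
                            island_start, island_end) >= min_fraction
```

A read spanning the whole island returns 1.0, and a read that does not touch the island returns 0.0.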

FIG. 48 is a table showing nine DNA molecules that were sequenced by single-molecule real-time sequencing and that overlapped imprinted regions including H19, WT1-AS, WT1, DLK1, MEG3, ATP10A, LRRTM1, and MAGI2. The sixth column contains the DNA stretches overlapping the CpG islands within the imprinted regions. "U" represents an unmethylated cytosine in a CpG context. "M" represents a methylated cytosine in a CpG context. "*" represents a CpG site not covered by the sequencing results. "-" represents a nucleotide at a non-CpG site. If a molecule overlaps a single nucleotide polymorphism (SNP), the genotype is shown in parentheses. The seventh column shows the methylation state of the molecule as a whole. According to embodiments of the present disclosure, a molecule may be called methylated if the majority (e.g., more than 50%) of its CpG sites are shown to be methylated. Otherwise, it is called unmethylated.

Of the nine DNA molecules, five (55.6%) were called methylated, which did not deviate significantly from the expectation that 50% of the DNA molecules would be methylated. As shown in the sixth column of the table in FIG. 48, the majority of CpG sites were shown to be methylated or unmethylated in a concerted manner (i.e., as a methylation haplotype). In one embodiment, according to embodiments of the present disclosure, a molecule may be called methylated if the majority (e.g., more than 50%) of its CpG sites are shown to be methylated. Otherwise, it is called unmethylated. Other cutoffs can be used to determine whether a molecule is methylated, including, but not limited to, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the CpG sites in the analyzed molecule being deemed methylated.
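The molecule-level call described above can be illustrated with the symbol strings used in the table ("M", "U", "*", "-"). In this sketch, the function name is hypothetical, and excluding uncovered ("*") CpG sites from the denominator is an assumption; the disclosure does not state how uncovered sites are handled.

```python
def call_molecule(pattern: str, cutoff: float = 0.5):
    """Call a molecule methylated if more than `cutoff` of covered CpGs are 'M'.

    Returns (call, methylated_fraction), or None when no CpG site is covered.
    """
    n_meth = pattern.count("M")
    n_unmeth = pattern.count("U")
    covered = n_meth + n_unmeth  # '*' and '-' are ignored (assumption)
    if covered == 0:
        return None
    fraction = n_meth / covered
    call = "methylated" if fraction > cutoff else "unmethylated"
    return call, fraction
```

Changing `cutoff` applies the alternative thresholds listed above; note that this sketch uses a strict "more than" comparison, so an "at least" threshold would need `>=` instead.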

In another embodiment, molecules that simultaneously permit analysis of at least one SNP and at least one CpG site can be used to determine whether a region is associated with an imprinted region, or whether a known imprinted gene is aberrant (e.g., loss of imprinting). For purposes of illustration, FIG. 49 shows a first molecule from an imprinted region carrying allele "A" and a second molecule from that imprinted region carrying allele "G". Assuming that the imprinted region was paternally imprinted, the first molecule, from the maternal haplotype, was fully unmethylated, and the second molecule, from the paternal haplotype, was fully methylated. In one embodiment, such an assumption provides a ground truth for the methylation states and makes it possible to test the performance of base modification detection according to embodiments of the present disclosure.

FIG. 49 shows an example of determining methylation patterns in an imprinted region. DNA in a biological sample was extracted and ligated with hairpin adapters to form circular DNA molecules. The sequence information and base modifications (e.g., the methylation status of CpG sites) of these circular DNA molecules were unknown. The circular DNA molecules were subjected to single-molecule real-time sequencing. After the subreads were mapped to a reference genome, the IPD, PW, and sequence context were determined for the bases of each subread derived from those circular DNA molecules. In addition, the genotypes of those molecules were determined. The IPD, PW, and sequence context in the measurement window associated with each CG site would be compared with reference kinetic patterns according to embodiments of the present disclosure to determine the methylation status of each CpG. If two molecules with different alleles showed different methylation patterns, one fully unmethylated and the other fully methylated, the genomic region associated with these two molecules would be an imprinted region. In one embodiment, if such a genomic region happened to be a known imprinted region, as shown for example in FIG. 49, the methylation patterns of these two molecules matched the methylation patterns expected under normal circumstances (i.e., the ground truth). This may suggest the accuracy of the method for classifying methylation status according to embodiments of the present disclosure. In one embodiment, a deviation between the measured and the expected methylation patterns would indicate an imprinting abnormality, e.g., loss of imprinting.

FIG. 50 shows an example of determining methylation patterns in an imprinted region. In one embodiment, the imprinting pattern can be further determined by analyzing the methylation pattern of that region across a particular pedigree. For example, the methylation patterns and allelic information can be analyzed across the paternal genome, the maternal genome, and the offspring. Such a pedigree may further include the genomes of the paternal or maternal grandfather, the paternal or maternal grandmother, or other related genomes. In another embodiment, such an analysis can be extended to family trio (mother, father, and child) datasets in a particular population, for example by obtaining the methylation and genotype information of each individual according to embodiments presented herein.

As shown after classification, both the genotype (the allele in the box) and the methylation status can be determined. For each molecule, the methylation pattern across its sites (e.g., all methylated or all unmethylated) can be provided to identify the parent from which the molecule was inherited. Alternatively, a methylation density can be determined, and one or more cutoffs can be used to classify the molecule as hypermethylated (e.g., >80% or another percentage, from one parent) or hypomethylated (e.g., <20% or another percentage, from the other parent).
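The density-based classification and the allele comparison can be illustrated with a short sketch. It assumes, purely for illustration, that each molecule is available as an `(allele, site_calls)` pair with 0/1 calls; the helper names and the rule in `looks_imprinted` (two alleles, each consistently in one class, with opposite classes) are hypothetical simplifications of the reasoning in the text, using its example cutoffs of 80% and 20%.

```python
def methylation_density(site_calls):
    """Fraction of CpG sites called methylated (1) in one molecule."""
    return sum(site_calls) / len(site_calls)


def classify_density(density, hyper_cutoff=0.8, hypo_cutoff=0.2):
    """Classify a molecule by its methylation density."""
    if density > hyper_cutoff:
        return "hypermethylated"
    if density < hypo_cutoff:
        return "hypomethylated"
    return "intermediate"


def looks_imprinted(molecules):
    """Return True when two alleles show opposite methylation classes.

    `molecules` is a list of (allele, site_calls) pairs from one region.
    """
    classes_by_allele = {}
    for allele, calls in molecules:
        label = classify_density(methylation_density(calls))
        classes_by_allele.setdefault(allele, set()).add(label)
    if len(classes_by_allele) != 2:
        return False
    observed = list(classes_by_allele.values())
    if any(len(labels) != 1 for labels in observed):
        return False  # molecules of one allele disagree with each other
    return set().union(*observed) == {"hypermethylated", "hypomethylated"}


region = [("A", [0, 0, 0, 0, 0]), ("G", [1, 1, 1, 1, 1])]
print(looks_imprinted(region))
```

A real analysis would also need enough molecules per allele and a known parental origin of each allele; the sketch only captures the pattern comparison.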

2. Detection of Methylation of cfDNA Molecules
As another example, methylation of cell-free DNA (cfDNA) is increasingly recognized as an important molecular signal for non-invasive prenatal testing. For example, it has been shown that cfDNA molecules from regions with tissue-specific methylation can be used to determine the proportional contributions of different tissues, such as neutrophils, T cells, B cells, liver, and placenta, to the plasma of pregnant women (Sun et al., 2015). The feasibility of using plasma DNA methylation in pregnant women to detect trisomy 21 has also been demonstrated (Lun et al., 2013). cfDNA molecules in maternal plasma are fragmented, with a median size of 166 bp, which is much shorter than the artificially fragmented E. coli DNA of approximately 500 bp. cfDNA has been reported to be non-randomly fragmented; for example, the end motifs of plasma DNA are associated with the tissue of origin, such as the placenta. These characteristic properties give cell-free DNA a sequence context very different from that of artificially fragmented E. coli DNA. It therefore remained unclear whether polymerase kinetics would allow the methylation levels of cell-free DNA molecules to be estimated quantitatively. The disclosure in this patent application is applicable to methylation analysis of cell-free DNA in the plasma of pregnant women, for example, but not limited to, by using a methylation prediction model trained on the tissue DNA molecules described above.

Using single-molecule real-time sequencing, six plasma DNA samples from women pregnant with male fetuses were sequenced, yielding a median of 30,738,399 subreads (range: 1,431,215 to 105,835,846), corresponding to a median of 111,834 circular consensus sequences (CCSs) (range: 61,010 to 503,582). Each plasma DNA molecule was sequenced a median of 262 times (range: 173 to 320 times). The dataset was generated from DNA prepared with the Sequel I Sequencing Kit 3.0.

To evaluate the detection of methylation in cfDNA molecules, we analyzed the methylation of the above six plasma DNA samples from pregnant women using bisulfite sequencing (Jiang et al., 2014). A median of 66 million paired-end reads was obtained (range: 58 to 82 million). The median overall methylation level was found to be 69.6% (range: 67.1% to 72.0%).

FIG. 51 shows a comparison of the methylation levels estimated by the new approach and by conventional bisulfite sequencing. The y-axis is the methylation level predicted according to embodiments presented in this patent application. The x-axis is the methylation level estimated by bisulfite sequencing. For the plasma DNA results generated by single-molecule real-time sequencing, a median of 314,675 CpG sites (range: 144,546 to 1,382,568) was analyzed. The median proportion of CpG sites predicted to be methylated was 64.7% (range: 60.8% to 68.5%), which appeared comparable to the results estimated from bisulfite sequencing. As shown in FIG. 51, the overall methylation levels estimated by single-molecule real-time sequencing with this methylation prediction approach correlated well with those from bisulfite sequencing (r = 0.96, p-value = 0.0023).

Because the depth of the bisulfite sequencing was shallow, estimating the methylation level of each individual CpG in the human genome (i.e., the proportion of sequenced CpGs that are methylated) may not be robust. Instead, we calculated the methylation levels of regions containing multiple CpG sites by aggregating the read signals covering the CpG sites of genomic regions in which any two consecutive CpG sites were within 50 nt of each other and the number of CpG sites was at least 10. The percentage of sequenced cytosines among the total of sequenced cytosines and thymines across the CpG sites of a region indicated the methylation level of that region. The regions were divided into different groups according to their methylation levels. The probability of methylation predicted by the model learned from the previous training dataset (i.e., tissue DNA) rose as the methylation level increased (FIG. 52A). These results further suggested the feasibility and validity of using single-molecule real-time sequencing to predict the methylation status of cfDNA molecules from pregnant women. FIG. 52B shows that the methylation levels in 10 Mb genomic windows estimated using single-molecule real-time sequencing according to embodiments of the present disclosure correlated well with those obtained by bisulfite sequencing (r = 0.74, p-value < 0.0001).
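The region definition and the bisulfite-based methylation level described above can be sketched as follows. The function names, the input format (a list of CpG coordinates on one chromosome and pooled C/T counts), and the example coordinates are assumptions for illustration; the 50 nt spacing, the minimum of 10 CpG sites, and the C / (C + T) ratio come from the text.

```python
def cpg_blocks(positions, max_gap=50, min_sites=10):
    """Group CpG coordinates into blocks of closely spaced sites.

    Consecutive sites stay in the same block while they are at most
    `max_gap` nt apart; blocks with fewer than `min_sites` are dropped.
    """
    blocks, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                blocks.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        blocks.append(current)
    return blocks


def region_methylation_level(cytosine_count, thymine_count):
    """Bisulfite methylation level of a region: C / (C + T) over its CpG sites."""
    total = cytosine_count + thymine_count
    if total == 0:
        return None  # region not covered by any read
    return cytosine_count / total


# Hypothetical coordinates: 12 sites spaced 40 nt apart, plus 2 isolated sites.
sites = list(range(100, 100 + 12 * 40, 40)) + [5000, 5030]
print(len(cpg_blocks(sites)), region_methylation_level(30, 10))
```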

FIG. 53 shows that the genomic representation (GR) of the Y chromosome in the maternal plasma of pregnant women measured by single-molecule real-time sequencing correlated well with that measured by BS-seq (r = 0.97, P-value = 0.007). These results suggested that single-molecule real-time sequencing also allows accurate quantification of DNA molecules derived from non-hematopoietic tissues such as the placenta, which generally contribute a minority of the DNA. In other words, the present disclosure demonstrated the feasibility of simultaneously analyzing copy number aberrations and the methylation status of native molecules without base conversion or amplification prior to sequencing.

3. CpG Block-Based Methods
In some embodiments, methylation analysis can be performed on genomic regions that each contain multiple CpG sites, for example, but not limited to, 2, 3, 4, 5, 10, 20, 30, 40, 50, or 100 CpG sites. The size of such a genomic region can be, for example, but not limited to, 50, 100, 200, 300, or 500 nt. The distance between CpG sites in the region can be, for example, but not limited to, 10, 20, 30, 40, 50, 100, 200, or 300 nt. In one embodiment, any two consecutive CpG sites within 50 nt of each other may be combined to form a CpG block, such that the number of CpG sites in the block is 11 or more. In such block-based methods, multiple regions can be combined into one window represented as a single matrix, so that the regions are effectively processed together.

As an example, as shown in FIG. 54, the kinetics of all subreads associated with a CpG block were used for methylation analysis. The projected IPD profiles of the 10 nt upstream and downstream of each CpG within the block were artificially aligned to the CpG site to calculate an average IPD profile (FIG. 54). The word "projected" means that the kinetic signals of the subreads are aligned to each corresponding CpG site in question. The average IPD profiles of the CpG blocks were used to train a model (e.g., an artificial neural network, or ANN) for identifying the methylation status of each block. The ANN comprised an input layer, two hidden layers, and an output layer. Each CpG block was characterized by a feature vector of 21 IPD values that was input to the ANN. The first hidden layer contained 10 neurons with ReLU as the activation function. The second hidden layer contained 5 neurons with ReLU as the activation function. Finally, the output layer contained one neuron with a sigmoid activation function, which output the probability of methylation. CpG sites with a methylation probability greater than 0.5 were considered methylated; otherwise, they were considered unmethylated. The average IPD profile can be used to analyze the methylation status of a whole molecule. The whole molecule may be considered methylated if the number of methylated sites exceeds a threshold (e.g., 0, 1, 2, 3, etc.) or if the molecule has a certain methylation density.
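The network shape described above (21 inputs, hidden layers of 10 and 5 ReLU neurons, one sigmoid output) can be illustrated with a forward pass in plain Python. This is only a sketch of the architecture: the weights here are random and untrained, the helper names (`init_layer`, `dense`, `predict`) are hypothetical, and the training procedure, loss function, and input normalization are not specified in the text and are not modeled.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random placeholder weights (untrained) and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict(ipd_profile, layers):
    """Forward pass 21 -> 10 (ReLU) -> 5 (ReLU) -> 1 (sigmoid)."""
    if len(ipd_profile) != 21:
        raise ValueError("expected a feature vector of 21 IPD values")
    hidden = ipd_profile
    for (weights, biases), activation in zip(layers, (relu, relu, sigmoid)):
        hidden = dense(hidden, weights, biases, activation)
    return hidden[0]


rng = random.Random(0)
layers = [init_layer(21, 10, rng), init_layer(10, 5, rng), init_layer(5, 1, rng)]
probability = predict([1.0] * 21, layers)  # hypothetical flat IPD profile
print("methylated" if probability > 0.5 else "unmethylated")
```

With trained weights in place of the random ones, the final comparison against 0.5 corresponds to the decision rule stated in the text.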

The unmethylated and methylated libraries contained 9,678 and 9,020 CpG blocks, respectively, each block containing at least 10 CpG sites. These CpG blocks covered 176,048 and 162,943 CpG sites in the unmethylated and methylated libraries, respectively. As shown in FIGS. 55A and 55B, an overall accuracy of more than 90% was achieved in predicting the methylation status in both the training and the testing datasets. However, such embodiments relying on CpG blocks would greatly reduce the number of CpGs that can be assessed. By definition, the requirement for a minimum number of CpG sites restricts the methylation analysis to particular genomic regions (e.g., analysis of CpG islands).

B. Determining Origin or Disorder
Methylation profiles can be used to detect the tissue of origin or to determine the classification of a disorder. Methylation profile analysis can be used in combination with other clinical data, including imaging, conventional blood panels, and other medical diagnostic information. Methylation profiles can be determined using any method described herein.

1. Determination of Copy Number Aberrations
This section shows that SMRT sequencing is accurate for determining copy number, so that methylation profiles and copy number profiles can be analyzed simultaneously.

Copy number aberrations have been shown to be revealed by sequencing tumor tissues (Chan (2013)). Here, we show that cancer-associated copy number aberrations can be identified by sequencing tumor tissues using single-molecule real-time sequencing. For example, for case TBR3033, we obtained 589,435 and 1,495,225 consensus sequences for the tumor DNA and its paired adjacent non-tumor liver tissue DNA, respectively (at least 5 subreads were required to construct each consensus sequence). The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 1.0. In one embodiment, the genome was divided in silico into 2 Mb windows. The percentage of consensus sequences mapping to each window was calculated, giving the genomic representation (GR) at 2 Mb resolution. The GR can be determined from the number of reads at a location, normalized by the total number of sequence reads across the genome.

FIG. 56A shows the ratio of GR between tumor DNA and its paired adjacent non-tumor tissue DNA using single-molecule real-time sequencing. The copy number ratio between the tumor DNA and its paired adjacent normal tissue DNA is shown on the y-axis, and the genomic bin index of each 2 Mb window, covering chromosomes 1 to 22, is shown on the x-axis. In this figure, regions with a GR ratio below the 5th percentile of all 2 Mb windows were classified as having copy number losses, whereas regions with a GR ratio above the 95th percentile of all 2 Mb windows were classified as having copy number gains. Chromosome 13 showed copy number losses, whereas chromosome 20 showed copy number gains. These gains and losses are correct results.
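The GR calculation and the percentile-based classification can be sketched as below. The input format (one `(chromosome, start)` pair per consensus sequence), the function names, and the linear-interpolation percentile are assumptions made for this illustration; the 2 Mb window and the 5th and 95th percentile cutoffs are taken from the text.

```python
from collections import Counter

WINDOW = 2_000_000  # 2 Mb windows, as in the text


def genomic_representation(read_positions, window=WINDOW):
    """Fraction of reads falling in each (chromosome, window index) bin."""
    counts = Counter((chrom, start // window) for chrom, start in read_positions)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def classify_windows(tumor_gr, normal_gr):
    """Label each shared window as 'loss', 'gain', or 'neutral' by GR ratio."""
    ratios = {k: tumor_gr[k] / normal_gr[k] for k in tumor_gr if normal_gr.get(k)}
    low_cut = percentile(ratios.values(), 5)
    high_cut = percentile(ratios.values(), 95)
    return {
        k: "loss" if r < low_cut else "gain" if r > high_cut else "neutral"
        for k, r in ratios.items()
    }


reads = [("chr1", 100), ("chr1", 2_500_000), ("chr1", 3_000_000), ("chr2", 10)]
print(genomic_representation(reads))
```

Windows with no reads in the non-tumor sample are skipped here to avoid division by zero; how such windows were handled in the original analysis is not stated.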

FIG. 56B shows the ratio of GR between the tumor and its paired adjacent non-tumor tissue using bisulfite sequencing. The copy number ratio between the tumor DNA and its paired adjacent normal tissue DNA is shown on the y-axis, and the genomic bin index of each 2 Mb window, covering chromosomes 1 to 22, is shown on the x-axis. The copy number changes identified by single-molecule real-time sequencing in FIG. 56A were verified by the matched bisulfite sequencing results in FIG. 56B.

For case TBR3032, we obtained 413,982 and 2,396,054 consensus sequences for the tumor DNA and its paired adjacent non-tumor tissue DNA, respectively (at least 5 subreads were required to construct each consensus sequence). In one embodiment, the genome was divided in silico into 2 Mb windows. The percentage of consensus sequences mapping to each window, i.e., the 2 Mb genomic representation (GR), was calculated.

FIG. 57A shows the ratio of GR between tumor DNA and its paired adjacent non-tumor tissue DNA using single-molecule real-time sequencing. The copy number ratio between the tumor DNA and its paired adjacent normal tissue DNA is shown on the y-axis, and the genomic bin index of each 2 Mb window, covering chromosomes 1 to 22, is shown on the x-axis. In this figure, regions with a GR ratio below the 5th percentile of all 2 Mb windows were classified as having copy number losses, whereas regions with a GR ratio above the 95th percentile of all 2 Mb windows were classified as having copy number gains. Chromosomes 4, 6, 11, 13, 16, and 17 showed copy number losses, and chromosomes 5 and 7 showed copy number gains.

FIG. 57B shows the ratio of GR between the tumor and its paired adjacent non-tumor tissue using bisulfite sequencing. The copy number ratio between the tumor DNA and its paired adjacent normal tissue DNA is shown on the y-axis, and the genomic bin index of each 2 Mb window, covering chromosomes 1 to 22, is shown on the x-axis. The copy number changes identified by single-molecule real-time sequencing in FIG. 57A were verified by the matched bisulfite sequencing results in FIG. 57B.

Accordingly, methylation profiles and copy number profiles can be analyzed simultaneously. In this example, because the tumor purity of a tumor tissue is generally not 100%, amplified regions carry a relatively increased contribution of tumor DNA and deleted regions carry a relatively decreased contribution of tumor DNA. Because tumor genomes are characterized by global hypomethylation, amplified regions show lower methylation levels than deleted regions. As an illustration, for case TBR3033, the methylation level of chromosome 22 (copy number gain) measured using the present invention was 48.2%, which was lower than that of chromosome 3 (copy number loss; methylation level: 54.0%). For case TBR3032, the methylation level of the chromosome 5p arm (copy number gain) measured using the present invention was 46.5%, which was lower than that of the chromosome 5q arm (copy number loss; methylation level: 54.9%).

2. Plasma DNA Tissue Mapping in Pregnant Women
As shown in FIG. 58, given the accuracy of the methylation analysis, we reasoned that the plasma DNA methylation profile of a pregnant woman could be compared with the methylation profiles of different reference tissues (e.g., liver, neutrophils, lymphocytes, placenta, T cells, B cells, heart, brain). The DNA contributions of different cell types to the plasma DNA pool of a pregnant woman can therefore be estimated using the following procedure. The CpG methylation levels of a DNA mixture (e.g., plasma DNA), determined according to embodiments of the present disclosure, were recorded in a vector (X), and the reference methylation levels across different tissues, which can be quantified by, but not limited to, bisulfite sequencing, were recorded in a matrix (M). The proportional contributions (p) of the different tissues to the DNA mixture can be solved by, but not limited to, quadratic programming. Here, mathematical equations are used to illustrate the estimation of the proportional contributions of different organs to the DNA mixture. The mathematical relationship between the methylation densities of the different sites in the DNA mixture and the methylation densities of the corresponding sites in the different tissues can be expressed as:

$$X_i = \sum_k p_k \times M_{ik}$$

where $X_i$ represents the methylation density of CpG site $i$ in the DNA mixture, $p_k$ represents the proportional contribution of cell type $k$ to the DNA mixture, and $M_{ik}$ represents the methylation density of CpG site $i$ in cell type $k$. When the number of sites is equal to or greater than the number of organs, the individual $p_k$ values can be determined. To improve informativeness, CpG sites whose methylation levels showed small variability across all reference tissue types were excluded. In one embodiment, the analysis was performed using a specific set of CpG sites. For example, those CpG sites were characterized by a coefficient of variation (CV) of methylation levels across the various tissues of more than 30% and a difference between the maximum and minimum methylation levels among the tissues of more than 25%. In some other embodiments, a CV of 5%, 10%, 20%, 30%, 40%, 50%, 60%, 80%, 90%, 100%, 110%, 200%, 300%, etc. can also be used, and a difference between the maximum and minimum methylation levels among the tissues of more than 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, etc. can be used.

Additional criteria can be included in the algorithm to improve accuracy. For example, the aggregated contribution of all cell types can be constrained to be 100%, i.e.,

$$\sum_k p_k = 100\%$$

Furthermore, the contributions of all organs are required to be non-negative:

$$p_k \geq 0, \ \forall k$$

Due to biological variation, the observed overall methylation pattern may not be exactly identical to the methylation pattern deduced from the methylation of the tissues. In such circumstances, mathematical analysis is required to determine the most likely proportional contribution of each tissue. In this regard, the difference between the methylation pattern observed in the DNA and the methylation pattern deduced from the tissues is denoted by W:

$$W = X_i - \sum_k p_k \times M_{ik}$$

The most likely value of each $p_k$ can be determined by minimizing W, the difference between the observed and the deduced methylation patterns. This equation can be solved using mathematical algorithms, for example, but not limited to, quadratic programming, linear/non-linear regression, the expectation-maximization (EM) algorithm, maximum likelihood algorithms, maximum a posteriori estimation, and the least squares method.
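As a hedged illustration of this estimation, the sketch below minimizes the squared difference between the observed mixture methylation vector X and the values deduced from a reference matrix M, subject to non-negative contributions that sum to 1. It uses projected gradient descent as a simple stand-in for a quadratic programming solver; the function names, the iteration count, and the example reference values are all assumptions, not data or code from the disclosure.

```python
def project_to_simplex(v):
    """Euclidean projection onto {p : p_k >= 0, sum(p) = 1}."""
    ordered = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def estimate_contributions(x, m, iterations=2000):
    """Estimate p minimizing sum_i (x[i] - sum_k m[i][k] * p[k])**2.

    x: methylation density per CpG site in the mixture.
    m: m[i][k] is the methylation density of site i in tissue k.
    """
    n_sites, n_tissues = len(m), len(m[0])
    # Step size bounded by the squared Frobenius norm of m (safe upper bound).
    step = 1.0 / (2.0 * sum(value * value for row in m for value in row))
    p = [1.0 / n_tissues] * n_tissues
    for _ in range(iterations):
        residual = [
            x[i] - sum(m[i][k] * p[k] for k in range(n_tissues))
            for i in range(n_sites)
        ]
        gradient = [
            -2.0 * sum(m[i][k] * residual[i] for i in range(n_sites))
            for k in range(n_tissues)
        ]
        p = project_to_simplex([p[k] - step * gradient[k] for k in range(n_tissues)])
    return p


# Hypothetical reference methylation densities: 6 CpG sites x 3 tissues.
reference = [
    [0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.5], [0.2, 0.8, 0.5], [0.5, 0.5, 0.1],
]
true_p = [0.6, 0.3, 0.1]
mixture = [sum(row[k] * true_p[k] for k in range(3)) for row in reference]
print([round(value, 3) for value in estimate_contributions(mixture, reference)])
```

In practice a dedicated quadratic programming or constrained least-squares routine would usually be preferred; the projection step here simply enforces the two constraints stated above after every gradient update.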

As shown in FIG. 59, using the plasma DNA tissue mapping method shown in FIG. 58, the placental DNA contributions to the maternal plasma of women pregnant with male fetuses were observed to correlate well with the fetal DNA fractions estimated from Y chromosome reads. This result suggested the feasibility of using kinetics to trace the tissue of origin of plasma DNA in pregnant women.

3. Quantification of Methylation Levels of Regions
This section describes techniques for determining representative methylation levels of selected genomic regions. This can be done using a relatively low level of sequencing. Methylation levels can be determined per strand, per molecule, or per region, using the number of methylated sites and the total number of methylatable sites. The methylation levels of various tissues are also analyzed.

Eleven human tissue DNA samples were sequenced to a median of 30.7 million subreads per sample (range: 9.1 to 88.6 million) that could be aligned to the human reference genome (hg19). The subreads of each sample were generated from a median of 3.8 million Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing wells (range: 1.1 to 11.5 million), each of which contained at least one subread that could be aligned to the human reference genome. On average, each molecule in a SMRT well was sequenced 9.9 times (range: 6.5 to 13.4 times). The human tissue DNA samples included one maternal buffy coat sample from a pregnant subject, one placenta sample, two hepatocellular carcinoma (HCC) tumor tissues, two adjacent non-tumor tissues paired with those two HCC tissues, four buffy coat samples from healthy control subjects (M1 and M2 from male subjects, F1 and F2 from female subjects), and one HCC cell line (HepG2). A detailed summary of the sequencing data is shown in FIG. 60.

FIG. 60 shows the different tissue groups in the first column and the sample names in the second column. "Total subreads" indicates the total number of sequences generated from the SMRT wells, including those from the Watson and Crick strands. "Mapped subreads" lists the number of subreads that could be aligned to the human reference genome. "Subread mappability" refers to the proportion of subreads that could be aligned to the human reference genome. "Mean subread depth per SMRT well" indicates the mean number of subreads generated from each SMRT well. "Number of SMRT wells" refers to the number of SMRT wells that generated detectable subreads. "Mappable wells" indicates the number of wells containing at least one alignable subread. "Mappable well rate (%)" is the percentage of wells containing at least one alignable subread.

a) Methylation Level and Pattern Analysis Techniques
In one embodiment, the methylation density of a single nucleic acid strand (e.g., DNA or RNA) can be measured, defined as the number of methylated bases in the strand divided by the total number of methylatable bases in that strand. This measurement is also called the "single-strand methylation level". This single-strand measurement is particularly feasible in the context of the present disclosure because single-molecule real-time sequencing platforms can obtain sequencing information from each of the two strands of a double-stranded DNA molecule. This is facilitated by the use of hairpin adapters when preparing the sequencing library, so that the Watson and Crick strands of a double-stranded DNA molecule are joined in a circular form and sequenced together. Indeed, this structure allows the partner Watson and Crick strands of the same double-stranded DNA molecule to be sequenced in the same reaction, so that the methylation states of corresponding complementary sites on the Watson and Crick strands of any double-stranded DNA molecule can be determined individually and compared directly (e.g., FIGS. 20A and 20B).

These strand-based methylation analyses could not be readily achieved with other techniques. Without the direct methylation analysis methods disclosed in this application, another means of distinguishing methylated from unmethylated bases would need to be applied, for example bisulfite conversion. Bisulfite conversion requires treating the DNA with sodium bisulfite so that methylated and unmethylated cytosines can be distinguished as cytosines and thymines, respectively. Under the denaturing conditions of many bisulfite conversion protocols, the two strands of a double-stranded DNA molecule dissociate from each other. In many sequencing applications, for example those using the Illumina platform, the bisulfite-converted DNA is then amplified by the polymerase chain reaction (PCR), which also involves dissociation of double-stranded DNA into single strands.

In Illumina sequencing, methylated adapters can be used to prepare a sequencing library without PCR prior to bisulfite conversion. Even with this strategy, each DNA strand of a double-stranded DNA molecule is randomly selected for bridge amplification on the flow cell. Because of the random nature of the sequencing, it is unlikely that both strands of the same DNA molecule will be sequenced in the same reaction. Even if two or more reads from the same locus are analyzed in the same run, there is no simple means of determining whether two reads come from the partner Watson and Crick strands of one double-stranded DNA molecule or from two different double-stranded DNA molecules. Such considerations are important because, in certain embodiments of the invention, the two strands of a double-stranded DNA molecule may exhibit different methylation patterns. When the single-strand methylation densities of multiple nucleic acid strands (e.g., DNA or RNA) are measured, a "multi-strand methylation level" can also be determined based on the concept and formula for the "methylation level of a genomic region of interest" in FIG. 61.

FIG. 61 shows various ways of analyzing methylation patterns. A double-stranded DNA molecule (X) with unknown sequence and methylation information is ligated with adapters, forming, in one example, a hairpin-loop structure. As a result, in this example, the two single strands of the DNA molecule, namely the Watson strand X(a) and the Crick strand X(b), are physically joined in a circular form. The methylation states of sites on both the Watson and Crick strands can be obtained using the methods described in this disclosure (e.g., using kinetic, electronic, electromagnetic, or optical signals, or other types of physical signals from a sequencer). The Watson and Crick strands of the circularized DNA molecule can be interrogated in the same reaction. After sequencing, the adapter sequences are removed.

Different methylation levels can be determined from the analysis. In (I) of FIG. 61, the methylation pattern of a single-stranded molecule alone, such as either X(a) or X(b), can be analyzed. This analysis can be referred to as single-strand methylation pattern analysis. The analysis can include, but is not limited to, determining the methylation state of a site or the methylation pattern. In FIG. 61, the single-stranded molecule X(a) shows the methylation pattern 5'-UMMUU-3', where "U" indicates an unmethylated site and "M" indicates a methylated site, whereas its complementary single-stranded molecule X(b) shows the methylation pattern 3'-UMUUU-5'. X(b) therefore has a different methylation pattern from X(a). The corresponding single-strand methylation levels of X(a) and X(b) are 40% and 20%, respectively.

In contrast, as shown in (II), the methylation pattern can be analyzed at the level of a single double-stranded DNA molecule (i.e., taking into account the methylation patterns of both the Watson and Crick strands). This analysis can be referred to as single-molecule double-stranded DNA methylation pattern analysis. The single-molecule double-stranded DNA methylation level of this exemplary molecule X is 30%. In one variant of this analysis, kinetic signals from both the Watson and Crick strands are combined to analyze modifications. In particular, because methylation at CpG sites is generally symmetrical, the kinetic signals from the Watson and Crick strands can be combined for a site before the methylation state of that site is determined. In some situations, determining base modifications using the combined kinetic signals from the Watson and Crick strands of a molecule performs better than using the kinetic signals of each single strand independently. For example, as shown in FIG. 20B, using the combined kinetic signals from both the Watson and Crick strands gives a larger AUC in the test data set (0.90) than using the single strands independently (AUC: 0.85).
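The single-strand and single-molecule double-stranded levels quoted for molecule X can be reproduced with a few lines of code. This is a minimal sketch, assuming a pattern is written as a string of "M" (methylated) and "U" (unmethylated) sites as in FIG. 61; the function names are hypothetical.

```python
def methylation_level(pattern):
    """Fraction of methylated sites in a pattern of 'M' and 'U' characters."""
    sites = [c for c in pattern if c in ("M", "U")]
    if not sites:
        raise ValueError("pattern contains no methylatable sites")
    return sites.count("M") / len(sites)


def duplex_methylation_level(watson, crick):
    """Pool the sites of both strands of one double-stranded molecule."""
    return methylation_level(watson + crick)


x_a = "UMMUU"  # Watson strand X(a), written 5'->3'
x_b = "UMUUU"  # Crick strand X(b), written 3'->5'

print(round(100 * methylation_level(x_a)))              # single-strand level of X(a), in %
print(round(100 * methylation_level(x_b)))              # single-strand level of X(b), in %
print(round(100 * duplex_methylation_level(x_a, x_b)))  # double-stranded level of X, in %
```

Pooling the sites of both strands, rather than averaging the two strand-level percentages, keeps the calculation correct when the strands carry different numbers of methylatable sites.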

In (III) of FIG. 61, the methylation level of a genomic region of interest is determined, where different DNA molecules with different molecular sizes and different numbers of methylatable sites (e.g., CpG sites) can contribute to the genomic region of interest. This analysis can be referred to as multi-strand methylation level analysis. The term "multi-strand" can refer to multiple single-stranded DNA molecules, multiple double-stranded DNA molecules, or any combination thereof. In this example, three double-stranded DNA molecules cover the genomic region of interest: molecule "X", molecule "Y", and molecule "Z", each having an "a" strand and a "b" strand. The corresponding methylation level of this region is 9/28, i.e., 32%. The genomic region analyzed can have a size of 1 nt, 10 nt, 20 nt, 30 nt, 40 nt, 50 nt, 100 nt, 1 knt (kilonucleotide, i.e., 1,000 nucleotides), 2 knt, 3 knt, 4 knt, 5 knt, 10 knt, 20 knt, 30 knt, 40 knt, 50 knt, 100 knt, 200 knt, 300 knt, 400 knt, 500 knt, 1 Mnt (meganucleotide, i.e., one million nucleotides), 2 Mnt, 3 Mnt, 4 Mnt, 5 Mnt, 10 Mnt, 20 Mnt, 30 Mnt, 40 Mnt, 50 Mnt, 100 Mnt, or 200 Mnt. The genomic region can also be a chromosome arm or the entire genome.
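The regional (multi-strand) level pools sites across all strands that cover the region. The sketch below is an illustration under stated assumptions: the per-strand counts for X(a) and X(b) follow the patterns given for FIG. 61, while the counts for molecules Y and Z are hypothetical values chosen only so that the pooled total matches the 9/28 quoted in the text.

```python
def pooled_methylation_level(strand_counts):
    """Pooled level over strands given as (methylated_sites, methylatable_sites) pairs."""
    counts = list(strand_counts)
    methylated = sum(m for m, _ in counts)
    total = sum(n for _, n in counts)
    if total == 0:
        raise ValueError("no methylatable sites in the region")
    return methylated / total


strands = [
    (2, 5),  # X(a): UMMUU, from the text
    (1, 5),  # X(b): UMUUU, from the text
    (2, 5),  # Y(a): hypothetical
    (2, 5),  # Y(b): hypothetical
    (1, 4),  # Z(a): hypothetical
    (1, 4),  # Z(b): hypothetical
]
region_level = pooled_methylation_level(strands)
print(f"{100 * region_level:.0f}%")
```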

A methylation pattern can also be determined after the methylation states of the sites within a molecule have been determined. For example, in a scenario with three consecutive CpG sites on a single double-stranded DNA molecule, the methylation pattern of each of the Watson and Crick strands may be found to be methylated (M), unmethylated (N), and methylated (M) across the three sites. This pattern, e.g., MNM for the Watson strand, can be referred to as the "methylation haplotype" of the Watson strand in this region. Because of DNA methylation maintenance activity, the methylation patterns of the Watson and Crick strands of a double-stranded DNA molecule can be complementary to each other. For example, if a CpG site on the Watson strand is methylated, the complementary CpG site on the Crick strand is likely to be methylated as well. Similarly, an unmethylated CpG site on the Watson strand is likely to be paired with an unmethylated CpG site on the Crick strand.

In one embodiment, the methylation level of a single DNA molecule can be measured, defined as the number of methylated bases or nucleotides in the molecule divided by the total number of methylatable bases or nucleotides in that molecule. This measurement is also called the "single-molecule methylation level". This single-molecule measurement can be particularly useful in the context of the present disclosure because of the long read lengths achievable on single-molecule real-time sequencing platforms. When the single-molecule methylation levels of multiple DNA molecules are measured, a "multiple-molecule methylation level" can also be determined based on the concept and formula of FIG. 61. For example, the "multiple-molecule methylation level" can be the mean or median of the single-molecule methylation levels.

In some embodiments, one or more genetic polymorphisms (e.g., single nucleotide polymorphisms (SNPs)) can be analyzed on a DNA molecule together with the methylation states of sites on that molecule, thereby revealing both the genetic and the epigenetic information of the molecule. Such an analysis reveals the "phased methylation haplotype" of the analyzed DNA molecule. Phased methylation haplotype analysis is useful, for example, for studying genomic imprinting and cell-free nucleic acids in maternal plasma (which contains a mixture of cell-free DNA molecules carrying maternal and fetal genetic and epigenetic signatures).

b) Comparison of Methylation Results
The genome-wide methylation densities of the tissues in the table of FIG. 60 were determined using bisulfite sequencing and using single-molecule real-time sequencing as described in this disclosure. FIG. 62A shows the methylation density quantified by bisulfite sequencing on the y-axis and the tissue type on the x-axis. FIG. 62B shows the methylation density quantified by single-molecule real-time sequencing as described in this disclosure on the y-axis and the tissue type on the x-axis.

FIG. 62A shows the methylation densities across different tissues determined by bisulfite sequencing (i.e., samples were bisulfite converted and then subjected to Illumina sequencing) (Lister et al. Nature. 2009;462:315-322), including HepG2, HCC tumor tissues, normal liver tissues adjacent to the matched HCC tumors (i.e., adjacent normal tissues), placental tissue, and buffy coat samples. HepG2 showed the lowest methylation level, at 40.4%. The buffy coat samples showed the highest methylation level, at 76.5%. The mean methylation density of the HCC tumor tissues (51.2%) was lower than that of the matched adjacent normal tissues (71.0%). This is consistent with the expectation that HCC tumors are hypomethylated at the genome-wide level compared with adjacent normal tissue (Ross et al. Epigenomics. 2010;2:245-69). The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 1.0.

Portions of the same tissues were subjected to single-molecule real-time sequencing and to methylation analysis using the methods of the present disclosure. The results are shown in FIG. 62B. The methylation analysis using the single-molecule real-time sequencing method of the present disclosure showed that the HepG2 cell line was the most hypomethylated, followed by the HCC tumor tissues and then the placental tissue. The adjacent non-tumor liver tissue samples were more methylated than the HCC and placental tissues, and the buffy coat samples were the most highly methylated.

FIGS. 63A, 63B, and 63C show the correlation between the overall methylation levels quantified by bisulfite sequencing and by single-molecule real-time sequencing according to the methods described herein. FIG. 63A shows the methylation level quantified by bisulfite sequencing on the x-axis and the methylation level quantified by single-molecule real-time sequencing using the methods described herein on the y-axis. The solid black line is the fitted regression line. The dashed line is the line on which the two measurements are equal.

The methylation levels determined by bisulfite sequencing and by single-molecule real-time sequencing according to the invention disclosed herein were very highly correlated (r = 0.99, P value < 0.0001). These data showed that methylation analysis using the single-molecule real-time sequencing method disclosed herein was an effective means of determining methylation levels across tissues and allowed the methylation states and methylation profiles of these tissues to be compared. It was noted that the slope of the regression line in FIG. 63A deviated from 1 for the two measurements of methylation level. These results suggested that a deviation between the two measurements (in some contexts referred to as a bias) may be present when methylation levels are determined by single-molecule real-time sequencing according to the present disclosure, compared with conventional massively parallel bisulfite sequencing.

In one embodiment, linear or LOESS (locally weighted smoothing) regression can be used to quantify the bias. As an example, if massively parallel bisulfite sequencing (Illumina) is taken as the reference, the results determined by single-molecule real-time sequencing according to the present disclosure can be transformed using the regression coefficients, harmonizing the readouts between the different platforms. In FIG. 63A, the linear regression equation is Y = aX + b, where "Y" represents the methylation level determined by single-molecule real-time sequencing according to the present disclosure, "X" represents the methylation level determined by bisulfite sequencing, "a" represents the slope of the regression line (e.g., a = 0.62), and "b" represents the y-axis intercept (e.g., b = 17.72). In this case, the adjusted methylation value determined by single-molecule real-time sequencing is calculated as (Y − b)/a. In another embodiment, the relationship between the deviation of the two measurements (ΔM) and the corresponding average of the two measurements ($\bar{M}$) can be used, as defined by equations (1) and (2) below.

$$\Delta M = S - \text{bisulfite-based methylation} \qquad (1)$$

$$\bar{M} = \frac{S + \text{bisulfite-based methylation}}{2} \qquad (2)$$

where "S" represents the methylation level determined by single-molecule real-time sequencing according to the present invention, and "bisulfite-based methylation" represents the methylation level determined by bisulfite sequencing.
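The regression-based adjustment can be sketched as follows. The slope a = 0.62 and intercept b = 17.72 are the example values quoted for FIG. 63A; the function names are hypothetical, and the ordinary least-squares fit is an assumed way of obtaining such coefficients from paired measurements, not a description of how the reported values were computed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired measurements")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def adjust_to_bisulfite_scale(smrt_level, a=0.62, b=17.72):
    """Invert Y = a*X + b: map a SMRT-derived level Y (%) onto the bisulfite scale X (%)."""
    if a == 0:
        raise ValueError("slope must be non-zero")
    return (smrt_level - b) / a


# A SMRT-derived level lying exactly on the example regression line at X = 76.5%.
y = 0.62 * 76.5 + 17.72
print(round(adjust_to_bisulfite_scale(y), 1))
```

Because the transformation simply inverts the fitted line, it is only as reliable as the fit itself and is best applied within the range of methylation levels used to derive the coefficients.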

FIG. 63B shows the relationship between ΔM and $\bar{M}$. The average of the two measurements ($\bar{M}$) is plotted on the x-axis, and the deviation between the two measurements (ΔM) is plotted on the y-axis. The dashed line is the horizontal line through zero; data points on this line would indicate no difference between the two measurements. These results suggested that the deviation varied with the average: the higher the average of the two measurements, the greater the magnitude of the deviation. The median ΔM value was −8.5% (range: −12.6% to +2.5%), suggesting that a discrepancy exists between the methods.

FIG. 63C shows the average of the two measurements ($\bar{M}$) on the x-axis and the relative deviation (RD) on the y-axis. The relative deviation is defined by the following equation, where ΔM is the deviation between the two measurements:

$$RD = \frac{\Delta M}{\bar{M}} \times 100\%$$

The dashed line is the horizontal line through zero; data points on this line would indicate no difference between the two measurements. These results suggested that the relative deviation varied with the average: the larger the average of the two measurements, the greater the magnitude of the relative deviation. The median RD value was −12.5% (range: −18.1% to +6.0%).
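The three comparison quantities plotted in FIGS. 63B and 63C can be computed per sample as sketched below. This is an illustration under stated assumptions: the deviation is taken as the single-molecule real-time value minus the bisulfite value, the average as their arithmetic mean, and the relative deviation as the deviation divided by the average, expressed as a percentage, which is consistent with the signs of the ranges reported in the text. The function name and the input values are hypothetical.

```python
def deviation_metrics(smrt_level, bisulfite_level):
    """Return (delta_m, mean_m, relative_deviation_pct) for one paired measurement.

    Assumed definitions:
      delta_m = smrt_level - bisulfite_level
      mean_m  = (smrt_level + bisulfite_level) / 2
      rd      = delta_m / mean_m * 100
    """
    delta_m = smrt_level - bisulfite_level
    mean_m = (smrt_level + bisulfite_level) / 2
    if mean_m == 0:
        raise ValueError("relative deviation is undefined when the mean is zero")
    return delta_m, mean_m, 100.0 * delta_m / mean_m


# Hypothetical paired levels (in %), for illustration only.
delta_m, mean_m, rd = deviation_metrics(smrt_level=65.0, bisulfite_level=75.0)
print(delta_m, mean_m, round(rd, 1))
```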

Conventional whole-genome bisulfite sequencing (Illumina) has been reported to introduce significantly biased sequence output and overestimated global methylation, with substantial variation between methods in the quantification of methylation levels at specific genomic regions (Olova et al. Genome Biol. 2018;19:33). The methods disclosed herein can be performed without bisulfite conversion, which drastically degrades DNA, and without PCR amplification, which can complicate the process or introduce additional errors into the determination of methylation levels.

FIGS. 64A and 64B show methylation patterns at 1 Mb resolution. FIG. 64A shows the methylation pattern of the HCC cell line (HepG2). FIG. 64B shows the methylation pattern of a buffy coat sample from a healthy control subject. The chromosome ideograms (the outermost ring in each figure) are arranged clockwise from the p-terminus to the q-terminus. The second ring from the outside (also called the middle ring) shows the methylation levels determined by bisulfite sequencing. The innermost ring shows the methylation levels determined by single-molecule real-time sequencing according to the present disclosure. Methylation levels are classified into five grades: 0-20% (light green), 20-40% (green), 40-60% (blue), 60-80% (light red), and 80-100% (red). As shown in FIGS. 64A and 64B, the methylation profiles at 1 Mb resolution were consistent between bisulfite sequencing (middle track) and single-molecule real-time sequencing according to the present disclosure (innermost track). The methylation level of the maternal buffy coat sample was shown to be higher than that of the HCC cell line (HepG2).
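The windowed methylation levels and the five colour grades used in these figures can be sketched as follows. This is a minimal illustration, assuming per-site calls are available as (chromosome, position, methylated) tuples and that each grade interval includes its lower bound and excludes its upper bound, with 100% falling in the top grade; the text does not state how boundary values are assigned, and the function names and example calls are hypothetical.

```python
from collections import defaultdict

# (exclusive upper bound in %, colour); boundary handling is an assumption.
GRADE_UPPER_BOUNDS = [
    (20, "light green"),
    (40, "green"),
    (60, "blue"),
    (80, "light red"),
    (100, "red"),
]


def methylation_grade(level_pct):
    """Map a methylation level in percent to one of the five colour grades."""
    if not 0 <= level_pct <= 100:
        raise ValueError("methylation level must lie in [0, 100]")
    for upper, colour in GRADE_UPPER_BOUNDS:
        if level_pct < upper:
            return colour
    return "red"  # exactly 100%


def windowed_levels(calls, window=1_000_000):
    """Methylation level (%) per (chromosome, window index) from per-site calls."""
    counts = defaultdict(lambda: [0, 0])  # key -> [methylated, total]
    for chrom, pos, methylated in calls:
        key = (chrom, pos // window)
        counts[key][0] += 1 if methylated else 0
        counts[key][1] += 1
    return {key: 100.0 * m / n for key, (m, n) in counts.items()}


# Hypothetical per-site calls, for illustration only.
calls = [("chr1", 10, True), ("chr1", 500_000, False), ("chr1", 1_200_000, True)]
levels = windowed_levels(calls)
grades = {key: methylation_grade(value) for key, value in levels.items()}
print(levels, grades)
```

Passing `window=100_000` gives the 100 kb resolution used in the later scatter plots.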

FIGS. 65A and 65B show scatter plots of methylation levels measured at 1 Mb resolution. FIG. 65A shows the methylation levels of the HCC cell line (HepG2). FIG. 65B shows the methylation levels of a buffy coat sample from a healthy control subject. In both FIG. 65A and FIG. 65B, the methylation levels quantified by bisulfite sequencing are on the x-axis and the methylation levels measured by single-molecule real-time sequencing according to the present disclosure are on the y-axis. The solid line is the fitted regression line. The dashed line is the line on which the two measurement techniques are equal. For the HCC cell line, the methylation levels determined by single-molecule real-time sequencing at 1 Mb resolution correlated well with those measured by bisulfite sequencing (r = 0.99, P < 0.0001) (FIG. 65A). A correlation was also observed for the data from the buffy coat sample (r = 0.87, P < 0.0001) (FIG. 65B).
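Agreement between the two methods across genomic windows can be summarized with a correlation coefficient. The sketch below assumes the reported r is a Pearson correlation, which the text does not state explicitly; the function name and the per-window values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-window methylation levels (%), one value per 1 Mb window.
bisulfite = [35.0, 42.0, 50.0, 61.0, 70.0]
smrt = [39.0, 44.0, 49.0, 55.0, 62.0]
r = pearson_r(bisulfite, smrt)
print(round(r, 3))
```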

FIGS. 66A and 66B show scatter plots of methylation levels measured at 100 kb resolution. FIG. 66A shows the methylation levels of the HCC cell line (HepG2). FIG. 66B shows the methylation levels of a buffy coat sample from a healthy control subject. In both FIG. 66A and FIG. 66B, the methylation levels quantified by bisulfite sequencing are on the x-axis and the methylation levels measured by single-molecule real-time sequencing according to the present disclosure are on the y-axis. The solid line is the fitted regression line. The dashed line is the line on which the two measurement techniques are equal. The high degree of correlation between the methylation measurements of the two methods seen at 1 Mb (or 1 Mnt) resolution was also observed when the resolution of the analysis was increased to 100 kb (or 100 knt) windows. All of these data indicate that the single-molecule real-time approach of the present disclosure is an effective tool for quantifying methylation levels or methylation densities within genomic regions at different degrees of resolution, e.g., 1 Mb (or 1 Mnt) or 100 kb (or 100 knt). The data also indicate that the present invention is an effective tool for assessing methylation profiles or methylation patterns between regions or between samples.

FIGS. 67A and 67B show methylation patterns at 1 Mb resolution. FIG. 67A shows the methylation pattern of an HCC tumor tissue (TBR3033T). FIG. 67B shows the methylation pattern of the adjacent normal tissue (TBR3033N). The chromosome ideograms (the outermost ring in each figure) are arranged clockwise from the p-terminus to the q-terminus. The second ring from the outside (also called the middle ring) shows the methylation levels determined by bisulfite sequencing. The innermost ring shows the methylation levels determined by single-molecule real-time sequencing according to the present disclosure. Methylation levels are classified into five grades: 0-20% (light green), 20-40% (green), 40-60% (blue), 60-80% (light red), and 80-100% (red). As shown in FIG. 67A, hypomethylation in the HCC tumor tissue DNA (TBR3033T) could be detected and distinguished from the adjacent normal liver tissue DNA (TBR3033N) in FIG. 67B. The methylation levels and methylation patterns determined by bisulfite sequencing (middle track) and by single-molecule real-time sequencing according to the present disclosure (innermost track) were consistent. The methylation level of the adjacent normal tissue DNA was shown to be higher than that of the HCC tumor tissue DNA.

FIGS. 68A and 68B show scatter plots of methylation levels measured at 1 Mb resolution. FIG. 68A shows the methylation levels of the HCC tumor tissue (TBR3033T). FIG. 68B shows the methylation levels of the adjacent normal tissue. In both FIG. 68A and FIG. 68B, the methylation levels quantified by bisulfite sequencing are on the x-axis and the methylation levels measured by single-molecule real-time sequencing according to the present disclosure are on the y-axis. The solid line is the fitted regression line. The dashed line is the line on which the two measurement techniques are equal. For the HCC tumor tissue DNA, the methylation levels measured by single-molecule real-time sequencing at 1 Mb resolution correlated well with those determined by bisulfite sequencing (r = 0.96, P value < 0.0001) (FIG. 68A). The data from the adjacent normal liver tissue sample were also correlated (r = 0.83, P value < 0.0001) (FIG. 68B).

FIGS. 69A and 69B show scatter plots of methylation levels measured at 100 kb resolution. FIG. 69A shows the methylation levels of the HCC tumor tissue (TBR3033T). FIG. 69B shows the methylation levels of the adjacent normal tissue (TBR3033N). In both FIG. 69A and FIG. 69B, the methylation levels quantified by bisulfite sequencing are on the x-axis and the methylation levels measured by single-molecule real-time sequencing according to the present disclosure are on the y-axis. The solid line is the fitted regression line. The dashed line is the line on which the two measurement techniques are equal. The high degree of correlation between the methylation quantification data of the two methods seen at 1 Mb resolution was also observed when the methylation levels were measured at a higher resolution, e.g., in 100 kb windows.

FIGS. 70A and 70B show the methylation patterns of another pair of tumor and normal tissues at 1 Mb resolution. FIG. 70A shows the methylation pattern of an HCC tumor tissue (TBR3032T). FIG. 70B shows the methylation pattern of the adjacent normal tissue (TBR3032N). The chromosome ideograms (the outermost ring in each figure) are arranged clockwise from the p-terminus to the q-terminus. The second ring from the outside (also called the middle ring) shows the methylation levels determined by bisulfite sequencing. The innermost ring shows the methylation levels determined by single-molecule real-time sequencing according to the present disclosure. Methylation levels are classified into five grades: 0-20% (light green), 20-40% (green), 40-60% (blue), 60-80% (light red), and 80-100% (red). As shown in FIG. 70A, hypomethylation in the HCC tumor tissue DNA (TBR3032T) could be detected and distinguished from the adjacent normal liver tissue DNA (TBR3032N) in FIG. 70B. The methylation levels and methylation patterns determined by bisulfite sequencing (middle track) and by single-molecule real-time sequencing using the present invention (innermost track) were consistent. The methylation level of the adjacent normal tissue DNA was shown to be higher than that of the HCC tumor tissue DNA.

FIGS. 71A and 71B show scatter plots of methylation levels measured at 1-Mb resolution. FIG. 71A shows methylation levels for HCC tumor tissue (TBR3032T). FIG. 71B shows methylation levels for the adjacent normal tissue. In both FIG. 71A and FIG. 71B, the methylation level quantified by bisulfite sequencing is on the x-axis and the methylation level measured by single-molecule real-time sequencing according to the present disclosure is on the y-axis. The solid line is the fitted regression line. The dashed line marks where the two measurement techniques are equal. For the HCC tumor tissue DNA, the methylation levels measured by single-molecule real-time sequencing at 1-Mb resolution correlated well with those determined by bisulfite sequencing (r=0.98, P<0.0001) (FIG. 71A). The data from the adjacent normal liver tissue sample were also correlated (r=0.87, P<0.0001) (FIG. 71B).
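The correlation coefficient r quoted for these scatter plots can be illustrated with a short calculation. The following is a minimal sketch, assuming r is a Pearson correlation computed over per-window methylation levels from the two methods; the function name `pearson_r` and the five window values are hypothetical and are not data from the figures.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical methylation levels (%) for five windows, one value per window.
bisulfite = [72.0, 65.5, 80.1, 55.0, 60.2]   # x-axis method
smrt = [70.5, 66.0, 78.9, 57.3, 61.0]        # y-axis method
r = pearson_r(bisulfite, smrt)
print(f"r = {r:.3f}")
```

A value of r close to 1 means the points lie close to a straight line; agreement with the dashed line of equality additionally requires a slope near 1 and an intercept near 0, which r alone does not show.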

FIGS. 72A and 72B show scatter plots of methylation levels measured at 100-kb resolution. FIG. 72A shows methylation levels for HCC tumor tissue (TBR3032T). FIG. 72B shows methylation levels for the adjacent normal tissue (TBR3032N). In both FIG. 72A and FIG. 72B, the methylation level quantified by bisulfite sequencing is on the x-axis and the methylation level measured by single-molecule real-time sequencing according to the present disclosure is on the y-axis. The solid line is the fitted regression line. The dashed line marks where the two measurement techniques are equal. The high degree of correlation between the two methods seen at 1-Mb resolution was also observed when methylation levels were measured at a higher resolution, e.g., in 100-kb windows.

4. Differentially Methylated Regions Between Tumor and Adjacent Normal Tissue
Methylomic aberrations are frequently found in regions of cancer genomes. One example of such aberrations is hypomethylation and hypermethylation of selected genomic regions (Cadieux et al. Cancer Res. 2006;66:8469-76; Graff et al. Cancer Res. 1995;55:5195-9; Costello et al. Nat Genet. 2000;24:132-8). Another example is an aberrant pattern of methylated and unmethylated bases in selected genomic regions. This section shows that the techniques for determining methylation can be used for quantitative analysis and diagnosis when analyzing tumors.

FIG. 73 shows an example of an aberrant methylation pattern near the tumor suppressor gene CDKN2A. The coordinates are highlighted in blue, and underlining indicates CpG islands. Filled dots indicate methylated sites. Unfilled dots indicate unmethylated sites. The values in parentheses to the right of each horizontal line of dots give the fragment size, the single-molecule methylation density, and the number of CpG sites. For example, (3.3 kb, MD: 17.9%, CG: 39) means that the fragment is 3.3 kb in size, its methylation level is 17.9%, and it contains 39 CpG sites. MD stands for methylation density.

As shown in FIG. 73, the CDKN2A (cyclin-dependent kinase inhibitor 2A) gene encodes two proteins, INK4A (p16) and ARF (p14), that act as tumor suppressors. Two molecules (molecule 7301 and molecule 7302) from the non-tumor tissue adjacent to the tumor tissue covered a region overlapping the CDKN2A gene. The single-molecule methylation levels of the double-stranded DNA molecules 7301 and 7302 were 17.9% and 7.6%, respectively. In contrast, the single-molecule methylation level of the double-stranded DNA molecule 7303 present in the tumor tissue was 93.9%, much higher than that of the molecules present in the paired adjacent non-tumor tissue. A multi-molecule methylation level can also be calculated from molecules 7301 and 7302 present in the non-tumor tissue adjacent to the tumor tissue. The resulting multi-molecule methylation level was 9.7%, which was lower than the methylation level in the tumor tissue (93.9%). These differing methylation levels suggest that single-molecule methylation levels and/or multi-molecule methylation levels can be used to detect or monitor diseases such as cancer.
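The two kinds of level discussed above can be sketched in a few lines. This is an illustrative sketch only, assuming that a methylation level is the number of CpG sites called methylated divided by the number of CpG sites assessed, and that a multi-molecule level pools the site counts of all molecules; the function names are hypothetical. The first molecule uses 7 methylated calls out of 39 sites, a count inferred from the label (MD: 17.9%, CG: 39); the second molecule is invented for the example.

```python
def methylation_level(calls):
    """Percent of CpG sites called methylated on one molecule.

    calls: list of booleans, True meaning the site was called methylated.
    """
    if not calls:
        raise ValueError("molecule has no CpG calls")
    return 100.0 * sum(calls) / len(calls)

def multi_molecule_level(molecules):
    """Percent methylated after pooling the CpG calls of several molecules."""
    total = sum(len(calls) for calls in molecules)
    if total == 0:
        raise ValueError("no CpG calls to pool")
    methylated = sum(sum(calls) for calls in molecules)
    return 100.0 * methylated / total

# 7 of 39 sites methylated reproduces the 17.9% of the example label.
molecule_a = [True] * 7 + [False] * 32
# Hypothetical second molecule: 2 of 25 sites methylated (8.0%).
molecule_b = [True] * 2 + [False] * 23

print(round(methylation_level(molecule_a), 1))
print(round(methylation_level(molecule_b), 1))
print(round(multi_molecule_level([molecule_a, molecule_b]), 1))
```

Pooling weights each molecule by its number of CpG sites, so the pooled value is generally not the simple mean of the per-molecule percentages. The reported 9.7% is likewise not the simple mean of 17.9% and 7.6% (12.75%), which is consistent with, though does not prove, a site-weighted calculation.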

FIGS. 74A and 74B show differentially methylated regions detected by single-molecule real-time sequencing according to embodiments of the present invention. FIG. 74A shows hypomethylation in the cancer genome. FIG. 74B shows hypermethylation in the cancer genome. The x-axis indicates the coordinates of the CpG sites. The coordinates are highlighted in blue, and underlining indicates CpG islands. Filled dots indicate methylated sites. Unfilled dots indicate unmethylated sites. The values in parentheses to the right of each horizontal line of dots give the fragment size, the fragment-level methylation density, and the number of CpG sites. For example, (3.1 kb, MD: 88.9%, CG: 180) means that the fragment is 3.1 kb in size, its methylation density is 88.9%, and it contains 180 CpG sites.

FIG. 74A shows a region near the GNAS gene that displayed more hypomethylated fragments in the HCC tumor tissue than in the adjacent normal liver tissue. FIG. 74B shows a region near the ESR1 gene that displayed hypermethylated fragments in the HCC tissue, whereas DNA fragments from the paired adjacent non-tumor tissue that aligned to the corresponding region were hypomethylated. As shown in FIG. 74B, the methylation profiles, or methylation haplotypes, of individual DNA molecules were sufficient to reveal the aberrant methylation status of these genomic regions, namely GNAS and ESR1, when the cancer sample was compared with the non-cancer sample.

These data show that the single-molecule real-time sequencing methylation analysis disclosed herein can determine the methylation status (methylated or unmethylated) of each CpG site on an individual DNA fragment. Read lengths for single-molecule real-time sequencing are much longer (on the order of kilobases) than those of Illumina sequencing, which typically span 100-300 nt per read (De Maio et al. Microb Genom. 2019;5(9)). Combining the long-read property of single-molecule real-time sequencing with the methylation analysis methods disclosed herein makes it straightforward to determine the methylation haplotype of the multiple CpG sites present along any single DNA molecule. A methylation profile refers to the methylation status of the CpG sites from one genomic coordinate to another within a contiguous stretch of DNA (e.g., on the same chromosome, within a bacterial plasmid, or within a single stretch of DNA in a virus).

Because single-molecule real-time sequencing analyzes each DNA molecule individually without prior amplification, the methylation profile determined for an individual DNA molecule is in effect a methylation haplotype, that is, the methylation status of the CpG sites from one end of the same DNA molecule to the other. When one or more molecules are sequenced from the same genomic region, the % methylation (i.e., methylation level or methylation density) of each CpG site, for all sequenced CpG sites in that region, can be aggregated from the data of the multiple DNA fragments using the same formula shown in FIG. 61. The % methylation of each CpG site is reported for all sequenced CpG sites, which provides the methylation profile of the sequenced genomic region. Alternatively, the data from all reads and all sites within the sequenced genomic region can be aggregated to give a single % methylation value for the region, in the same manner in which the methylation levels of 1-Mb or 1-kb regions were calculated as shown in FIGS. 64-72.
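The two aggregation modes described above, a per-site profile and a single regional value, can be sketched as follows. This is an assumption-laden illustration: the formula of FIG. 61 is not reproduced in this passage, so the sketch takes % methylation to be methylated calls divided by total calls. The function names, the dictionary representation, and the coordinates are hypothetical.

```python
from collections import defaultdict

def per_site_methylation(molecules):
    """Percent methylation at each CpG coordinate across overlapping molecules.

    molecules: list of dicts mapping CpG coordinate -> True (methylated)
    or False (unmethylated), one dict per sequenced molecule.
    """
    counts = defaultdict(lambda: [0, 0])  # coordinate -> [methylated, total]
    for calls in molecules:
        for position, is_methylated in calls.items():
            counts[position][0] += int(is_methylated)
            counts[position][1] += 1
    return {pos: 100.0 * m / n for pos, (m, n) in sorted(counts.items())}

def regional_methylation(molecules):
    """Single percent methylation value pooling all reads and all sites."""
    total = sum(len(calls) for calls in molecules)
    if total == 0:
        raise ValueError("no CpG calls in region")
    methylated = sum(sum(calls.values()) for calls in molecules)
    return 100.0 * methylated / total

# Three hypothetical molecules covering partly overlapping CpG sites.
molecules = [
    {100: True, 150: True, 220: False},
    {150: True, 220: False, 300: True},
    {100: False, 150: True},
]
profile = per_site_methylation(molecules)
region_level = regional_methylation(molecules)
print(profile)
print(region_level)
```

The per-site profile keeps positional information (which sites differ), while the regional value trades that detail for a more stable estimate when coverage per site is low.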

5. Methylation Analysis of Viral DNA
This section shows that the methylation techniques of the present disclosure can be used to accurately determine the methylation levels of viral DNA.

FIG. 75 shows the methylation patterns of hepatitis B virus (HBV) DNA in two pairs of HCC tissue samples and adjacent non-tumor tissue samples, obtained using single-molecule real-time sequencing. Each arrow represents a gene annotation of the HBV genome. The arrows labeled "P", "S", "X", and "C" indicate the genes encoding the polymerase, surface antigen, X protein, and core protein, respectively. The inventors identified one fragment of 1,183 bp (molecule I, spanning HBV genome positions 2,278-3,141, highlighted by the dashed rectangle) derived from the adjacent non-tumor tissue, which showed a methylation level of 12%. They also identified three fragments of 3,215 bp, 2,961 bp, and 3,105 bp (molecule II, molecule III, and molecule IV) derived from the tumor tissue. Of these, two fragments from the HCC tumor (molecule III and molecule IV) overlapped the HBV genomic region spanned by molecule I from the non-tumor tissue. In contrast to the low methylation level (12%) of the HBV region highlighted by the dashed rectangle (HBV genome positions 2,278-3,141), the methylation levels of those fragments from the HCC tissue (molecule III and molecule IV) were higher (i.e., 24% and 30%). These results suggest that an approach using single-molecule real-time sequencing is feasible for determining the methylation pattern of a viral genome and can identify differentially methylated regions (DMRs) of HBV between HCC and non-HCC tissues. Determining the methylation status of an entire viral genome using single-molecule real-time sequencing according to the present disclosure would therefore provide a new tool for studying clinical relevance using tissue biopsies.

This DMR happened to overlap genes P, C, and S. This region has also been reported to be hypermethylated in HCC tissue compared with liver tissue that had HBV infection but no cancer (Jain et al. Sci Rep. 2015;5:10478; Fernandez et al. Genome Res. 2009;19:438-51).

The inventors pooled the bisulfite sequencing results of liver tissue from four patients with cirrhosis but without HCC and obtained 1,156 HBV fragments for methylation analysis. FIG. 76A shows the methylation levels of hepatitis B virus DNA in liver tissue from patients with cirrhosis but without HCC. In addition, the bisulfite sequencing results of HCC tumor tissue from 15 patients were pooled, yielding 736 HBV fragments for methylation analysis. FIG. 76B shows the methylation levels of hepatitis B virus DNA in HCC tumor tissue. As shown in FIGS. 76A and 76B, massively parallel bisulfite sequencing also revealed an HBV DMR (HBV genome positions 1,982-2,435) with higher methylation levels in HCC tissue than in cirrhotic liver tissue. These results suggest that the approach for determining the methylation status of a viral genome is valid.

6. Variant-Associated Methylation Analysis
Different alleles can be associated with different methylation profiles. For example, an imprinted gene may have one allele with a higher methylation level than the other allele. This section shows that methylation profiles can be used to distinguish the alleles of a particular genomic region.

One single-molecule real-time sequencing well containing a single DNA template generates several subreads. The subreads contain kinetic features [e.g., interpulse duration (IPD) and pulse width (PW)] and the nucleotide composition. In one embodiment, the subreads from one single-molecule real-time sequencing well can be used to generate a consensus sequence (also called a circular consensus sequence, CCS), which can dramatically reduce sequencing errors (e.g., mismatches, insertions, or deletions). Details of CCS are described herein. In one embodiment, the consensus sequence can be constructed using the subreads aligned to the human reference genome. In another embodiment, the consensus sequence can be constructed by mapping the subreads to the longest subread in the same single-molecule real-time sequencing well.

FIG. 77 illustrates the principle of phased methylation haplotype analysis. Filled lollipops represent CpG sites classified as methylated. Unfilled lollipops represent CpG sites classified as unmethylated.

In the embodiment shown in FIG. 77, the subreads were aligned to the human reference genome. The aligned subreads from one single-molecule real-time sequencing well were collapsed to form a consensus sequence. In general, the consensus sequence can be determined by taking the most frequent nucleotide present in the subreads at each aligned position. Nucleotide variants, including but not limited to single nucleotide polymorphisms, insertions, and deletions, could thus be identified from the consensus sequence. The averaged IPD and PW values within the same molecule tagged by a nucleotide variant can be used to determine the methylation pattern according to the present disclosure, so that variant-associated methylation patterns can be determined. The methylation states along the same molecule can be regarded as a methylation haplotype. A methylation haplotype may not be readily and directly constructed from two or more short DNA molecules, because no molecular marker exists that can distinguish whether two or more fragmented short DNA molecules derive from a single original molecule or from two or more different original molecules. Synthetic long-read technologies (such as the linked-read sequencing developed by 10x Genomics) make it possible to distribute single long DNA molecules into partitions (e.g., droplets) and to tag the short DNA molecules derived from one long DNA molecule with the same molecular barcode sequence. However, this barcoding step involves PCR amplification, which does not preserve the original methylation states.
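The rule of taking the most frequent nucleotide at each aligned position can be sketched as a column-wise majority vote. This is a simplified illustration and not the consensus algorithm of any sequencing software: it assumes the subreads have already been aligned to the same coordinates and padded to equal length, with "-" marking positions a subread does not cover, and it handles substitutions only. The function names and sequences are hypothetical.

```python
from collections import Counter

def consensus(aligned_subreads):
    """Column-wise majority vote over equal-length aligned subreads.

    '-' marks a position a subread does not cover. Positions covered by
    no subread are reported as 'N'. Ties go to the base seen first.
    """
    if not aligned_subreads:
        raise ValueError("no subreads given")
    length = len(aligned_subreads[0])
    if any(len(s) != length for s in aligned_subreads):
        raise ValueError("subreads must be aligned to equal length")
    result = []
    for column in zip(*aligned_subreads):
        bases = [b for b in column if b != "-"]
        result.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(result)

def substitutions(consensus_seq, reference_seq):
    """List (index, reference base, consensus base) where the two differ."""
    return [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference_seq, consensus_seq))
        if alt != "N" and ref != alt
    ]

subreads = ["ACGTAC", "ACGTTC", "ACG-AC"]   # hypothetical aligned subreads
reference = "ACGTGC"                        # hypothetical reference segment
molecule_consensus = consensus(subreads)
variants = substitutions(molecule_consensus, reference)
print(molecule_consensus, variants)
```

A variant found this way acts as the tag that links the molecule, and the methylation calls made along it, to one allele.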

Furthermore, when long DNA molecules are treated with bisulfite, the first step before the bisulfite treatment involves DNA denaturation under harsh conditions that convert double-stranded DNA into single-stranded DNA, because bisulfite acts only on single-stranded DNA under certain chemical conditions. This denaturation step degrades long DNA molecules into short fragments, so the original methylation haplotype information is lost. A second drawback of bisulfite-based methylation analysis is that the bisulfite conversion step denatures double-stranded DNA into single strands, namely the Watson and Crick strands. For a given molecule, there is a 50% chance of sequencing the Watson strand and a 50% chance of sequencing the Crick strand. Among millions of Watson and Crick strands, the chance of sequencing both the Watson and Crick strands of the same molecule is very low. Even supposing that both strands of a molecule were sequenced, it would still be impossible to determine with certainty whether such Watson and Crick strands derived from a single original fragment or from two or more different original fragments. Liu et al. recently introduced a bisulfite-free sequencing method that uses ten-eleven translocation (TET) enzyme-based conversion to detect methylated cytosine and hydroxymethylcytosine under mild conditions that reduce DNA degradation (Liu et al. Nat Biotechnol. 2019;37:424-429). However, the enzymatic reaction involves two sequential steps, and a low conversion rate at either step would dramatically affect the overall conversion rate. Moreover, even with this bisulfite-free sequencing method for detecting methylated cytosine, it remains difficult to distinguish the Watson and Crick strands of a molecule from the sequencing results.

In contrast, in embodiments of the present invention, the Watson and Crick strands of a molecule are covalently linked via bell-shaped adapters to form a circular DNA molecule. As a result, both the Watson and Crick strands of a molecule are sequenced in the same reaction well, and the methylation status of each strand can be determined.

One advantage of embodiments of the present invention is the ability to ascertain methylation and genetic (i.e., sequence) information for long contiguous DNA molecules (kilobases or kilonucleotides in length). Generating such information with short-read sequencing technologies is more difficult. With short-read sequencing technologies, the sequencing information from multiple short reads has to be combined using a scaffold of genetic or epigenetic signatures before long stretches of methylation and genetic information can be inferred. This is likely to prove difficult in many scenarios because of the distance between such genetic or epigenetic anchors. For example, there is on average one SNP per kb, whereas current short-read sequencing technologies can typically sequence up to 300 nt per read, or 600 nt even in a paired-end format.

In one embodiment, variant-associated methylation haplotype analysis can be used to study the methylation patterns of imprinted genes. Imprinted regions are subject to epigenetic regulation (e.g., CpG methylation) in a parent-of-origin manner. For example, in the table of FIG. 60, one buffy coat DNA sample (M2) was sequenced to obtain approximately 152 million subreads. In this sample, 53% of the single-molecule real-time sequencing wells generated at least one subread that could be aligned to the human reference genome. The mean subread depth per SMRT well was 7.7x. In total, approximately 3 million consensus sequences were obtained. Approximately 91% of the reference genome was covered by a consensus sequence at least once. For the covered regions, the sequencing depth was 7.9x. The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 1.0.

FIG. 78 shows the size distribution of the sequenced molecules determined from the consensus sequences, with a median size of 6,289 bp (range: 66-198,109 bp). The fragment size (bp) is shown on the x-axis, and the frequency (%) associated with each fragment size is shown on the y-axis.

FIGS. 79A, 79B, 79C, and 79D show examples of allelic methylation patterns in imprinted regions. The x-axis indicates the coordinates of the CpG sites. The coordinates are highlighted in blue, and underlining indicates CpG islands. Filled dots indicate methylated CpG sites. Unfilled dots indicate unmethylated CpG sites. The letter embedded in each horizontal series of filled and unfilled dots (i.e., CpG sites) indicates the allele at the SNP site. The values in parentheses to the right of each horizontal series of dots give the fragment size, the fragment-level methylation density, and the number of CpG sites. For example, (10.0 kb, MD: 79.1%, CG: 139) means that the corresponding fragment is 10.0 kb in size, its methylation density is 79.1%, and it contains 139 CpG sites. The dashed rectangles outline the most differentially methylated region within each gene.

FIG. 79A shows 11 sequenced fragments from the SNURF gene with a median size of 11.2 kb (range: 1.3-25 kb). The SNURF gene is maternally imprinted, that is, the copy of the gene that an individual inherits from the mother is methylated and transcriptionally silent. As shown in FIG. 79A, within the dashed rectangle, the fragments associated with the C allele were highly methylated, whereas the fragments associated with the T allele were highly unmethylated. Highly methylated can mean that 70%, 80%, 90%, 95%, or 99% or more of the sites are methylated. Allele-specific methylation patterns could also be observed in other imprinted genes, including PLAGL1 (FIG. 79B), NAP1L5 (FIG. 79C), and ZIM2 (FIG. 79D). FIG. 79B shows that, for PLAGL1, the fragments associated with the T allele were highly unmethylated, whereas the fragments associated with the C allele were highly methylated. FIG. 79C shows that, for NAP1L5, the fragments associated with the C allele were highly unmethylated, whereas the fragments associated with the T allele were highly methylated. FIG. 79D shows that, for ZIM2, the fragments associated with the C allele were highly unmethylated, whereas the fragments associated with the T allele were highly methylated.

FIGS. 80A, 80B, 80C, and 80D show examples of allelic methylation patterns in non-imprinted regions. The x-axis indicates the coordinates of the CpG sites. The coordinates are highlighted in blue, and underlining indicates CpG islands. Filled dots indicate methylated CpG sites. Unfilled dots indicate unmethylated CpG sites. The letter embedded in each horizontal series of filled and unfilled dots (i.e., CpG sites) indicates the allele at the single nucleotide polymorphism (SNP) site. The values in parentheses to the right of each horizontal series of dots give the fragment size, the fragment-level methylation density, and the number of CpG sites. The dashed rectangles mark randomly selected regions used to calculate the methylation densities reported in parentheses. In contrast to the results in FIGS. 79A-79D, no such allelic methylation pattern was observable in the non-imprinted genes. FIG. 80A shows no difference in methylation pattern between the alleles in a chr7 region. FIG. 80B shows no difference in methylation pattern between the alleles in a chr12 region. FIG. 80C shows no difference in methylation pattern between the alleles in a chr1 region. FIG. 80D shows no difference in methylation pattern between the alleles in another chr1 region.

FIG. 81 shows a table of the methylation levels of allele-specific fragments. The first column lists the categories "imprinted genes" and "randomly selected regions". The second column lists the specific genes. The third column lists the first allele of the SNP in the gene. The fourth column lists the second allele of the SNP in the gene. The fifth column shows the methylation level of the fragments linked to the first allele. The sixth column shows the methylation level of the fragments linked to the second allele. For the imprinted genes, the methylation levels of the fragments linked to allele 2 (mean: 88.6%, range: 84.6-91.1%) were much higher than those of the fragments linked to allele 1 (mean: 12.2%, range: 7.6-15.7%) (P-value=0.03), indicating the presence of allele-specific methylation. In contrast, there was no significant difference in methylation levels between the alleles for the randomly selected regions (P-value=1), suggesting the absence of allele-specific methylation.
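The text does not state which statistical test produced these P-values. Purely as an illustration, the sketch below computes an exact two-sided rank-sum (Mann-Whitney type) permutation P-value, one test that could be applied to such a table. Several things here are assumptions: the choice of test, the use of four imprinted genes (taken from the four genes of FIGS. 79A-79D), and the two middle values in each list, which are hypothetical and chosen only to match the reported ranges and means. The function name is hypothetical.

```python
from itertools import combinations

def exact_rank_sum_p(group_a, group_b):
    """Exact two-sided permutation P-value for a difference in rank sums.

    Assumes all values are distinct (no tie handling).
    """
    pooled = sorted(group_a + group_b)
    if len(set(pooled)) != len(pooled):
        raise ValueError("tied values are not supported in this sketch")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a = len(group_a)
    expected = n_a * (len(pooled) + 1) / 2          # mean rank sum under H0
    observed_dev = abs(sum(rank[v] for v in group_a) - expected)
    extreme = total = 0
    for combo in combinations(range(1, len(pooled) + 1), n_a):
        total += 1
        if abs(sum(combo) - expected) >= observed_dev:
            extreme += 1
    return extreme / total

# Hypothetical per-gene methylation levels (%); only the minimum and maximum
# of each list come from the reported ranges.
allele_1 = [7.6, 12.0, 13.5, 15.7]
allele_2 = [84.6, 88.0, 90.7, 91.1]
p_value = exact_rank_sum_p(allele_1, allele_2)
print(round(p_value, 3))
```

With four values per group and complete separation between the groups, only 2 of the 70 possible rank assignments are as extreme as the observed one, so this test cannot return a P-value below 2/70 (about 0.029) for such a small table.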

7. Cell-Free DNA Analysis in Pregnancy
This example demonstrates that the methods disclosed herein are applicable to the analysis of cell-free nucleic acids in plasma or serum obtained from a woman pregnant with at least one fetus. During pregnancy, cell-free DNA molecules and cell-free RNA molecules from placental cells are found in the maternal circulation. Such placenta-derived cell-free nucleic acid molecules are also referred to as cell-free fetal nucleic acids in maternal plasma or circulating cell-free fetal nucleic acids. Cell-free fetal nucleic acids are present in maternal plasma within a background of maternal cell-free nucleic acids. For example, circulating cell-free fetal DNA molecules exist as a minor species within the background of cell-free maternal DNA in maternal plasma and serum.

It is known that genetic means, epigenetic means, or a combination thereof can be used to distinguish cell-free fetal DNA from cell-free maternal DNA in maternal plasma or serum. Genetically, the fetal genome can differ from the maternal genome by paternally inherited fetal-specific SNP alleles, paternally inherited mutations, or de novo mutations. Epigenetically, the placental methylome is generally hypomethylated compared with the methylome of maternal blood cells (Lun et al. Clin Chem. 2013;59:1583-94). Because the placenta is the main contributor of cell-free fetal DNA, while maternal blood cells are the main contributor of cell-free maternal DNA in the maternal circulation (plasma or serum), cell-free fetal DNA molecules are generally hypomethylated compared with cell-free maternal DNA in plasma or serum. There are, however, specific genomic loci at which the placenta is hypermethylated compared with maternal blood cells. For example, the promoter and exon 1 regions of RASSF1A are more methylated in the placenta than in maternal blood cells (Chiu et al. Am J Pathol. 2007;170:941-950). Therefore, circulating cell-free fetal DNA derived from the RASSF1A locus is hypermethylated compared with circulating cell-free maternal DNA derived from the same locus.

In embodiments, cell-free fetal DNA can be distinguished from cell-free maternal DNA molecules based on the different methylation states of the two pools of circulating nucleic acids. For example, if the CpG sites along a cell-free DNA molecule are found to be mostly unmethylated, the molecule is likely to be derived from the fetus. If the CpG sites along a cell-free DNA molecule are found to be mostly methylated, the molecule is likely to be from the mother. Several methods known to those skilled in the art can be used to confirm whether such a molecule is in fact from the fetus or from the mother. One approach is to compare the methylation pattern of the sequenced molecule with the known methylation profiles of the corresponding locus in the placenta or in maternal blood cells.

FIG. 82 shows an example of using methylation profiles to determine the placental origin of plasma DNA in pregnancy. The coordinates highlighted in blue and underlined indicate a CpG island. Filled dots indicate methylated sites. Unfilled dots indicate unmethylated sites. The numbers in parentheses near each horizontal line of dots indicate the fragment size, the single-molecule methylation density, and the number of CpG sites.

As shown in FIG. 82, if a maternal plasma cell-free DNA molecule aligns to the promoter region of RASSF1A (a region known to be specifically methylated in placental tissue) and the sequencing data generated using the methods of the present invention show it to be hypermethylated, the molecule is probably derived from the fetus or placenta. In contrast, a molecule showing hypomethylation is likely to be derived from the maternal background DNA (mainly of hematopoietic origin).
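The single-molecule reasoning above can be sketched in a few lines of Python. This is only an illustration: the function names and the 0.5 cutoff are hypothetical and are not taken from the text, and the rule applies only at a locus such as the RASSF1A promoter, where the placenta is hypermethylated relative to maternal blood cells.

```python
from typing import Sequence


def methylation_density(cpg_calls: Sequence[int]) -> float:
    """Fraction of CpG sites on one molecule called methylated (1 = methylated, 0 = unmethylated)."""
    if not cpg_calls:
        raise ValueError("molecule has no CpG sites")
    return sum(cpg_calls) / len(cpg_calls)


def likely_origin(cpg_calls: Sequence[int], threshold: float = 0.5) -> str:
    """Guess the origin of a molecule at a placenta-hypermethylated locus.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    if methylation_density(cpg_calls) >= threshold:
        return "fetal/placental"
    return "maternal"


# Toy molecules (made-up calls, not data from the disclosure).
print(likely_origin([1, 1, 1, 0]))  # mostly methylated
print(likely_origin([0, 0, 0, 1]))  # mostly unmethylated
```

At loci where the placenta is instead hypomethylated, the direction of the comparison would be reversed.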

FIG. 83 shows an approach for fetal-specific methylation analysis. This approach involves the use of sequenced molecules that contain a fetal-specific SNP allele or a fetal-specific mutation (e.g., paternally inherited or de novo in nature). When such a fetal-specific genetic feature is identified, the methylation states of the bases present on the same cell-free DNA molecule reflect the methylation profile of cell-free fetal DNA, or of the placental methylome. A fetal-specific genetic feature can be identified when plasma cell-free DNA sequencing reveals an allele or mutation that is not present in the maternal genome (e.g., as determined by analysis of maternal genomic DNA), or that is known from analysis of paternal DNA or known to be transmitted in the family (e.g., as determined by analysis of DNA from the proband).

The methylation of fetal-specific DNA molecules can be determined by analyzing those DNA fragments that carry an allele different from the homozygous alleles of the maternal genome. The methylation of fetal DNA molecules can be expected to be lower than that of maternal DNA molecules.

As an example, the buffy coat DNA of one pregnant woman and the matched placental DNA were sequenced to 59-fold and 58-fold haploid genome coverage, respectively. We identified a total of 822,409 informative SNPs for which the mother was homozygous and the fetus was heterozygous. Through single-molecule real-time sequencing, we found 2,652 fetal-specific fragments and 24,837 shared fragments (i.e., fragments carrying the shared allele, which are predominantly of maternal origin) in the maternal plasma (M13160). The fetal DNA fraction was 19.3%. The methylation profiles of these fetal-specific and shared fragments were deduced according to the present disclosure. The methylation level of the fetal-specific fragments was 57.4%, whereas that of the shared fragments was 69.9%. This finding is consistent with current knowledge that the methylation level of fetal DNA is lower than that of maternal DNA in the plasma of pregnant women (Lun et al. Clin Chem. 2013;59:1583-94).
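The text does not state how the fetal DNA fraction was computed. As a hedged illustration, one commonly used estimator at SNPs where the mother is homozygous and the fetus is heterozygous is f = 2p / (p + q), where p is the number of fragments carrying the fetal-specific allele and q is the number carrying the shared allele. Applied to the counts above, this estimator is consistent with the reported 19.3%. The function name below is hypothetical.

```python
def fetal_fraction(fetal_specific: int, shared: int) -> float:
    """Estimate the fetal DNA fraction as 2p / (p + q).

    Assumes informative SNPs where the mother is homozygous and the fetus is
    heterozygous, so fetal DNA contributes equally to the fetal-specific and
    shared alleles. This is an assumed estimator, not one stated in the text.
    """
    total = fetal_specific + shared
    if total == 0:
        raise ValueError("no fragments overlapping informative SNPs")
    return 2 * fetal_specific / total


# Fragment counts reported for maternal plasma sample M13160.
frac = fetal_fraction(2652, 24837)
print(f"{frac:.1%}")  # 19.3%
```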

Methylation patterns can be used for diagnostic or monitoring purposes. For example, the methylation profile of a maternal plasma sample has been used to determine gestational age (https://www.ncbi.nlm.nih.gov/pubmed/27979959). One application is as a quality control step. Another potential application is monitoring the "biological age" and the "chronological age" of a pregnancy. This application can be used for the detection or risk assessment of preterm birth. Other embodiments can be used for the analysis of fetal cells in maternal blood. In still other embodiments, such fetal cells can be identified by antibody-based approaches or by selective staining using cell markers (e.g., on the cell surface or within the cytoplasm), or can be enriched by flow cytometry, micromanipulation, microdissection, or physical methods (e.g., differential flow through a chamber, surface, or container).

C. Methylation Detection Using Different Reagents
This section shows that the methylation technique is not limited to a particular reagent system.

Methylation analysis was performed using different reagent systems to confirm that the technique is applicable across them. As an example, single-molecule real-time sequencing (SMRT-seq) was performed using the Sequel II system (Pacific Biosciences). Sheared DNA molecules were subjected to single-molecule real-time (SMRT) sequencing template construction using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences). The conditions for sequencing primer annealing and polymerase binding were calculated using SMRT Link v8.0 software (Pacific Biosciences). Briefly, sequencing primer v2 was annealed to the sequencing template, and the polymerase was then bound to the template using the Sequel II Binding and Internal Control Kit 2.0 (Pacific Biosciences). Sequencing was performed on a Sequel II SMRT Cell 8M. Sequencing movies were collected for 30 hours on the Sequel II system using the Sequel II Sequencing Kit 2.0 (Pacific Biosciences). In other embodiments, other chemical reagents and reaction buffers may be used for SMRT-seq. In one embodiment, the polymerase has different kinetic characteristics for incorporating nucleotides along the DNA template strand depending on the methylation state of the template (Huber et al. Nucleic Acids Res. 2016;44:9881-9890). In this disclosure, results are generated using sequencing primer v1 unless otherwise stated.

To demonstrate the use of the invention described in this disclosure with different reagents, we analyzed SMRT-seq data generated with different sequencing kits including, but not limited to, Sequel I Sequencing Kit 3.0, RS II, Sequel II Sequencing Kit 1.0, and Sequel II Sequencing Kit 2.0. RS II has 150,000 ZMWs per SMRT cell. Sequel uses 1,000,000 ZMWs per SMRT cell. Sequel II uses 8 million ZMWs per SMRT cell, with two sequencing kits (1.0 and 2.0). The analysis included two types of dataset. The first was prepared from DNA after whole-genome amplification and represents the unmethylated state. The second was prepared from DNA after M.SssI methyltransferase treatment and represents the methylated state. These data were generated using Sequel Sequencing Kit 3.0 on the Sequel sequencer, and using Sequel II Sequencing Kit 1.0 and Sequel II Sequencing Kit 2.0 on the Sequel II sequencer. We therefore obtained three datasets with kinetic profiles generated using different reagents (e.g., polymerases). Each dataset was split into a training dataset and a test dataset in order to evaluate performance using the CNN model according to the present disclosure.

1. Measurement Windows
FIGS. 84A, 84B, and 84C show the performance of different measurement window sizes across different SMRT-seq reagent kits on training datasets comprising whole-genome amplified data (unmethylated CpG sites) and M.SssI-treated data (methylated CpG sites). The true positive rate is plotted on the y-axis and the false positive rate is plotted on the x-axis. FIG. 84A shows SMRT-seq data generated with Sequel Sequencing Kit 3.0. FIG. 84B shows SMRT-seq data generated with Sequel II Sequencing Kit 1.0. FIG. 84C shows SMRT-seq data generated with Sequel II Sequencing Kit 2.0. In the figures, signals upstream of the CpG cytosine site being analyzed are denoted by "-", and signals downstream of that site are denoted by "+". For example, "-6nt" represents the signals 6 nt upstream of the CpG cytosine site being analyzed, and "+6nt" represents the signals 6 nt downstream of it. "±6nt" indicates that both the 6-nt upstream signals and the 6-nt downstream signals of the CpG cytosine site being analyzed are included (i.e., a total of 12 nt of sequence flanking the CpG cytosine site).
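The window notation above can be made concrete with a short Python sketch. It assumes, purely for illustration, that per-base signals (for example IPD or PW values) are stored in a list ordered along the strand, that "upstream" corresponds to lower indices, and that the analyzed cytosine itself is always included in the window. The function name and toy values are hypothetical.

```python
from typing import List, Sequence


def measurement_window(values: Sequence[float], center: int,
                       upstream: int = 0, downstream: int = 0) -> List[float]:
    """Return the signals for the analyzed cytosine plus its flanking positions.

    `center` is the index of the CpG cytosine; `upstream` and `downstream`
    are the numbers of flanking nucleotides to include on each side.
    """
    start = center - upstream
    end = center + downstream
    if start < 0 or end >= len(values):
        raise ValueError("window extends past the end of the read")
    return list(values[start:end + 1])


ipd = [float(i) for i in range(30)]   # made-up per-base IPD values
c = 15                                # index of the analyzed CpG cytosine

minus6 = measurement_window(ipd, c, upstream=6)                  # "-6nt"
plus6 = measurement_window(ipd, c, downstream=6)                 # "+6nt"
both6 = measurement_window(ipd, c, upstream=6, downstream=6)     # "±6nt"
print(len(minus6), len(plus6), len(both6))  # 7 7 13
```

With this convention the "±6nt" window covers 13 positions: the cytosine itself plus the 12 flanking nucleotides.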

As shown in FIG. 84A, for the training dataset based on Sequel Sequencing Kit 3.0, using a measurement window that includes the signals (e.g., IPD, PW, relative position, and sequence composition) of the CpG cytosine being analyzed and of the 6 nt upstream of that cytosine (denoted -6nt), the AUC value of 0.50 suggested no discriminatory power for distinguishing methylated CpG cytosines from unmethylated cytosines. However, for the training datasets based on Sequel II Sequencing Kit 1.0 and 2.0, the corresponding AUC values were 0.62 (FIG. 84B) and 0.75 (FIG. 84C). These data indicate that different reagents used in SMRT-seq have their own distinct kinetic profiles. These data also show that the methods disclosed herein are readily adapted to the use of different reagents. Moreover, the accuracy of detecting base modifications could potentially be improved through further development of the reagents, for example the use of different polymerases and other chemistries.
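For readers less familiar with the AUC metric, the following Python sketch shows one standard way to compute it: as the probability that a randomly chosen methylated site receives a higher classifier score than a randomly chosen unmethylated site, with ties counted as one half. The scores are made up, and the text does not say which implementation was used for the reported values.

```python
from typing import Sequence


def roc_auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Perfect separation gives 1.0; uninformative scores give 0.5.
print(roc_auc([0.9, 0.8], [0.1, 0.2]))   # 1.0
print(roc_auc([0.5, 0.5], [0.5, 0.5]))   # 0.5
```

An AUC of 0.50, as reported for the -6nt window with Sequel Sequencing Kit 3.0, therefore corresponds to scores that rank methylated and unmethylated sites no better than chance.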

As another example, as shown in FIG. 84A, for the training dataset based on Sequel Sequencing Kit 3.0, using a measurement window that includes the signals 10 bp upstream of the CpG cytosine site (denoted -10nt), the AUC value of 0.50 suggested no discriminatory power for distinguishing methylated CpG cytosines from unmethylated cytosines. However, for the training datasets based on Sequel II Sequencing Kit 1.0 and 2.0, the corresponding AUC values were 0.66 (FIG. 84B) and 0.79 (FIG. 84C), an improvement over the measurement window that includes the 6-nt upstream signals. These data confirm that different reagents used for SMRT-seq have their own distinct kinetic profiles, and show that the methods disclosed herein are readily adapted to the use of different reagents.

In contrast to measurement windows with upstream signals, measurement windows with downstream signals can lead to substantially improved classification performance. For example, as shown in FIG. 84A, for the training dataset based on Sequel Sequencing Kit 3.0, using a measurement window that includes the 6-nt downstream signals (+6nt) of the CpG cytosine site, the AUC value was 0.94, much greater than when the 6-nt upstream signals were used (AUC: 0.5). For the training datasets based on Sequel II Sequencing Kit 1.0 and 2.0, the corresponding AUC values were 0.95 (FIG. 84B) and 0.92 (FIG. 84C), respectively, an improvement over the measurement window that includes the 6 nt upstream. These data suggest that kinetic features linked to sequence context improve the classification power obtained using, for example but not limited to, a CNN model. These data also suggest that, through adjustment of the measurement window, the present disclosure is applicable to datasets generated with different reagents and different sequencing conditions (e.g., different polymerases, other chemical reagents, their concentrations, and sequencing reaction parameters such as duration). Similar conclusions can be drawn from the analysis using measurement windows that include the 10-nt downstream signals of the CpG cytosine site (FIGS. 84A, 84B, and 84C).

In another embodiment, a measurement window can be used that includes the signals on the cytosine being analyzed together with the signals both upstream and downstream of that cytosine. For example, as shown in FIGS. 84A, 84B, and 84C, using a measurement window that includes the 6-nt upstream and 6-nt downstream signals (denoted ±6nt), the AUC values were 0.94, 0.95, and 0.92 for the training datasets based on Sequel Sequencing Kit 3.0, Sequel II Sequencing Kit 1.0, and Sequel II Sequencing Kit 2.0, respectively. Using a measurement window that includes the 10-nt upstream and 10-nt downstream signals (denoted ±10nt), the AUC values were 0.94, 0.95, and 0.94 for the same three training datasets, respectively. These data suggest that the present disclosure is broadly applicable to datasets generated with different reagents and different sequencing reaction parameters.

FIGS. 85A, 85B, and 85C show the results obtained on the test datasets with different measurement windows across the different sequencing kits when the CNN models trained on the training datasets were applied. The true positive rate is plotted on the y-axis and the false positive rate is plotted on the x-axis. The legend labels are the same as those used in FIGS. 84A, 84B, and 84C. FIG. 85A shows SMRT-seq data generated with Sequel Sequencing Kit 3.0. FIG. 85B shows SMRT-seq data generated with Sequel II Sequencing Kit 1.0. FIG. 85C shows SMRT-seq data generated with Sequel II Sequencing Kit 2.0. All of the conclusions drawn from the training datasets could be validated in these independent test datasets, which were not involved in the training process. Furthermore, in two of the three independent test datasets (2/3), namely those based on Sequel II Sequencing Kit 1.0 and 2.0, the measurement window that includes the 10-nt upstream and 10-nt downstream signals (denoted ±10nt) was shown to outperform the others.

2. Comparison with Bisulfite Sequencing
FIGS. 86A, 86B, and 86C show the correlation of overall methylation levels quantified by bisulfite sequencing and by SMRT-seq (Sequel II Sequencing Kit 2.0). In FIG. 86A, the methylation level quantified by SMRT-seq is shown as a percentage on the y-axis, and the methylation level quantified by bisulfite sequencing is shown as a percentage on the x-axis. The black line is the fitted regression line. The dashed line is the diagonal on which the two measurements are equal. FIG. 86B shows a Bland-Altman plot. The x-axis shows the mean of the methylation levels quantified by SMRT-seq according to the present disclosure and by bisulfite sequencing. The y-axis shows the difference in methylation level between SMRT-seq according to the present disclosure and bisulfite sequencing (i.e., Pacific Biosciences-based methylation minus bisulfite-based methylation). The dashed line is the horizontal line through zero, on which there is no difference between the two measurements. Data points off the dashed line indicate a deviation between the measurements. FIG. 86C shows the percentage change relative to the values quantified by bisulfite sequencing. The x-axis shows the mean of the methylation levels quantified by SMRT-seq according to the present disclosure and by bisulfite sequencing. The y-axis shows the difference in methylation level between the two measurements as a percentage of the mean methylation level. The dashed line is the horizontal line through zero, on which there is no difference between the two measurements. Data points off the dashed line indicate a deviation between the measurements.

With respect to FIG. 86A, the linear regression equation is Y = aX + b, where "Y" represents the methylation level determined by SMRT-seq according to the present disclosure, "X" represents the methylation level determined by bisulfite sequencing, "a" represents the slope of the regression line (e.g., a = 1.45), and "b" represents the y-axis intercept (e.g., b = -20.98). In this case, a methylation value Y determined by SMRT-seq is converted by calculating (Y - b)/a. The graph shows that methylation levels determined by SMRT-seq can be converted to methylation levels determined by bisulfite sequencing, and vice versa, for Sequel II Sequencing Kit 2.0 as well as for Sequel II Sequencing Kit 1.0.
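The conversion between the two scales follows directly from the regression equation. The Python sketch below uses the example coefficients quoted above (a = 1.45, b = -20.98); the function names are hypothetical, and coefficients fitted on other data or other kits would differ.

```python
A = 1.45     # example slope quoted in the text
B = -20.98   # example y-axis intercept quoted in the text


def bisulfite_to_smrt(x: float, a: float = A, b: float = B) -> float:
    """Map a bisulfite-sequencing methylation level X (%) to the SMRT-seq scale: Y = aX + b."""
    return a * x + b


def smrt_to_bisulfite(y: float, a: float = A, b: float = B) -> float:
    """Map a SMRT-seq methylation level Y (%) to the bisulfite scale: X = (Y - b) / a."""
    if a == 0:
        raise ValueError("slope must be non-zero to invert the regression")
    return (y - b) / a


# Round trip on an arbitrary illustrative value.
x = 60.0
y = bisulfite_to_smrt(x)
print(round(smrt_to_bisulfite(y), 6))  # 60.0
```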

FIG. 86B is a Bland-Altman plot showing the bias in methylation quantification between SMRT-seq according to the present disclosure and bisulfite sequencing. The x-axis shows the mean of the methylation levels quantified by the two methods, and the y-axis shows the difference between the methylation levels quantified by the two methods. The median difference between the two measurements was -6.85% (range: -10.1 to 1.7%). Relative to the values obtained by bisulfite sequencing, the median percentage change in the methylation level quantified according to the present disclosure was -9.96% (range: -14.76 to 3.21%). The difference varies with the mean value: the larger the mean of the two measurements, the greater the bias.

FIG. 86C shows the same data as FIG. 86B, except that the difference in methylation level is divided by the mean of the two methylation levels. FIG. 86C also shows that the larger the mean of the two measurements, the greater the bias.
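The quantities plotted in FIGS. 86B and 86C can be sketched as follows. This is an assumption-laden illustration: the paired values are made up, the function name is hypothetical, and the percentage is computed relative to the mean of the two measurements, as the description of FIG. 86C suggests.

```python
from statistics import median
from typing import List, Sequence, Tuple


def bland_altman(smrt: Sequence[float], bisulfite: Sequence[float]
                 ) -> Tuple[List[float], List[float], List[float]]:
    """Return per-sample means, differences (SMRT-seq minus bisulfite),
    and differences expressed as a percentage of the mean."""
    if len(smrt) != len(bisulfite):
        raise ValueError("inputs must be paired")
    means = [(s + b) / 2 for s, b in zip(smrt, bisulfite)]
    diffs = [s - b for s, b in zip(smrt, bisulfite)]
    pct = [100 * d / m for d, m in zip(diffs, means)]
    return means, diffs, pct


# Made-up paired methylation levels (%), not data from the disclosure.
smrt_levels = [60.0, 65.0, 72.0]
bs_levels = [70.0, 72.0, 75.0]
means, diffs, pct = bland_altman(smrt_levels, bs_levels)
print(median(diffs))  # median bias across samples
```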

The error may lie in the bisulfite sequencing and be unrelated to the SMRT-seq-based method. Conventional whole-genome bisulfite sequencing (Illumina) has been reported to show considerable variation between methods in the quantification of methylation levels in particular genomic regions, and to introduce significantly biased sequence output and overestimated global methylation (Olova et al. Genome Biol. 2018;19:33). Embodiments disclosed herein have several exemplary advantages: they can be performed without bisulfite conversion, which drastically degrades DNA, and without PCR amplification.

3. Tissue of Origin
Methylation analysis across various cancer types was performed using single-molecule real-time sequencing (SMRT-seq, Pacific Biosciences) according to embodiments of the present disclosure. The cancer types subjected to SMRT-seq included, but were not limited to, colorectal cancer (n=3), esophageal cancer (n=2), breast cancer (n=2), renal cell carcinoma (n=2), lung cancer (n=2), ovarian cancer (n=2), prostate cancer (n=2), gastric cancer (n=2), and pancreatic cancer (n=1). The matched adjacent non-tumor tissues were also subjected to SMRT-seq. The datasets were generated from DNA prepared with the Sequel II Sequencing Kit 2.0.

FIGS. 87A and 87B show a comparison of overall methylation levels between various tumor tissues and the paired adjacent non-tumor tissues. The methylation level is shown as a percentage on the y-axis. In FIG. 87A, the methylation levels were quantified by SMRT-seq. In FIG. 87B, the methylation levels were quantified by bisulfite sequencing. The tissue type (i.e., tumor tissue or adjacent non-tumor tissue) is shown on the x-axis. Different symbols represent different tissues of origin.

FIG. 87A shows that the overall methylation levels of tumor tissues, including breast, colorectal, esophageal, liver, lung, ovarian, pancreatic, renal cell, and gastric cancers, were significantly lower than those of the corresponding non-tumor tissues (including breast, colon, esophagus, liver, lung, ovary, pancreas, prostate, kidney, and stomach, respectively) (P value = 0.006, Wilcoxon signed-rank test for paired samples). The median difference in methylation level between tumor and paired non-tumor tissues was -2.7% (IQR: -6.4 to -0.8%).

FIG. 87B confirms the lower methylation levels in tumor tissues. These results therefore suggest that methylation patterns across various cancer types and tissues can be accurately determined by SMRT-seq according to the present disclosure, which implies broad applications of the present disclosure in tissue-biopsy-based early detection, prognosis, diagnosis, and treatment of cancer. The differing degrees of reduction in methylation level across tumor types may suggest that methylation patterns are related to the cancer type, so that the tissue of origin of a cancer could be determined.

D. Enhanced Detection and Other Techniques
In some embodiments, the analysis of base modifications (e.g., methylation) can be performed using one or more of the following parameters: sequence context, IPD, and PW. IPD and PW can be determined from the sequencing reaction without alignment to a reference genome. Aspects of the single-molecule real-time sequencing approach can further enhance the accuracy with which sequence context, IPD, and PW are determined. One such aspect is circular consensus sequencing, in which a particular position of the sequencing template can be measured multiple times, so that sequence context, IPD, and PW can be measured based on the mean or the distribution of the values from these multiple reads. In certain embodiments, analysis of base modifications without an alignment process can increase computational efficiency, shorten turnaround time, and reduce the cost of analysis. Embodiments can be performed without an alignment process. In still other embodiments, an alignment process can be used and may be preferred, for example when the alignment process is used to confirm the clinical or biological significance of a detected base modification (e.g., when a tumor suppressor is hypermethylated), or when the alignment process is used to select a subset of the sequencing data corresponding to particular genomic regions of interest for further analysis. For embodiments in which data from selected genomic regions are desired, the embodiments may involve targeting such regions using one or more enzymes or enzyme-based methodologies that can cut at the regions of interest in the genome, for example restriction enzymes or the CRISPR-Cas9 system. Because PCR amplification typically does not preserve information about base modifications in DNA, the CRISPR-Cas9 system may be preferred over PCR-based methods. The methylation levels of regions selected in this way (bioinformatically [e.g., via alignment] or via methods such as CRISPR-Cas9) can be analyzed to provide information about tissue of origin, fetal disorders, pregnancy disorders, and cancer.

1. Methylation Analysis Using Subreads Without Alignment to a Reference Genome
In embodiments, methylation analysis can be performed using measurement windows that include kinetic features and sequence context from subreads, without alignment to a reference genome. As shown in FIG. 88, subreads originating from a zero-mode waveguide (ZMW) were used to construct a consensus sequence 8802 (also known as a circular consensus sequence (CCS)). The average kinetic values at each position of the CCS, including but not limited to PW and IPD values, were calculated. The sequence context surrounding a CpG site was determined from the CCS on the basis of the sequences upstream and downstream of that CpG site. Thus, the measurement windows defined in this disclosure were constructed for training, each including PW values, IPD values, and sequence context according to the subreads with kinetic features relative to the CCS. This procedure removes the need to align subreads to a reference genome.
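
The construction described above can be illustrated with a minimal sketch. It assumes that each subread's IPD and PW values have already been indexed by CCS position (the registration of subreads to the CCS is not shown), and the function names, the dictionary layout, and the `flank` parameter are hypothetical choices made for illustration rather than part of the disclosed method.

```python
from statistics import mean


def mean_kinetics_per_position(subread_values):
    """Average one kinetic value (IPD or PW) across subreads at each CCS position.

    subread_values: list of dicts, one per subread, mapping CCS position -> value.
    Positions that a subread does not cover are simply absent from its dict.
    """
    by_position = {}
    for values in subread_values:
        for pos, value in values.items():
            by_position.setdefault(pos, []).append(value)
    return {pos: mean(vals) for pos, vals in by_position.items()}


def cpg_measurement_windows(ccs, mean_ipd, mean_pw, flank=3):
    """Build one window per CpG in the CCS: `flank` nt upstream of the C,
    the CG dinucleotide itself, and `flank` nt downstream of the G."""
    windows = []
    for i in range(flank, len(ccs) - flank - 1):
        if ccs[i:i + 2] != "CG":
            continue
        positions = range(i - flank, i + flank + 2)
        windows.append({
            "cpg_position": i,
            "context": ccs[i - flank:i + flank + 2],
            "ipd": [mean_ipd.get(p) for p in positions],
            "pw": [mean_pw.get(p) for p in positions],
        })
    return windows
```

Each returned window pairs the sequence context with the averaged IPD and PW values at the same positions, which is the kind of input a classifier could be trained on; CpG sites too close to either end of the CCS to have a full flank are skipped in this sketch.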

To test the principle shown in FIG. 88, a training dataset was created using 601,942 unmethylated CpG sites derived from whole-genome-amplified DNA and 163,527 methylated CpG sites derived from DNA treated with a CpG methyltransferase (e.g., M.SssI). A testing dataset was created using 546,393 unmethylated CpG sites derived from whole-genome-amplified DNA and 193,641 methylated CpG sites derived from DNA treated with a CpG methyltransferase (e.g., M.SssI). The datasets were generated from DNA prepared with the Sequel II Sequencing Kit 2.0.

As shown in FIG. 89, in one embodiment, training a convolutional neural network (CNN) model for determining methylation with the kinetic features and sequence context associated with subreads and the CCS achieved AUC values of 0.94 and 0.95 for differentiating methylated from unmethylated CpG sites in the testing and training datasets, respectively. In other embodiments, other neural network models, deep learning algorithms, artificial intelligence, and/or machine learning algorithms can be used.

With the cutoff for the probability of methylation set at 0.2, a sensitivity of 82.4% and a specificity of 91.7% can be obtained in detecting methylated CpG sites. These results indicate that subreads with kinetic features can be used to differentiate methylated from unmethylated CpG sites without prior alignment to a reference genome.
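
As an illustration of how sensitivity and specificity follow from a probability cutoff, the sketch below applies a cutoff to per-site methylation probabilities and compares the calls with known labels. The function name and inputs are hypothetical, and calling a site methylated when its probability is greater than or equal to the cutoff (rather than strictly greater) is an assumption not stated in the text.

```python
def sensitivity_specificity(probabilities, labels, cutoff=0.2):
    """Return (sensitivity, specificity) for methylated-site detection.

    probabilities: model-estimated probability of methylation per CpG site.
    labels: True for truly methylated sites, False for unmethylated sites.
    A site is called methylated when its probability is >= cutoff (assumed).
    """
    tp = fn = tn = fp = 0
    for p, is_methylated in zip(probabilities, labels):
        called_methylated = p >= cutoff
        if is_methylated and called_methylated:
            tp += 1
        elif is_methylated:
            fn += 1
        elif called_methylated:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both methylated and unmethylated sites are required")
    return tp / (tp + fn), tn / (tn + fp)
```

Sweeping `cutoff` from 0 to 1 and recording the two values at each step traces out the ROC curve from which an AUC is computed.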

In another embodiment, kinetic features together with sequence context can also be used directly from subreads, without CCS information and without prior alignment to a reference genome, to determine the methylation status across CpG sites. To train a CNN model for determining methylation status, kinetic features including the PW and IPD values at positions spanning 20 nt upstream and 20 nt downstream of each CpG present in a subread were used. As shown in FIG. 90, the AUC of the ROC curve using subread-associated kinetic features according to embodiments of this disclosure was 0.70 and 0.69 for detecting methylated CpG sites in the training and testing datasets, respectively. These data suggest that, using embodiments of this disclosure, it is feasible to infer the methylation patterns of DNA molecules from subread-associated kinetic features without prior alignment and without construction of a consensus sequence. However, the performance of this embodiment in determining methylation was inferior to that of the embodiments described in this disclosure that additionally make use of alignment information or a consensus sequence. Improved accuracy in generating subreads and kinetic values would be expected to improve the performance of determining base modifications using subreads and their associated kinetic features.

2. Methylation Analysis of Deletion Regions Using Targeted Single-Molecule Real-Time Sequencing
The methods described herein can also be applied to analyze one or more selected genomic regions. In one embodiment, the region(s) of interest can first be enriched by a hybridization method that allows DNA molecules from the region(s) of interest to hybridize to synthetic oligonucleotides with complementary sequences. For analysis of base modifications using the methods described herein, the target DNA molecules cannot be amplified by PCR before being subjected to sequencing, because the base modification information of the original DNA molecules is not transferred to the PCR products. Several methods have been developed to enrich such target regions without PCR amplification.

In another embodiment, the target region(s) can be enriched through the use of the CRISPR-Cas9 system (Stevens et al. PLOS One 2019;14(4):e0215441; Watson et al. Lab Invest 2020;100:135-146). In one embodiment, the ends of the DNA molecules in a DNA sample are first dephosphorylated so that they cannot be ligated directly to sequencing adapters. The Cas9 protein, guided by guide RNA (crRNA), is then directed to the region(s) of interest to create double-strand breaks. The region(s) of interest, flanked on both sides by double-strand breaks, are then ligated to the sequencing adapters specified by the chosen sequencing platform. In another embodiment, the DNA can be treated with exonuclease so that DNA molecules not bound by Cas9 protein are degraded (Stevens et al. PLOS One 2019;14(4):e0215441). Because these methods do not involve PCR amplification, the original DNA molecules carrying base modifications can be sequenced and the base modifications determined. In one embodiment, this method can be used to target a large number of regions sharing homologous sequences, for example, long interspersed nuclear element (LINE) repeats. In one example, such an analysis can be applied to circulating cell-free DNA in maternal plasma for the detection of fetal aneuploidies (Kinde et al. PLOS One 2012;7(7):e41162).

As shown in FIG. 91, targeted single-molecule real-time sequencing can be implemented using the CRISPR (clustered regularly interspaced short palindromic repeats)/Cas9 (CRISPR-associated protein 9) system. DNA fragments carrying 5' phosphoryl groups (i.e., 5'-P) and 3' hydroxyl groups (i.e., 3'-OH) (e.g., molecule 9102) were subjected to an end-blocking process in which the 5'-P was removed and the 3'-OH was ligated with a dideoxynucleotide (i.e., ddNTP). The resulting end-modified molecules (e.g., molecule 9104) therefore could not be ligated with adapters for subsequent DNA library preparation. The end-blocked molecules were, however, subjected to target-specific cleavage mediated by the CRISPR/Cas9 system, which introduced 5'-P and 3'-OH ends into the molecules of interest. Such newly cleaved DNA molecules with 5'-P and 3'-OH ends (e.g., molecule 9106) could then be ligated with hairpin adapters to form circular molecules 9108. Unligated adapters, linear DNA, and molecules carrying only one cut were subjected to digestion with exonucleases III and VII. As a result, molecules ligated with two hairpin adapters were enriched and subjected to single-molecule real-time sequencing. These targeted molecules were suitable for base modification analysis according to the embodiments presented in this disclosure (i.e., targeted single-molecule real-time sequencing).

As shown in FIG. 92, the Cas9 protein of the CRISPR/Cas9 system interacted with a guide RNA (i.e., gRNA) comprising a CRISPR RNA (crRNA, responsible for DNA targeting) and a trans-activating crRNA (tracrRNA, responsible for forming a complex with Cas9) (Pickar-Oliver et al. Nat Rev Mol Cell Biol. 2019;20:490-507). The curved shape represents the Cas9 protein, an enzyme that uses the CRISPR sequence as a guide to recognize and cleave a specific strand of DNA complementary to part of the CRISPR sequence. The crRNA was annealed to the tracrRNA. In one embodiment, a synthetic single RNA sequence, called a single guide RNA (sgRNA), contained both the crRNA and tracrRNA sequences. A segment of the crRNA called the spacer sequence guides the Cas9 protein to recognize and cleave a specific strand of double-stranded DNA (dsDNA) through complementary base pairing with the target region. In one embodiment, there were no mismatches in the complementarity between the spacer sequence and the target dsDNA. In another embodiment, mismatches would be permitted in the complementary base pairing between the spacer sequence and the target dsDNA; for example, the number of mismatches may be, but is not limited to, 1, 2, 3, 4, 5, 6, 7, 8, etc. In one embodiment, the CRISPR sequence is programmable according to the cleavage efficiency, specificity, sensitivity, and multiplexing capability of different CRISPR/Cas complex designs.

As shown in FIG. 93, we designed a pair of CRISPR/Cas9 complexes targeting two cuts spanning an Alu element in the human genome. "XXX" indicates the three nucleotides adjacent to the Cas9 nuclease cleavage site. "YYY" indicates the three corresponding nucleotides complementary to "XXX". 5'-NGG represents the protospacer adjacent motif (PAM) sequence. For other CRISPR/Cas systems, the PAM sequence may be different, and the sequence adjacent to the Cas nuclease cleavage site may be different. In this figure, the size of the Alu region was 223 bp. There were 1,175,329 Alu regions, each containing a homolog of such an Alu element in the human genome. A median of 5 CpG sites was located in this Alu element (range: 0 to 34). As an example, this design included a 36-nt crRNA containing a 20-nt spacer sequence. The detailed gRNA sequences are as follows.

First CRISPR/Cas9 complex, for introducing the first cut (all sequences are written 5' to 3'):
crRNA: GCCUGUAAUCCCAGCACUUUGUUUUAGAGCUAUGCU
tracrRNA: AGCAUAGCAAGUUAAAAUAAGGCUAGUCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUU

Second CRISPR/Cas9 complex, for introducing the second cut:
crRNA: AGGGUCUCGCUCUGUCGCCCGUUUUAGAGCUAUGCU
tracrRNA: AGCAUAGCAAGUUAAAAUAAGGCUAGUCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUU
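
The stated lengths (a 36-nt crRNA containing a 20-nt spacer, and a 67-nt tracrRNA) can be checked against the listed sequences with a short sketch. The sequences below are transcribed from the listing above; treating the 5'-most 20 nt of each crRNA as the spacer is an assumption, supported only by the observation that the two crRNAs share an identical 16-nt 3' portion. The constant and function names are hypothetical.

```python
CR_RNA_1 = "GCCUGUAAUCCCAGCACUUUGUUUUAGAGCUAUGCU"
CR_RNA_2 = "AGGGUCUCGCUCUGUCGCCCGUUUUAGAGCUAUGCU"
TRACR_RNA = "AGCAUAGCAAGUUAAAAUAAGGCUAGUCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUU"
SPACER_LENGTH = 20  # nt, as stated in the text


def spacer_as_dna(cr_rna, spacer_length=SPACER_LENGTH):
    """Return the assumed spacer (5'-most nucleotides) written as DNA (U -> T)."""
    return cr_rna[:spacer_length].replace("U", "T")


if __name__ == "__main__":
    for name, seq in (("crRNA 1", CR_RNA_1), ("crRNA 2", CR_RNA_2), ("tracrRNA", TRACR_RNA)):
        print(name, len(seq), "nt")
    print("spacer 1 as DNA:", spacer_as_dna(CR_RNA_1))
    print("spacer 2 as DNA:", spacer_as_dna(CR_RNA_2))
```

Writing the spacers as DNA gives the sequences that would be searched for in a genome when enumerating candidate target sites.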

The crRNA molecule was annealed to the tracrRNA (e.g., 67 nt) to form the backbone of the gRNA. Cas9 nuclease carrying the designed gRNA can cleave both strands of end-blocked molecules that harbor the target cutting site, with a certain level of specificity. There were 116,184 Alu regions of interest in the human genome that were expected to be cut by the designed CRISPR/Cas9 complexes. Thus, after targeted cutting by the Cas9 complexes, these Alu regions can be ligated to hairpin adapters. The molecules ligated to hairpin adapters can be sequenced by single-molecule real-time sequencing, and the methylation patterns of these Alu regions can be determined in a targeted manner. In one embodiment, the spacer sequences from the two Cas9 complexes can base-pair with the same strand (e.g., the Watson strand or the Crick strand) of a double-stranded DNA substrate. In one embodiment, the spacer sequences of the gRNAs from the two Cas9 complexes can base-pair with different strands of a double-stranded DNA substrate. For example, the spacer sequence of one Cas9 complex was complementary to the Watson strand of the double-stranded DNA substrate and the spacer sequence of the other Cas9 complex was complementary to the Crick strand, or vice versa.
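
To make the strand relationships concrete, the sketch below searches a DNA sequence on both strands for a given protospacer (the spacer written as DNA) immediately followed by a 5'-NGG PAM and reports where the cut would fall. Placing the blunt cut 3 bp upstream of the PAM is the commonly described behavior of Streptococcus pyogenes Cas9 and is assumed here; exact matching with no mismatches is also an assumption, and all names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_cas9_sites(dna, protospacer):
    """Return (strand, cut) for every exact protospacer match followed by NGG.

    `cut` is the number of bases lying to the left of the cut, counted on the
    forward ("+") strand of `dna`, with the cut assumed 3 bp upstream of the PAM.
    """
    pattern = re.compile("(?=" + protospacer + "[ACGT]GG)")  # lookahead allows overlaps
    sites = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for match in pattern.finditer(seq):
            cut = match.start() + len(protospacer) - 3
            if strand == "-":
                cut = len(dna) - cut  # convert back to forward-strand coordinates
            sites.append((strand, cut))
    return sites
```

Running the search once per spacer and pairing a cut from each result gives the span between the two cuts, that is, the expected size of the excised target molecule.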

In one embodiment, the DNA molecules ligated with hairpin adapters were in a circular form that is resistant to exonuclease digestion. The adapter-ligated DNA products can therefore be treated with exonucleases (e.g., exonucleases III and VII) to remove linear DNA (e.g., off-target DNA molecules). This exonuclease step can further enrich the target molecules. The size of a target molecule to be sequenced depended on the span between the two cutting sites introduced by one or more Cas9 nucleases (including, but not limited to, 10 bp, 20 bp, 30 bp, 40 bp, 50 bp, 100 bp, 200 bp, 300 bp, 400 bp, 500 bp, 1000 bp, 2000 bp, 3000 bp, 4000 bp, 5000 bp, 10 kb, 20 kb, 30 kb, 40 kb, 50 kb, 100 kb, 200 kb, 300 kb, 500 kb, and 1 Mb).

As an example, using Cas9 with gRNAs targeting Alu regions, we sequenced 187,010 molecules from a human hepatocellular carcinoma (HCC) tumor tissue sample by single-molecule real-time sequencing. Of these, 113,491 molecules carried the targeted cuts (i.e., the on-target cutting rate was approximately 60.7% of the molecules). The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 2.0. In other words, in this example, the specificity of the cutting sites introduced into the molecules of interest by the Cas9 complexes was 60.7%. In other embodiments, the specificity of the cutting sites introduced into the molecules of interest by Cas9 or other Cas complexes would vary, including but not limited to 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. To determine the methylation status at the CpG sites of the Alu sequences, the IPD values, PW values, and sequence context derived from the CCS and subreads were used without alignment to a reference genome.

As shown in FIG. 94, similar distributions of methylation were observed between the methylation levels determined by bisulfite sequencing and by single-molecule real-time sequencing according to this disclosure. FIG. 94 shows histograms of methylation density (in percent) for bisulfite sequencing and single-molecule real-time sequencing (Pacific Biosciences). The y-axis shows the proportion of molecules in the sample having the methylation density shown on the x-axis. This result suggested that it is feasible to determine methylation patterns using Cas9-mediated targeted single-molecule real-time sequencing. It also suggested that methylation can be determined using subread-associated kinetic features, including PW and IPD values, without alignment to a reference genome. As shown in FIG. 94, a substantial number of Alu regions showing hypomethylation were observed, consistent with the previous finding that cancer genomes are demethylated at Alu repeat regions (Rodriguez et al. Nucleic Acids Res. 2008;36:770-784).

FIG. 95 shows, on the y-axis, the distribution of methylation levels determined by single-molecule real-time sequencing according to this disclosure and, on the x-axis, the methylation density determined by bisulfite sequencing. As shown in FIG. 95, the Alu regions were classified into five categories according to their methylation levels from bisulfite sequencing: 0-20%, 20-40%, 40-60%, 60-80%, and 80-100%. The methylation levels of the same set of Alu regions were then determined for each category by the model, using measurement windows that include kinetic features and sequence context (y-axis). The distribution of methylation levels determined by our model rose progressively with the ascending order of methylation levels across the binned categories. Again, these results suggest that it is feasible to determine methylation patterns using Cas9-mediated targeted single-molecule real-time sequencing. Methylation can be determined using subread-associated kinetic features, including PW and IPD values, without alignment to a reference genome.
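
The binned comparison described for FIG. 95 can be sketched as follows: regions are grouped into the five categories by their bisulfite-derived methylation level, and the model-derived levels are then summarized within each category. Summarizing by the median, and assigning a region that falls exactly on a boundary to the higher category (with 100% kept in the top one), are assumptions made for illustration; the names are hypothetical.

```python
from statistics import median

BIN_EDGES = [0, 20, 40, 60, 80, 100]  # percent, giving the five categories


def bin_label(level):
    """Map a methylation level in percent to its category label."""
    for low, high in zip(BIN_EDGES, BIN_EDGES[1:]):
        if low <= level < high or (high == 100 and level == 100):
            return f"{low}-{high}%"
    raise ValueError(f"methylation level out of range: {level}")


def median_predicted_by_bin(regions):
    """regions: iterable of (bisulfite_level, model_level) pairs, both in percent.

    Returns {category label: median model-derived level} for non-empty categories.
    """
    grouped = {}
    for bisulfite_level, model_level in regions:
        grouped.setdefault(bin_label(bisulfite_level), []).append(model_level)
    return {label: median(values) for label, values in grouped.items()}
```

If the two methods agree, the per-category medians should increase from the 0-20% category to the 80-100% category, which is the trend the figure is described as showing.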

In yet another embodiment, other types of CRISPR/Cas systems, for example but not limited to Cas12a, Cas3, and other orthologs (e.g., Staphylococcus aureus Cas9) or engineered Cas proteins (e.g., enhanced Acidaminococcus spp. Cas12a), can be used to perform targeted single-molecule real-time sequencing.

In one embodiment, deactivated Cas9 (dCas9), which lacks nuclease activity, can be used to enrich the target molecules without cutting them. For example, target DNA molecules were bound by a complex comprising biotinylated dCas9 and target-sequence-specific gRNA. Because dCas9 is nuclease-deficient, such target DNA molecules would not be cut by dCas9. The target DNA molecules can then be enriched through the use of streptavidin-coated magnetic beads.

In one embodiment, after incubation with Cas proteins, exonuclease can be used to digest the DNA mixture. The exonuclease would degrade DNA molecules not bound by Cas protein, whereas it would not degrade Cas-protein-bound DNA molecules, or would degrade them with much lower efficiency. Information about the target molecules bound by Cas proteins can therefore be further enriched in the final sequencing results.

FIG. 96 shows a table of tissues and the methylation levels of Alu regions within those tissues. Many tissues show methylation levels in the range of 85% to 92%, including the range of 88% to 92%. HCC tumor tissue and placental tissue showed methylation levels below 80%. As seen in FIG. 96, HCC tumors were shown to be frequently hypomethylated in the Alu regions targeted by our design. The methylation determination of Alu regions presented in this disclosure can therefore be used for the detection, staging, and monitoring of cancer during tumor progression or tumor treatment, using DNA extracted from tumor biopsies or from other tissues or cells.

The hypomethylation of placental tissue across Alu regions can be used to perform non-invasive prenatal testing using plasma DNA from pregnant women. For example, a higher degree of hypomethylation may indicate a higher fetal DNA fraction in a pregnant woman. In another example, if a woman is carrying a fetus with a chromosomal aneuploidy, the number of Alu fragments from the affected chromosome detected by this approach would differ quantitatively (i.e., either increased or decreased) from that in women carrying euploid fetuses. Thus, if the fetus has trisomy 21, the number of Alu fragments from chromosome 21 detected by this approach would be increased compared with women carrying euploid fetuses. Conversely, if the fetus is monosomic for a chromosome, the number of Alu fragments from that chromosome detected by this approach would be decreased compared with women carrying euploid fetuses. Determining whether an affected chromosome (13, 18, or 21) presents extra hypomethylation in plasma relative to unaffected chromosomes can be used as a molecular indicator for distinguishing women carrying normal fetuses from women carrying abnormal fetuses.
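
One way to quantify whether the number of Alu fragments from a chromosome is increased or decreased is to compare its share of all detected Alu fragments with the same share in a set of euploid reference pregnancies. The sketch below does this with a z-score. The z-score statistic, the reference set, and any decision threshold are assumptions for illustration and are not taken from the text; the function names and count layout are hypothetical.

```python
from statistics import mean, stdev


def chromosome_fraction(counts, chromosome):
    """Share of all detected Alu fragments that come from `chromosome`.

    counts: dict mapping chromosome name -> number of Alu fragments detected.
    """
    return counts[chromosome] / sum(counts.values())


def dosage_z_score(test_counts, reference_counts_list, chromosome="chr21"):
    """Z-score of the test sample's chromosome share against euploid references.

    reference_counts_list: count dicts from at least two euploid pregnancies.
    A large positive value is consistent with a gain (e.g., trisomy) and a
    large negative value with a loss (e.g., monosomy).
    """
    reference = [chromosome_fraction(c, chromosome) for c in reference_counts_list]
    test = chromosome_fraction(test_counts, chromosome)
    return (test - mean(reference)) / stdev(reference)
```

In practice the reference set would need to be large enough for the standard deviation to be stable, and the same calculation could be restricted to hypomethylated Alu fragments to weight the signal toward placenta-derived DNA.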

3. Methylation Analysis of Alu Regions Targeted by Cas9 Complexes for Different Types of Cancer
Although the targeted Alu repeats were highly methylated in different tissues, we hypothesized that different types of cancer would have different demethylation patterns across those Alu repeats. In one embodiment, Cas9-based targeted single-molecule real-time sequencing can be used to analyze methylation patterns and determine different cancer types according to the disclosure presented herein.

FIG. 97 shows a cluster analysis of the methylation signals associated with Alu repeats for different types of cancer. Cancer subjects from the TCGA database (www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) had methylation status available at CpG sites analyzed by microarray technology (Infinium HumanMethylation450 BeadChip, Illumina Inc). The methylation status was analyzed across 3,024 CpG sites that were present on the microarray chip and overlapped the Alu regions targeted by the CRISPR/Cas9 complexes. Each patient had a number of CpG sites originating from the Alu regions of interest. The methylation level of each CpG was quantified by the microarray (also called the methylation index or beta value). Hierarchical cluster analysis was performed on the basis of the methylation levels at those CpG sites across patients, so that patients with similar patterns of methylation levels at those CpG sites were grouped together to form a clade. The similarity of methylation patterns across different patients is indicated by the height values in the clustering dendrogram. In this example, the height was calculated according to Euclidean distance. In other embodiments, other distance metrics are used, including but not limited to Minkowski, Chebyshev, Mahalanobis, Manhattan, cosine, correlation, Spearman, Hamming, and Jaccard distances. Height, as used herein, represents the value of the distance metric between clusters and reflects how related the clusters are. For example, if two clusters are observed to merge at a height of x, the distance between those clusters is x (e.g., the average distance between all pairs of patients drawn from the two clusters).
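
The meaning of dendrogram height can be illustrated with a small agglomerative clustering sketch using Euclidean distance. Average linkage is used here because it matches the parenthetical example of height as the average distance between patients in two clusters; the linkage actually used for FIG. 97 is not stated, so this choice is an assumption, as are the patient labels and beta values in the usage example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping patient label -> list of beta values (same CpG order).
    Returns the merge steps as (members_a, members_b, height), where height is
    the mean Euclidean distance over all pairs of patients from the two clusters.
    """
    def height(a, b):
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(label,) for label in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: height(*pair))
        merges.append((a, b, height(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


if __name__ == "__main__":
    example = {"HCC-1": [0.2, 0.3], "HCC-2": [0.2, 0.4], "OV-1": [0.9, 0.8]}
    for a, b, h in average_linkage(example):
        print(a, "+", b, "merge at height", round(h, 3))
```

Patients with similar beta-value profiles merge at low heights and form a clade, while clusters of dissimilar patients merge only near the top of the dendrogram.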

Using the methylation status of the CpG sites, the cluster analysis grouped patients into distinct clusters according to cancer type. The cancer types included bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), ovarian serous cystadenocarcinoma (OV), pancreatic adenocarcinoma (PAAD), HCC, lung adenocarcinoma (LUAD), stomach adenocarcinoma (STAD), skin cutaneous melanoma (SKCM), and uterine carcinosarcoma (UCS). The number following the cancer type in the figure indicates the patient. The clustering thus suggests that the methylation signals of the Alu repeats we selected are informative for classifying cancer types, including cancer types not shown in FIG. 97. In one embodiment, primary and secondary tumors can be distinguished on the basis of the methylation patterns in a tissue biopsy.

4. Subread Depth and Size Cutoffs
This section shows that subread depth and/or size cutoffs can be used to improve the accuracy and/or efficiency of methylation detection. Library preparation may be modified in order to test particular subread depths or sizes.

Based on Sequel II Sequencing Kit 2.0, the effect of read depth on the quantification of the overall methylation level was analyzed in test datasets generated from samples after whole genome amplification or M.SssI treatment. Genomic sites covered by at least a given cutoff number of subreads were examined, with cutoffs of, for example but not limited to, 1×, 10×, 20×, 30×, 40×, 50×, 60×, 70×, 80×, 90×, 100×, etc.

FIG. 98A shows the effect of read depth on the quantification of the overall methylation level in the test dataset involving whole genome amplification. FIG. 98B shows the effect of read depth on the quantification of the overall methylation level in the test dataset involving M.SssI treatment. The y-axis shows the overall methylation level as a percentage. The x-axis shows the subread depth. The dashed line indicates the expected overall methylation level.

As shown in FIG. 98A, for the dataset involving whole genome amplification, the overall methylation level decreased over the first few cutoffs (e.g., 1×, 10×, 20×, 40×, and 50×), ranging from 5.7% to 5.2%. At cutoffs of 50× and above, the methylation level gradually stabilized at approximately 5%.

In contrast, as shown in FIG. 98B, for the dataset generated from samples after M.SssI treatment, the overall methylation level increased over the first few cutoffs (e.g., 1×, 10×, 20×, 40×, and 50×), ranging from 70% to 83%. At cutoffs of 50× and above, the methylation level gradually stabilized at approximately 83%.
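The depth-cutoff analysis described above can be sketched as follows. This is a minimal illustration only, assuming that each CpG site has already been assigned a subread depth and a binary methylation call; the `CpGSite` record, the `global_methylation` function, and the toy values are hypothetical and are not taken from the disclosure or from FIG. 98A or FIG. 98B.

```python
from typing import NamedTuple, Optional, Sequence


class CpGSite(NamedTuple):
    subread_depth: int  # number of subreads covering the site (assumed precomputed)
    methylated: bool    # binary methylation call for the site (assumed precomputed)


def global_methylation(sites: Sequence[CpGSite], min_depth: int) -> Optional[float]:
    """Percentage of methylated sites among sites with depth >= min_depth."""
    kept = [s for s in sites if s.subread_depth >= min_depth]
    if not kept:
        return None  # no site passes the cutoff
    return 100.0 * sum(s.methylated for s in kept) / len(kept)


# Toy data for illustration only.
toy_sites = [
    CpGSite(5, True), CpGSite(12, False), CpGSite(35, True),
    CpGSite(60, True), CpGSite(80, False), CpGSite(120, True),
]

for cutoff in (1, 10, 50, 200):
    print(cutoff, global_methylation(toy_sites, cutoff))
```

Raising `min_depth` shrinks the set of retained sites, which is why the estimate can drift at low cutoffs and why very strict cutoffs may leave too few sites to analyze.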

In one embodiment, the subread depth cutoff can be adjusted so that the performance of base modification analysis is acceptable for different applications. In other embodiments, a somewhat relaxed subread depth cutoff can be used to obtain more ZMWs (i.e., a greater number of molecules) suitable for downstream analysis. In yet another embodiment, the methylation level readout determined by SMRT-seq according to the present disclosure can be calibrated against a second measurement (for example, but not limited to, BS-seq, digital droplet PCR (on bisulfite-converted samples), methylation-specific PCR, or methylated-cytosine-binding antibodies or other proteins). In another embodiment, the second measurement is obtained by subjecting DNA molecules from 5mC-retaining whole genome amplification to BS-seq, digital droplet PCR (on bisulfite-converted samples), methylation-specific PCR, or methyl-CpG binding domain (MBD) protein-enriched genome sequencing (MBD-seq). As an example, 5mC-retaining whole genome amplification may be mediated by the DNA primase TthPrimPol, the polymerase phi29, and DNMT1 (DNA methyltransferase 1).

Methylation levels across various types of cancer and non-tumor tissues were analyzed at different subread depths. The methylation levels determined by SMRT-seq according to the present disclosure were also compared with BS-seq results. Using Sequel II Sequencing Kit 2.0, a median of 43 million subreads was obtained (interquartile range (IQR): 30 to 52 million), which allowed the generation of a median of 4.6 million circular consensus sequences (CCSs) aligned to the human reference genome (IQR: 2.8 to 5.8 million). Of these samples, 22 were also subjected to well-established massively parallel bisulfite sequencing (BS-seq) for determining methylation patterns, providing a second measurement against which the methylation levels could be compared.

FIG. 99 shows a comparison between the overall methylation levels determined by SMRT-seq (Sequel II Sequencing Kit 2.0) according to the present disclosure and those determined by BS-seq, using different subread depth cutoffs. The methylation level determined by SMRT-seq, as a percentage, is shown on the y-axis. The methylation level determined by bisulfite sequencing, as a percentage, is shown on the x-axis. The symbols indicate the different subread depths of 1×, 10×, and 30×. The three diagonal lines are the fitted lines for the different subread depths.

FIG. 99 shows that, when genomic sites covered at least once by subreads were analyzed (i.e., a subread depth cutoff of ≥1×), the methylation levels of CpG sites determined by SMRT-seq according to the present disclosure correlated well with those determined by BS-seq (r = 0.8, P value < 0.0001). These results suggest that the embodiments of the present disclosure can be used to measure methylation levels in different tissue types, including but not limited to colorectal cancer, colorectal tissue, esophageal cancer, esophageal tissue, breast cancer, non-cancerous breast tissue, renal cell carcinoma, kidney tissue, lung cancer, and lung tissue. It was also observed that raising the subread depth cutoff to 10× and 30× improved the correlation between the two measurements to 0.87 (P value < 0.0001) and 0.95 (P value < 0.0001), respectively. In some embodiments, increasing the subread depth, or selecting genomic regions covered by more subreads, would improve the performance of SMRT-seq-based methylation determination according to the present disclosure.

FIG. 100 is a table showing the effect of subread depth on the correlation of methylation levels between the two measurements, SMRT-seq (Sequel II Sequencing Kit 2.0) and BS-seq. The first column shows the subread depth cutoff. The second column shows the correlation coefficient, Pearson's r. The third column shows the number of CpG sites associated with the cutoff, with the range of the number of sites in parentheses.

As shown in FIG. 100, the correlation of methylation levels between the two measurements, SMRT-seq and BS-seq, varied with the subread depth cutoff. In one embodiment, the relationship between the subread depth cutoff and the correlation coefficient between the two measurements (e.g., Pearson's correlation coefficient) can be used to determine the optimal subread depth cutoff for distinguishing methylated cytosines from unmethylated cytosines. FIG. 100 shows that, at a subread depth cutoff of 30× (i.e., ≥30×), the methylation levels measured by SMRT-seq according to the present disclosure showed the highest correlation with the results generated by BS-seq (Pearson's r = 0.952). In other embodiments, other subread depth cutoffs can be used, including but not limited to 1×, 10×, 30×, 40×, 50×, 60×, 70×, 80×, 900×, 100×, 200×, 300×, 400×, 500×, 600×, 700×, 800×, etc.

The number of CpG sites available for methylation analysis decreases as the subread depth cutoff increases, as shown in FIG. 100. At a subread depth cutoff of 100×, a lower correlation between the two measurements of methylation level (Pearson's r = 0.875) was observed than at a subread depth cutoff of 30× (Pearson's r = 0.952). The lower correlation at the higher subread depth cutoff may be due to the smaller number of CpG sites that satisfy the more stringent cutoff. In one embodiment, a trade-off can be considered between the subread depth requirement and the number of molecules that can be used for methylation analysis. For example, when the aim is to scan the entire genome for methylation patterns, a larger number of molecules may be desirable. When targeted SMRT-seq is used to focus on particular regions, a higher subread depth may be desirable for obtaining the methylation pattern of those regions.
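Choosing a cutoff from the relationship between subread depth and the correlation coefficient can be sketched as below. This is a hedged illustration, assuming per-sample overall methylation levels are available for each cutoff; the function names `pearson_r` and `best_cutoff` and all toy values are hypothetical and do not reproduce FIG. 99 or FIG. 100.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_cutoff(
    smrt_by_cutoff: Dict[int, Sequence[float]], bs_levels: Sequence[float]
) -> Tuple[int, Dict[int, float]]:
    """Return the cutoff whose SMRT-seq levels correlate best with BS-seq."""
    r_by_cutoff = {c: pearson_r(v, bs_levels) for c, v in smrt_by_cutoff.items()}
    return max(r_by_cutoff, key=r_by_cutoff.get), r_by_cutoff


# Toy per-sample methylation levels (%), for illustration only.
bs_toy = [60.0, 70.0, 80.0, 75.0]
smrt_toy = {
    1: [50.0, 72.0, 70.0, 80.0],   # noisier estimates at a lenient cutoff
    30: [61.0, 69.0, 81.0, 76.0],  # closer agreement at a stricter cutoff
}
chosen, r_values = best_cutoff(smrt_toy, bs_toy)
print(chosen, {c: round(r, 3) for c, r in r_values.items()})
```

In practice the number of CpG sites surviving each cutoff would be reported alongside r, since a stricter cutoff with too few sites can lower the correlation again.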

FIG. 101 shows the distribution of subread depth with respect to fragment size for data generated with Sequel II Sequencing Kit 2.0. The y-axis shows the subread depth, and the x-axis shows the length of the DNA molecule. The length of a DNA molecule was inferred from the size of its circular consensus sequence (CCS).

Because subread depth can affect the performance of methylation determination using SMRT-seq data, and because subread depth is a function of the length of the DNA molecule being sequenced, the size of the DNA molecules may be important for obtaining a subread depth that is optimal for analyzing the methylation patterns of a sample. As shown in FIG. 101, the longer the DNA, the lower the subread depth. For example, for the population of molecules 1 kb in size, the median subread depth was 50×. For the population of molecules 10 kb in size, the median subread depth was 15×.

In one embodiment, as shown in FIG. 100, the optimal subread depth cutoff may be at least 30×, which yields the highest correlation coefficient. To further improve the throughput of molecules that satisfy the optimal 30× subread depth cutoff, the relationship between subread depth and the length of the DNA template molecule can be exploited. For example, in FIG. 101, 30× is the median subread depth for molecules approximately 4 kb in length. Accordingly, before the SMRT-seq library is prepared, DNA molecules of 4 kb can be fractionated so that sequencing is restricted to 4 kb DNA molecules. In other embodiments, other size cutoffs can be used for fractionating DNA molecules, including but not limited to 100 bp, 200 bp, 300 bp, 400 bp, 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 9 kb, 10 kb, 20 kb, 30 kb, 40 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, 500 kb, 1 Mb, or combinations of different size cutoffs.
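One way to turn the depth-versus-length relationship into a size cutoff is sketched below. It uses only the three median values quoted in the text (1 kb at 50×, about 4 kb at 30×, 10 kb at 15×) and assumes, purely for illustration, that the median depth can be interpolated linearly between them; the real curve in FIG. 101 need not be linear, and `length_for_median_depth` is a hypothetical helper.

```python
from typing import List, Tuple

# (molecule length in kb, median subread depth) as quoted in the text.
MEDIAN_DEPTH_BY_KB: List[Tuple[float, float]] = [(1.0, 50.0), (4.0, 30.0), (10.0, 15.0)]


def length_for_median_depth(
    target_depth: float, points: List[Tuple[float, float]] = MEDIAN_DEPTH_BY_KB
) -> float:
    """Length (kb) at which the interpolated median depth equals target_depth.

    Assumes depth decreases with length and is piecewise linear between points.
    """
    for (l0, d0), (l1, d1) in zip(points, points[1:]):
        if d1 <= target_depth <= d0:
            frac = (d0 - target_depth) / (d0 - d1)
            return l0 + frac * (l1 - l0)
    raise ValueError("target depth outside the tabulated range")


for depth in (40.0, 30.0, 20.0):
    print(depth, length_for_median_depth(depth))
```

Under these assumptions, molecules at or below the returned length would be expected to reach the target median depth, which is the rationale for size-fractionating the library before sequencing.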

5. Restriction enzyme-based targeted single-molecule real-time sequencing
This section describes the use of restriction enzymes to improve the practicality and/or throughput and/or cost-effectiveness of detecting modifications. DNA fragments generated with restriction enzymes can be used to determine the origin of a sample.

a) Digesting DNA molecules with restriction enzymes
In embodiments, one or more restriction enzymes can be used to digest DNA molecules prior to single-molecule real-time sequencing (e.g., using a Pacific Biosciences system). Because the recognition sites of a restriction enzyme are unevenly distributed in the human genome, DNA digested by a restriction enzyme can produce a skewed size distribution. Genomic regions with more recognition sites are digested into smaller fragments, whereas genomic regions with fewer recognition sites are digested into longer fragments. In embodiments, a size range can be used to selectively obtain DNA molecules derived from one or more regions that have similar cleavage patterns for the one or more restriction enzymes. The size range required for size selection can be determined by in silico cleavage analysis of the one or more restriction enzymes. A computer program can be used to determine the number of recognition sites for the restriction enzyme of interest in a reference genome (e.g., a human reference genome). The reference genome is then cut in silico into fragments at those recognition sites, which provides size information for the genomic regions of interest.

FIG. 126 shows a method of MspI-based targeted single-molecule real-time sequencing using DNA end repair and A-tailing. In embodiments, as shown in FIG. 126, MspI, which recognizes the 5'C^CGG3' site, can be used to digest a DNA sample of an organism, for example but not limited to a human DNA sample. The digested DNA fragments with 5'CG overhangs were subjected to size selection to enrich for DNA molecules derived from CpG islands. Genomic regions enriched in G and C residues (i.e., with high GC content) may yield shorter fragments. The range of fragment sizes to select can therefore be determined based on the GC content of the regions of interest. Various DNA fragment size selection tools are available to those skilled in the art, including but not limited to gel electrophoresis, size exclusion electrophoresis, capillary electrophoresis, chromatography, mass spectrometry, filtration approaches, precipitation-based approaches, microfluidics, and nanofluidics. The size-fractionated DNA molecules were subjected to DNA end repair and A-tailing, and the desired DNA products were ligated with hairpin adapters carrying a 5'T overhang to form circular DNA templates.

After unligated adapters, linear DNA, and incompletely circularized DNA are removed using, for example but not limited to, exonucleases (e.g., exonuclease III and VII), the DNA molecules ligated to hairpin adapters can be used for single-molecule real-time sequencing to determine the IPD, PW, and sequence context used in determining methylation profiles as disclosed herein. By analyzing genomic regions enriched for CpG sites, DNA obtained from different tissues, or from tissues or biological samples with different diseases and/or physiological conditions, can be distinguished and classified by the methylation profiles determined with the sequencing data analysis methods of the present disclosure.

In embodiments, for the step involving size selection in FIG. 126, the desired size range can be determined by in silico cleavage analysis of MspI. A total of 2,286,541 MspI cleavage sites were identified in the human reference genome. The human reference genome was cut in silico into fragments at those MspI cleavage sites, yielding a total of 2,286,565 fragments. The size of each fragment was determined as the total number of nucleotides in that fragment.
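The in silico cleavage analysis can be sketched as follows. This is a minimal illustration on toy sequences, assuming a single-strand cut after the first C of each CCGG site (C^CGG); the `digest` and `count_sites` helpers and the toy "genome" are hypothetical. A linear sequence with n cut sites yields n + 1 fragments, so the total fragment count equals the total site count plus the number of sequences digested. The 24 extra fragments reported above (2,286,565 versus 2,286,541 sites) would be consistent with 24 separately digested sequences, although the text does not state which sequences were included.

```python
import re
from typing import Dict, List


def count_sites(seq: str, site: str = "CCGG") -> int:
    """Number of occurrences of the recognition site in seq."""
    return len(re.findall(re.escape(site), seq.upper()))


def digest(seq: str, site: str = "CCGG", cut_offset: int = 1) -> List[str]:
    """Cut seq at every recognition site; cut_offset=1 models C^CGG."""
    cuts = [m.start() + cut_offset for m in re.finditer(re.escape(site), seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy sequences standing in for chromosomes; not real genomic data.
toy_genome: Dict[str, str] = {
    "chrA": "AACCGGTTTCCGGA",
    "chrB": "TTTTCCGGAAAA",
    "chrC": "ACGTACGT",
}

total_sites = sum(count_sites(s) for s in toy_genome.values())
fragments = [f for s in toy_genome.values() for f in digest(s)]
fragment_sizes = sorted(len(f) for f in fragments)
print(total_sites, len(fragments), fragment_sizes)
```

The resulting list of fragment sizes is what a size distribution such as the one in FIGS. 127A and 127B would be built from.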

FIGS. 127A and 127B show the size distribution of MspI-digested fragments. The y-axis in these figures is the frequency (in percent) of fragments of a given size. FIG. 127A has a logarithmic x-axis ranging from 50 to 500,000 bp. FIG. 127B has a linear x-axis ranging from 50 to 1,000 bp.

As shown in FIGS. 127A and 127B, MspI-digested DNA molecules have a skewed size distribution. The median size of the MspI-digested fragments was 404 bp (IQR: 98 to 1,411 bp). Approximately 53% of the MspI-digested fragments were shorter than 1 kb. The size profile contained a series of spike-like peaks that may be attributable to repeat elements. Certain repeat elements may share a similar pattern of MspI cleavage sites, giving rise to sets of MspI-digested molecules with similar fragment sizes. For example, the spike with the highest frequency (49,079 fragments in total) corresponded to a size of 64 bp. Of those fragments, 45,894 (94%) overlapped with Alu repeats. DNA molecules 64 bp in size can therefore be selected to enrich for DNA molecules derived from Alu repeats. These data suggest that size selection can be used to enrich for the desired DNA molecules for downstream methylation analysis according to the present disclosure.

FIG. 128 shows a table of the number of DNA molecules in certain selected size ranges. The first column shows the size range in base pairs. The second column shows the percentage of molecules within the size range relative to all fragments. The third column shows the number of molecules within the size range that overlap CpG islands. The fourth column shows the percentage of molecules within the size range that overlap CpG islands. The fifth column shows the number of CpG sites that would be sequenced. The sixth column shows the number of CpG sites located within CpG islands. The seventh column shows the percentage of CpG sites subject to size selection that are located within CpG islands. As shown in FIG. 128, the amount of DNA molecules generated from the human genome by MspI digestion varied with the size range in question. The number of DNA molecules overlapping CpG islands also varies with the size range.

Because CCGG motifs occur preferentially in CpG islands, molecules smaller than a certain size cutoff can be selected to enrich for DNA molecules derived from CpG islands. For example, in the 50 to 200 bp size range, the number of molecules was 526,543, accounting for 23.03% of all DNA fragments derived from the MspI-digested human genome. Of those 526,543 DNA molecules, 104,079 (19.76%) overlapped CpG islands. In the 600 to 800 bp size range, the number of molecules was 133,927, accounting for 5.86% of all DNA fragments derived from the MspI-digested human genome. Of those 133,927 molecules, 3,673 (2.74%) overlapped CpG islands. As an example, the 50 to 200 bp size range can be selected to enrich for DNA fragments derived from CpG islands.
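The per-size-range statistic described above (the fraction of size-selected fragments that overlap a CpG island) can be sketched as follows. This is an illustrative outline under stated assumptions: fragments and islands are given as half-open (chromosome, start, end) intervals, the quadratic overlap scan is only suitable for toy inputs, and `island_overlap_fraction` and the coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def island_overlap_fraction(
    fragments: List[Interval], islands: List[Interval], min_size: int, max_size: int
) -> Tuple[int, float]:
    """Count size-selected fragments and the percentage overlapping any island."""
    selected = [f for f in fragments if min_size <= f[2] - f[1] <= max_size]
    if not selected:
        return 0, 0.0
    hits = sum(any(overlaps(f, i) for i in islands) for f in selected)
    return len(selected), 100.0 * hits / len(selected)


# Toy coordinates for illustration only.
toy_fragments = [
    ("chr1", 100, 180), ("chr1", 500, 1200),
    ("chr1", 2000, 2150), ("chr2", 100, 160),
]
toy_islands = [("chr1", 150, 400), ("chr2", 1000, 1500)]

print(island_overlap_fraction(toy_fragments, toy_islands, 50, 200))
print(island_overlap_fraction(toy_fragments, toy_islands, 600, 800))
```

The percentages quoted in the text follow from the same ratio, for example 104,079 / 526,543 for the 50 to 200 bp range and 3,673 / 133,927 for the 600 to 800 bp range.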

To calculate the degree of enrichment of CpG sites overlapping CpG islands achieved by MspI-based targeted single-molecule real-time sequencing, DNA sheared by sonication was simulated: 526,543 fragments generated from ZMWs were simulated, with sizes following a normal distribution with a mean of 200 bp and a standard deviation of 20 bp. Only 0.88% of these DNA molecules overlapped CpG islands, and a total of 71,495 CpG sites overlapped CpG islands. As shown in FIG. 128, when MspI-digested fragments in the 50 to 200 bp range are selected, 19.8% of the fragments overlap CpG islands. These data therefore suggest that DNA prepared by MspI digestion may be enriched 22.5-fold for DNA fragments derived from CpG islands compared with DNA prepared by sonication. The CpG sites enriched in CpG islands through MspI digestion were also analyzed. Selecting MspI-digested fragments in the 50 to 200 bp range could yield 885,041 CpG sites overlapping CpG islands, accounting for 37.5% of the total CpG sites from the sequenced fragments within that size range. Compared with DNA prepared by sonication, CpG sites overlapping CpG islands were enriched 12.3-fold (i.e., 885,041 / 71,495). Based on the information presented in FIG. 128, a suitable size range can be selected to obtain the desired number of CpG sites and the desired fold enrichment of CpG sites within CpG islands.
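The two enrichment figures are simple ratios of the MspI-based value to the sonication-based value, which can be checked with the short sketch below. The `fold_enrichment` helper is a hypothetical name; the input numbers are the ones quoted in the text. Evaluated directly, the site-level ratio 885,041 / 71,495 is approximately 12.38, which the text reports as 12.3.

```python
def fold_enrichment(target: float, baseline: float) -> float:
    """Ratio of a value in the enriched library to the value in the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return target / baseline


# Fragment level: percentage of fragments overlapping CpG islands.
fragment_fold = fold_enrichment(19.8, 0.88)

# Site level: number of CpG sites overlapping CpG islands.
site_fold = fold_enrichment(885_041, 71_495)

print(f"fragment-level fold enrichment: {fragment_fold:.1f}")
print(f"site-level fold enrichment: {site_fold:.2f}")
```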

FIG. 129 is a graph of the percent coverage of CpG sites within CpG islands against the size of DNA fragments after restriction enzyme digestion. The y-axis shows the percentage of CpG sites within CpG islands that are covered by fragments in a given size range. The x-axis shows the upper limit of the size range of DNA fragments after restriction enzyme digestion. FIG. 129 thus shows the percentage of CpG sites within CpG islands that are covered as the size selection range is widened. In FIG. 129, the size range runs from 50 bp to the size indicated on the x-axis. In other embodiments, the lower limit of the size range can be customized and may be, for example but not limited to, 60 bp, 70 bp, 80 bp, 90 bp, 100 bp, 200 bp, 300 bp, 400 bp, or 500 bp. As the size range is widened by raising the upper limit, the percent coverage of CpG sites within CpG islands gradually increases and then plateaus at 65%. Some CpG sites are not covered because they lie within DNA fragments shorter than 50 bp or within very long fragments (e.g., > 100,000 bp).

In some embodiments, a DNA sample can be analyzed using two or more different restriction enzymes (with different restriction sites), thereby increasing the coverage of CpG sites within CpG islands. Digestion of the DNA sample with the different enzymes can be carried out in separate reactions so that only one restriction enzyme is present in each reaction. For example, AccII, which recognizes CG^CG sites, can be used to preferentially cut CpG islands. In other embodiments, other restriction enzymes that contain a CG dinucleotide as part of the recognition site can be used. There were 678,669 AccII cleavage sites in the human genome. In silico cleavage of the human reference genome with AccII yielded a total of 678,693 fragments. In silico size selection of these fragments was then performed, and the percent coverage of CpG sites within CpG islands was calculated following the method described above for MspI digestion. The percent coverage of CpG sites gradually increased as the size selection range was widened and plateaued at approximately 50%. Coverage of CpG sites increases further when data from the two enzyme digestion experiments (i.e., MspI digestion and AccII digestion) are combined: 80% of the CpG sites within CpG islands are covered by selecting DNA fragments 50 bp to 400 bp in size. This percentage is higher than the corresponding figure for digestion with either enzyme alone. Coverage can be increased further by analyzing the DNA sample with additional restriction enzymes. In one example, a DNA sample is divided into two aliquots; one aliquot is digested with MspI and the other with AccII. The two digested DNA samples are mixed at equimolar concentrations and sequenced using single-molecule real-time sequencing with 5 million ZMWs. Based on in silico analysis, 83% of the CpG sites within CpG islands (i.e., 1,734,345 sites) would be sequenced at least four times in terms of circular consensus sequences.
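The gain from combining two digests can be illustrated as a set union. The sketch below assumes that the CpG-island CpG sites covered by the size-selected fragments of each enzyme have already been identified; the site identifiers and the `percent_coverage` helper are hypothetical, and the toy percentages are not the values reported in the text.

```python
from typing import Set


def percent_coverage(covered: Set[int], all_sites: Set[int]) -> float:
    """Percentage of all_sites that appear in covered."""
    if not all_sites:
        raise ValueError("all_sites must not be empty")
    return 100.0 * len(covered & all_sites) / len(all_sites)


# Toy site identifiers for illustration only.
island_cpg_sites = set(range(10))
covered_by_mspi = {0, 1, 2, 3, 4, 5}
covered_by_accii = {4, 5, 6, 7}

print("MspI only:", percent_coverage(covered_by_mspi, island_cpg_sites))
print("AccII only:", percent_coverage(covered_by_accii, island_cpg_sites))
print("combined:", percent_coverage(covered_by_mspi | covered_by_accii, island_cpg_sites))
```

Because sites covered by both digests are counted once, the combined coverage is at most the sum of the individual coverages and at least the larger of the two.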

FIG. 130 shows MspI-based targeted single-molecule real-time sequencing without DNA end repair and A-tailing. In embodiments, ligation between the digested DNA molecules and the hairpin adapters can be performed without the DNA end repair and A-tailing processes. Digested DNA molecules with 5'CG overhangs can be ligated directly to hairpin adapters with 5'CG overhangs to form circular DNA templates for single-molecule real-time sequencing. After unligated adapters and self-ligated adapter dimers are cleaned up and, in some embodiments, after unligated adapters, linear DNA, and incompletely circularized DNA are removed, the DNA molecules ligated to hairpin adapters are suitable for single-molecule real-time sequencing to obtain IPD, PW, and sequence context. The methylation profile of a single molecule would be determined using the IPD, PW, and sequence context according to the present disclosure.

FIG. 131 shows MspI-based targeted single-molecule real-time sequencing with a reduced likelihood of adapter self-ligation. The underlined cytosine base indicates a base without a 5' phosphate group. In some embodiments, to minimize the formation of self-ligated adapter dimers that can occur during adapter ligation, dephosphorylated hairpin adapters can be used for adapter ligation with the MspI-digested DNA molecules. These dephosphorylated hairpin adapters cannot form self-ligated adapter dimers because they lack a 5' phosphate group. After ligation, the products are subjected to an adapter clean-up step to purify the DNA molecules ligated to hairpin adapters. The DNA molecules ligated to hairpin adapters, which may carry nicks, are further subjected to phosphorylation (e.g., by T4 polynucleotide kinase) and nick sealing by a DNA ligase (e.g., T4 DNA ligase). In embodiments, unligated adapters, linear DNA, and incompletely circularized DNA can be further removed. The DNA molecules ligated to hairpin adapters are suitable for single-molecule real-time sequencing to obtain IPD, PW, and sequence context. The methylation profile of a single molecule would be determined using the IPD, PW, and sequence context according to the present disclosure.

In addition to MspI, other restriction enzymes can be used, such as SmaI, whose recognition site is CCCGGG.

In some embodiments, the desired size selection can be performed after the DNA end-repair step. In some embodiments, once the effect of the hairpin adapters on the size-selection result has been determined, the desired size selection can be performed after the hairpin adapters are ligated. In these and other embodiments, the order of the procedural steps in MspI-based targeted single-molecule real-time sequencing may vary with the experimental circumstances.

In embodiments, size selection is performed using gel electrophoresis-based methods and/or magnetic bead-based methods. In embodiments, the restriction enzymes include, but are not limited to, BglII, EcoRI, EcoRII, BamHI, HindIII, TaqI, NotI, HinfI, PvuII, Sau3AI, SmaI, HaeIII, HgaI, HpaII, AluI, EcoRV, EcoP15I, KpnI, PstI, SacI, SalI, ScaI, SpeI, SphI, StuI, XbaI, and combinations thereof.

b) Distinguishing biological sample types by methylation
This section describes how methylation profiles determined from fragments generated by restriction enzyme digestion can be used to help distinguish between different biological samples.

Methylation profiles determined by MspI-based single-molecule real-time sequencing according to embodiments of the present disclosure were used to assess differences in methylation between biological samples. Placental tissue DNA and buffy coat DNA samples were taken as an example. Computer simulations were performed to generate data for placental and buffy coat DNA samples based on MspI-based targeted single-molecule real-time sequencing. The simulations were based on kinetic values, including the IPD and PW of each nucleotide, previously generated by SMRT sequencing of placental tissue DNA and buffy coat DNA at whole-genome coverage using the Sequel II Sequencing Kit 1.0. The simulation modeled subjecting the placental and buffy coat DNA samples to MspI digestion followed by gel-based size selection with a size range of 50-200 bp. The selected DNA molecules were ligated with hairpin adapters to form circular DNA templates, which were subjected to single-molecule real-time sequencing to obtain IPD, PW, and sequence context.
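The in silico digestion and size-selection steps described above can be sketched as follows. This is a minimal illustration, not the simulation actually used in this disclosure: it assumes an uppercase top-strand sequence, models MspI as cutting C^CGG (which leaves the 5' CG overhang mentioned earlier), and ignores the bottom strand. The function names `mspi_fragments` and `size_select` are hypothetical.

```python
def mspi_fragments(sequence: str) -> list[str]:
    """Split a top-strand sequence at every CCGG site, cutting C^CGG."""
    site = "CCGG"
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + 1  # MspI cleaves between the first C and the CGG
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # trailing piece after the last site
    return fragments


def size_select(fragments: list[str], min_len: int = 50, max_len: int = 200) -> list[str]:
    """Keep fragments whose length falls in the 50-200 bp window (inclusive)."""
    return [f for f in fragments if min_len <= len(f) <= max_len]
```

Applying `size_select(mspi_fragments(seq))` to a reference sequence would give the pool of fragments from which simulated subreads are drawn; the inclusive bounds are an assumption, since the text gives only the 50-200 bp range.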

Assuming 500,000 ZMWs generating SMRT sequencing subreads, the subreads followed the genomic distribution of MspI-digested fragments within the 50-200 bp size range, as shown in Table 1. The subread depth was assumed to be 30x for both the placental and the buffy coat DNA samples. The simulation was repeated 10 times each for the placental DNA sample and the buffy coat DNA sample. The dataset generated in silico for MspI-based targeted single-molecule real-time sequencing therefore comprised 10 placental DNA samples and 10 buffy coat DNA samples. The dataset was further analyzed by the CNN to determine the methylation profile of each sample according to this disclosure. A median of 9,198 CpG sites originating from CpG islands was obtained (range: 5,497-13,928), accounting for 13.6% of all sequenced CpG sites (range: 45,304-90,762). The methylation state of each CpG site in each molecule was determined by the CNN model according to this disclosure.

FIG. 132 is a graph of the overall methylation levels of placental and buffy coat DNA samples determined by MspI-based targeted single-molecule real-time sequencing. The y-axis is the methylation level as a percentage. The x-axis lists the sample type. FIG. 132 shows that the overall methylation level was lower in the placental samples (median: 57.6%; range: 56.9%-59.1%) than in the buffy coat samples (median: 69.5%; range: 68.9%-70.4%) (P value < 0.0001, Mann-Whitney U test). These results suggest that methylation profiles determined by MspI-based single-molecule real-time sequencing can be used to distinguish tissue samples or biological samples on the basis of methylation differences. Because these data show that placenta-derived DNA can be distinguished from buffy coat DNA by the methylation differences detected with MspI-based single-molecule real-time sequencing, the method can be applied to measuring the fetal DNA fraction in maternal plasma. Fetal DNA in maternal plasma or maternal serum is derived from the placenta, whereas the remaining DNA molecules in the sample are derived mainly from maternal buffy coat cells, so methylation can be used to measure the fetal DNA fraction. In embodiments, this technique is a useful tool for distinguishing different tissues, tissues with different disease and/or physiological states, or different biological samples.
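One simple way to turn this observation into a fetal-fraction estimate is a two-component mixture. The sketch below is an assumption made for illustration and is not a formula given in this disclosure: it treats the plasma methylation level as a linear blend of a placental level and a buffy coat level, M_plasma = f * M_placenta + (1 - f) * M_buffy, and solves for f. The function name `fetal_fraction` is hypothetical.

```python
def fetal_fraction(plasma_level: float, placenta_level: float, buffy_level: float) -> float:
    """Estimate the placental (fetal) fraction f under a linear two-tissue mixture.

    Solves plasma = f * placenta + (1 - f) * buffy for f and clips to [0, 1].
    All levels must be in the same units (e.g., percent methylation).
    """
    if placenta_level == buffy_level:
        raise ValueError("tissue methylation levels must differ to resolve a mixture")
    f = (buffy_level - plasma_level) / (buffy_level - placenta_level)
    return min(1.0, max(0.0, f))
```

With the simulated medians quoted above (57.6% for placenta, 69.5% for buffy coat), a plasma level between those two values maps to a fraction between 0 and 1. A real application would need tissue reference levels measured over the same CpG sites as the plasma sample.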

To perform a cluster analysis of the placental and buffy coat DNA samples using the methylation profiles of CpG islands, the DNA methylation level of each CpG island was calculated as the proportion of CpG sites classified as methylated among all CpG sites in that island. For illustrative purposes, the cluster analysis was performed using the methylation levels of CpG island regions.
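The per-island calculation described above amounts to a ratio of counts. The following sketch assumes the per-site calls are available as `(island_id, is_methylated)` pairs; that input layout and the function name `island_methylation_levels` are illustrative choices, not part of this disclosure.

```python
from collections import defaultdict


def island_methylation_levels(calls):
    """Return {island_id: percent of CpG sites called methylated}.

    `calls` is an iterable of (island_id, is_methylated) pairs, one pair
    per CpG site observation on a sequenced molecule.
    """
    counts = defaultdict(lambda: [0, 0])  # island -> [methylated, total]
    for island, is_methylated in calls:
        counts[island][1] += 1
        if is_methylated:
            counts[island][0] += 1
    return {island: 100.0 * m / n for island, (m, n) in counts.items()}
```

The resulting dictionary, taken over a fixed ordering of islands, gives the per-sample vector that a clustering step can consume.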

FIG. 133 shows a cluster analysis of the placental and buffy coat samples using DNA methylation profiles determined by MspI-based targeted single-molecule real-time sequencing. The similarity of the CpG island methylation patterns across different patients is indicated by the height values of the clustering dendrogram. In this example, the height is calculated from the Euclidean distance. In one embodiment, a height cutoff of 100 can be used to divide the clustering tree into two groups, distinguishing placental samples from buffy coat samples with 100% sensitivity and specificity. In other embodiments, other height cutoffs can be used, including but not limited to 50, 60, 70, 80, 90, 120, 130, 140, and 150. FIG. 133 shows that the 10 placental DNA samples and the 10 buffy coat DNA samples clustered clearly into two separate groups using the CpG island methylation profiles determined by MspI-based single-molecule real-time sequencing according to this disclosure.
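Cutting a dendrogram at a height cutoff can be sketched with a small agglomerative routine. The text specifies Euclidean distance but not the linkage rule, so complete linkage is assumed here purely for illustration; `cluster_by_height` is a hypothetical name, and a production analysis would more likely use an established statistics package.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length methylation profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_by_height(profiles, cutoff):
    """Merge the closest clusters until the next merge height exceeds `cutoff`.

    Uses complete linkage (an assumption): the distance between two
    clusters is the largest pairwise distance between their members.
    Returns a list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        return max(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        (a, b), height = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if height > cutoff:
            break  # equivalent to cutting the tree at `cutoff`
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]
```

Each profile would be one sample's vector of CpG island methylation levels. A cutoff such as the 100 quoted above only makes sense relative to the scale and dimensionality of those vectors.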

V. Training and detection methods
This section gives examples of methods for training a machine learning model to detect base modifications and of methods for using a machine learning model to detect base modifications.

A. Model training
FIG. 102 shows an example method 1020 of detecting a modification of a nucleotide in a nucleic acid molecule. Example method 1020 may be a method of training a model for detecting the modification. The modification may include a methylation, which may be any methylation described herein. The modification may have discrete states, such as methylated and unmethylated, and may further specify the type of methylation. A nucleotide may therefore have more than two states (classifications).

At block 1022, a first plurality of first data structures is received. Various examples of data structures are described herein, e.g., in FIGS. 4-16. Each first data structure of the first plurality of first data structures may correspond to a respective window of nucleotides sequenced in a respective nucleic acid molecule of a plurality of first nucleic acid molecules. Each window associated with the first plurality of data structures may include 4 or more consecutive nucleotides, including 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, or more consecutive nucleotides. Each window may include the same number of consecutive nucleotides. Windows may overlap. Each window may include nucleotides on a first strand of the first nucleic acid molecule and nucleotides on a second strand of the first nucleic acid molecule. The first data structure may also include, for each nucleotide within the window, a value of a strand property. The strand property may indicate whether the nucleotide is present on the first strand or on the second strand. The window may include nucleotides of the second strand that are not complementary to the nucleotides at the corresponding positions of the first strand. In some embodiments, all nucleotides on the second strand are complementary to the nucleotides of the first strand. In some embodiments, each window may include nucleotides on only one strand of the first nucleic acid molecule.

The first nucleic acid molecules may be circular DNA molecules. A circular DNA molecule can be formed by cleaving a double-stranded DNA molecule with a Cas9 complex to form a cleaved double-stranded DNA molecule and ligating hairpin adapters onto the ends of the cleaved double-stranded DNA molecule. In embodiments, both ends of the double-stranded DNA molecule can be cleaved and ligated. For example, the cleavage, ligation, and subsequent analysis may proceed as described for FIG. 91.

The first plurality of first data structures may include 5,000 to 10,000; 10,000 to 50,000; 50,000 to 100,000; 100,000 to 200,000; 200,000 to 500,000; 500,000 to 1,000,000; or 1,000,000 or more first data structures. The plurality of first nucleic acid molecules may include at least 1,000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 5,000,000, or more nucleic acid molecules. As further examples, at least 10,000, 50,000, 100,000, 500,000, 1,000,000, or 5,000,000 sequence reads can be generated.

Each of the first nucleic acid molecules is sequenced by measuring pulses in a signal corresponding to the nucleotides. The signal may be a fluorescence signal or another type of optical signal (e.g., chemiluminescence, photometric). The signal may result from the nucleotide or from a tag associated with the nucleotide.

The modification has a known first state in the nucleotide at the target position in each window of each first nucleic acid molecule. The first state may be that the modification is absent in the nucleotide or that the modification is present in the nucleotide. The modification may be known to be absent from the first nucleic acid molecules, or the first nucleic acid molecules may be treated so that the modification is absent. The modification may be known to be present in the first nucleic acid molecules, or the first nucleic acid molecules may be treated so that the modification is present. If the first state is that the modification is absent, the modification may be absent throughout each window of each first nucleic acid molecule and not only at the target position. The known first state may include a methylated state for a first portion of the first data structures and an unmethylated state for a second portion of the first data structures.

The target position may be the center of the respective window. For a window spanning an even number of nucleotides, the target position may be the position immediately upstream or immediately downstream of the center of the window. In some embodiments, the target position may be at any other position of the respective window, including the first position or the last position. For example, if a window spans n nucleotides of one strand, from the 1st position to the nth position (in either the upstream or the downstream direction), the target position may be any position from the 1st to the nth.

Each first data structure includes values for properties within the window. The properties may be for each nucleotide within the window. The properties may include an identity of the nucleotide. The identity may include the base (e.g., A, T, C, or G). The properties may also include a position of the nucleotide relative to the target position within the respective window. For example, the position may be the distance in nucleotides from the target position. The position may be +1 when the nucleotide is one nucleotide away from the target position in one direction and -1 when the nucleotide is one nucleotide away from the target position in the opposite direction.

The properties may include a width of the pulse corresponding to the nucleotide. The pulse width may be the width of the pulse at half the maximum value of the pulse. The properties may further include an interpulse duration (IPD) representing the time between the pulse corresponding to the nucleotide and the pulse corresponding to a neighboring nucleotide. The interpulse duration may be the time between the maximum of the pulse associated with the nucleotide and the maximum of the pulse associated with the neighboring nucleotide. The neighboring nucleotide may be the adjacent nucleotide. The properties may also include a height of the pulse corresponding to each nucleotide within the window. The properties may further include a value of a strand property indicating whether the nucleotide is present on the first strand or the second strand of the first nucleic acid molecule. The indication of the strand may be similar to the matrix shown in FIG. 6.
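The per-nucleotide properties listed above (base identity, position relative to the target, IPD, pulse width, strand) can be gathered into one window record. The sketch below is only one plausible layout and is not the data structure of FIGS. 4-16: it assumes an odd-length window centered on the target and a single strand per call, and the names `NucleotideFeatures` and `build_window` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NucleotideFeatures:
    base: str      # nucleotide identity: A, C, G, or T
    rel_pos: int   # signed distance from the target position (0 = target)
    ipd: float     # interpulse duration for this nucleotide
    pw: float      # pulse width for this nucleotide
    strand: int    # 0 = first strand, 1 = second strand


def build_window(bases, ipds, pws, target, flank, strand=0):
    """Collect features for `flank` nucleotides on each side of `target`.

    Returns None when the window would run off either end of the molecule.
    """
    lo, hi = target - flank, target + flank
    if lo < 0 or hi >= len(bases):
        return None
    return [
        NucleotideFeatures(bases[i], i - target, ipds[i], pws[i], strand)
        for i in range(lo, hi + 1)
    ]
```

A two-strand window could be formed by concatenating the records from two calls, one per strand, with `strand` set to 0 and 1 respectively.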

Each data structure of the first plurality of first data structures may exclude first nucleic acid molecules having an IPD or a width below a cutoff value. For example, only first nucleic acid molecules with an IPD value greater than the 10th percentile (or the 1st, 5th, 15th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th, or 95th percentile) may be used. The percentile may be based on data from all nucleic acid molecules in a reference sample or in multiple reference samples. The cutoff value for the width may likewise correspond to a percentile.
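A percentile-based exclusion of this kind can be sketched as follows. The text does not say how a molecule's IPD value is summarized or which percentile definition is used, so this example assumes one summary IPD per molecule (stored under a hypothetical `"ipd"` key) and the nearest-rank percentile; `percentile` and `filter_by_ipd` are illustrative names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of `values` for 0 < pct <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def filter_by_ipd(molecules, reference_ipds, pct=10):
    """Keep molecules whose IPD exceeds the `pct`-th percentile of a reference set."""
    cutoff = percentile(reference_ipds, pct)
    return [m for m in molecules if m["ipd"] > cutoff]
```

The same helper could be reused for a pulse-width cutoff by filtering on a width field instead.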

At block 1024, a plurality of first training samples is stored. Each first training sample includes one of the first plurality of first data structures and a first label indicating the first state of the modification of the nucleotide at the target position.

At block 1026, a second plurality of second data structures is received. Block 1026 may be optional. Each second data structure of the second plurality of second data structures corresponds to a respective window of nucleotides sequenced in a respective nucleic acid molecule of a plurality of second nucleic acid molecules. The plurality of second nucleic acid molecules may be the same as or different from the plurality of first nucleic acid molecules. The modification has a known second state in the nucleotide at the target position within each window of each second nucleic acid molecule. The second state is a different state from the first state. For example, if the modification is present in the first state, the modification is absent in the second state, and vice versa. Each second data structure includes values for the same properties as the first plurality of first data structures.

The plurality of first training samples can be generated using multiple displacement amplification (MDA). In some embodiments, the plurality of first training samples can be generated by amplifying a first plurality of nucleic acid molecules using a set of nucleotides. The set of nucleotides may include a first type of methylated nucleotide (e.g., 6mA or any other methylation [e.g., CpG]) at a specified ratio. The specified ratio, relative to unmethylated nucleotides, may include 1:10, 1:100, 1:1,000, 1:10,000, 1:100,000, or 1:1,000,000. The plurality of second nucleic acid molecules can be generated using multiple displacement amplification with unmethylated nucleotides of the first type.

At block 1028, a plurality of second training samples is stored. Block 1028 may be optional. Each second training sample includes one of the second plurality of second data structures and a second label indicating the second state of the modification of the nucleotide at the target position.

At block 1029, a model is trained using the plurality of first training samples and, optionally, the plurality of second training samples. The training optimizes parameters of the model based on whether the outputs of the model match the corresponding first labels (and, optionally, second labels) when the first plurality of first data structures (and, optionally, the second plurality of second data structures) are input to the model. An output of the model specifies whether the nucleotide at the target position in the respective window has the modification. The method may use only the plurality of first training samples, because the model can identify an outlier as being in a state different from the first state. The model may be a statistical model, also referred to as a machine learning model.

In some embodiments, the output of the model may include a probability for each of a plurality of states. The state with the highest probability can be taken as the state of the nucleotide.
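Choosing the most probable state can be sketched in a few lines. The softmax normalization below is an assumption about how raw model scores might be turned into probabilities; the text only states that the output may include a probability per state. The names `softmax` and `call_state` and the example state labels are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def call_state(scores, states):
    """Return (state, probability) for the highest-probability state."""
    probs = softmax(scores)
    best = max(range(len(states)), key=probs.__getitem__)
    return states[best], probs[best]
```

For a three-state classifier one might call `call_state(scores, ("unmethylated", "5mC", "5hmC"))`; a two-state classifier works the same way with two labels.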

The model may include a convolutional neural network (CNN). The CNN may include a set of convolutional filters configured to filter the first plurality of data structures and, optionally, the second plurality of data structures. The filters may be any filters described herein. The number of filters in each layer may be 10 to 20, 20 to 30, 30 to 40, 40 to 50, 50 to 60, 60 to 70, 70 to 80, 80 to 90, 90 to 100, 100 to 150, 150 to 200, or more. The kernel size of a filter may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15 to 20, 20 to 30, 30 to 40, or more. The CNN may include an input layer configured to receive the filtered first plurality of data structures and, optionally, the filtered second plurality of data structures. The CNN may also include a plurality of hidden layers including a plurality of nodes, with the first layer of the plurality of hidden layers coupled to the input layer. The CNN may further include an output layer coupled to the last layer of the plurality of hidden layers and configured to output an output data structure. The output data structure may include the properties.
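To make the role of a convolutional filter concrete, the sketch below applies a single filter along the position axis of a window matrix. It is a bare forward pass written without any deep learning library and is not the network of this disclosure: the features-by-positions layout, the "valid" (no padding) sliding, and the ReLU activation are all assumptions, and `conv1d` is a hypothetical name.

```python
def conv1d(matrix, kernel, bias=0.0):
    """Slide one filter along the positions of a feature matrix.

    `matrix` is features x positions (e.g., rows for IPD, PW, encoded base).
    `kernel` is features x kernel_size with one weight per (feature, offset).
    Returns one ReLU-activated value per valid window start.
    """
    n_features = len(matrix)
    width = len(matrix[0])
    k = len(kernel[0])
    outputs = []
    for start in range(width - k + 1):
        total = bias
        for f in range(n_features):
            for j in range(k):
                total += matrix[f][start + j] * kernel[f][j]
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs
```

A layer with, say, 64 filters of kernel size 4 would repeat this computation with 64 different weight sets; training adjusts those weights.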

The model may include a supervised learning model. Supervised learning models may include different approaches and algorithms, including analytical learning, artificial neural networks, backpropagation, boosting (meta-algorithm), Bayesian statistics, case-based reasoning, decision tree learning, inductive logic programming, Gaussian process regression, genetic programming, the group method of data handling, kernel estimators, learning automata, learning classifier systems, minimum message length (decision trees, decision graphs, etc.), multilinear subspace learning, naive Bayes classifiers, maximum entropy classifiers, conditional random fields, nearest neighbor algorithms, probably approximately correct (PAC) learning, ripple-down rules, knowledge acquisition methodologies, symbolic machine learning algorithms, subsymbolic machine learning algorithms, support vector machines, minimum complexity machines (MCM), random forests, ensembles of classifiers, ordinal classification, data preprocessing, handling of imbalanced datasets, statistical relational learning, or Proaftn, a multicriteria classification algorithm. The model may be linear regression, logistic regression, a deep recurrent neural network (e.g., long short-term memory, LSTM), a Bayes classifier, a hidden Markov model (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), a random forest algorithm, a support vector machine (SVM), or any model described herein.

As part of training a machine learning model, the parameters of the model (such as weights and thresholds, e.g., as may be used in the activation functions of a neural network) are optimized on the training samples (the training set) so as to maximize accuracy in classifying the modification of the nucleotide at the target position. Various forms of optimization can be performed, e.g., backpropagation, empirical risk minimization, and structural risk minimization. A validation set of samples (data structures and labels) can be used to validate the accuracy of the model. Cross-validation can be performed using different portions of the training set for training and for validation. The model can include multiple submodels, thereby providing an ensemble model. The submodels may be weaker models that, once combined, provide a more accurate final model.
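The cross-validation idea, using different portions of the training set for training and validation, can be sketched with a k-fold splitter. The fold count, the shuffling seed, and the name `k_fold_splits` are illustrative assumptions; the text does not prescribe a particular scheme.

```python
import random


def k_fold_splits(samples, k, seed=0):
    """Yield (train, validation) lists so that each sample is validated exactly once."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield [samples[t] for t in train], [samples[v] for v in validation]
```

Each sample here would be one (data structure, label) pair. In practice, windows from the same molecule should probably be kept in the same fold so that validation accuracy is not inflated by near-duplicate inputs.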

In some embodiments, chimeric or hybrid nucleic acid molecules can be used to validate the model. At least some of the plurality of first nucleic acid molecules each include a first portion corresponding to a first reference sequence and a second portion corresponding to a second reference sequence. The first reference sequence may be from a different chromosome, tissue (e.g., tumor or non-tumor), organism, or species than the second reference sequence. The first reference sequence may be human, and the second reference sequence may be from a different animal. Each chimeric nucleic acid molecule may include a first portion corresponding to the first reference sequence and a second portion corresponding to the second reference sequence. The first portion may have a first methylation pattern, and the second portion may have a second methylation pattern. The first portion may be treated with a methylase. The second portion may not be treated with the methylase and may correspond to an unmethylated portion of the second reference sequence.

B. Detection of modifications
FIG. 103 shows a method 1030 for detecting a modification of a nucleotide in a nucleic acid molecule. The modification may be any modification described for method 1020 of FIG. 102.

At block 1032, an input data structure is received. The input data structure may correspond to a window of nucleotides sequenced in a sample nucleic acid molecule. The sample nucleic acid molecule can be sequenced by measuring pulses in an optical signal corresponding to the nucleotides. The window may be any window described for block 1022 of FIG. 102, and the sequencing may be any sequencing described for block 1022 of FIG. 102. The input data structure may include values for the same properties as described for block 1022 of FIG. 102. Method 1030 may include sequencing the sample nucleic acid molecule.

Nucleotides within the window may or may not be aligned to a reference genome. Nucleotides within the window can be determined using a circular consensus sequence (CCS) without aligning the sequenced nucleotides to a reference genome. The nucleotides of each window may be identified by CCS rather than by alignment to a reference genome. In some embodiments, the window can be determined without CCS and without aligning the sequenced nucleotides to a reference genome.

Nucleotides within the window may be enriched or filtered. Enrichment may be by an approach involving Cas9. Similar to FIG. 91, the Cas9 approach may include cleaving double-stranded DNA molecules with a Cas9 complex to form cleaved double-stranded DNA molecules and ligating hairpin adapters onto the ends of the cleaved double-stranded DNA molecules. Filtering may be by selecting double-stranded DNA molecules having a size within a size range. The nucleotides may be from these double-stranded DNA molecules. Other methods that preserve the methylation state of the molecules may be used (e.g., methyl-binding proteins).

At block 1034, the input data structure is input into the model. The model may be trained by method 1020 of FIG. 102.

In some embodiments, chimeric nucleic acid molecules may be used to validate the model. At least some of the plurality of first nucleic acid molecules each include a first portion corresponding to a first reference sequence and a second portion corresponding to a second reference sequence that is different from the first reference sequence. The first reference sequence may be from a different chromosome, tissue (e.g., tumor or non-tumor), organelle (e.g., mitochondrion, nucleus, chloroplast), organism (e.g., mammal, virus, bacterium), or species than the second reference sequence. The first reference sequence may be human, and the second reference sequence may be from a different animal. Each chimeric nucleic acid molecule may include a first portion corresponding to the first reference sequence and a second portion corresponding to the second reference sequence. The first portion may have a first methylation pattern, and the second portion may have a second methylation pattern. The first portion may be treated with a methylase. The second portion may be left untreated with the methylase and may correspond to an unmethylated portion of the second reference sequence.

At block 1036, the model is used to determine whether the modification is present in the nucleotide at the target position within the window of the input data structure.

The input data structure may be one input data structure of a plurality of input data structures. Each input data structure may correspond to a respective window of nucleotides sequenced in a respective sample nucleic acid molecule of a plurality of sample nucleic acid molecules. The plurality of sample nucleic acid molecules may be obtained from a biological sample of a subject. The biological sample may be any biological sample described herein. Method 1030 may be repeated for each input data structure. The method may include receiving the plurality of input data structures. The plurality of input data structures may be input into the model. Whether a modification is present in the nucleotide at the target position within the respective window of each input data structure may be determined using the model.
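The repetition of blocks 1032 to 1036 over a plurality of input data structures can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained model of method 1020: the names `call_modifications` and `toy_model`, the representation of a window as a flat list of feature values, and the 0.5 decision threshold are all hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: one window = a flat sequence of feature values.
Window = Sequence[float]


def call_modifications(windows: Sequence[Window],
                       model: Callable[[Window], float],
                       threshold: float = 0.5) -> List[bool]:
    """Return one call per window: True if the model's score for the
    target position reaches the (assumed) decision threshold."""
    return [model(window) >= threshold for window in windows]


def toy_model(window: Window) -> float:
    """Stand-in for a trained model: the mean of the feature values."""
    return sum(window) / len(window)


calls = call_modifications([[0.9, 0.7], [0.1, 0.2]], toy_model)
```

Any callable that maps a window to a score could be substituted for `toy_model`.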

Each sample nucleic acid molecule of the plurality of sample nucleic acid molecules may have a size greater than a cutoff size. For example, the cutoff size may be 100 bp, 200 bp, 300 bp, 400 bp, 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 9 kb, 10 kb, 20 kb, 30 kb, 40 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, 500 kb, or 1 Mb. A size cutoff may result in a higher subread depth, and either the larger size or the higher subread depth may increase the accuracy of modification detection. In some embodiments, the method may include fractionating the DNA molecules by size before sequencing them.
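The size cutoff can be illustrated with a short filter. This is a sketch only; the function name `filter_by_size`, the dictionary of molecule identifiers to sizes, and the 1 kb default are assumptions chosen for illustration (1 kb is one of the listed cutoff values).

```python
from typing import Dict


def filter_by_size(molecule_sizes_bp: Dict[str, int],
                   cutoff_bp: int = 1000) -> Dict[str, int]:
    """Keep only molecules whose size is strictly greater than the cutoff."""
    return {name: size
            for name, size in molecule_sizes_bp.items()
            if size > cutoff_bp}


kept = filter_by_size({"m1": 500, "m2": 1500, "m3": 1000}, cutoff_bp=1000)
```

A molecule exactly at the cutoff is excluded here because the text describes sizes greater than the cutoff.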

The plurality of sample nucleic acid molecules may align to a plurality of genomic regions. For each genomic region of the plurality of genomic regions, a number of sample nucleic acid molecules may align to the genomic region. The number of sample nucleic acid molecules may be greater than a cutoff number. The cutoff number may be a subread depth cutoff. The subread depth cutoff may be 1×, 10×, 30×, 40×, 50×, 60×, 70×, 80×, 900×, 100×, 200×, 300×, 400×, 500×, 600×, 700×, or 800×. The subread depth cutoff may be chosen to improve or optimize accuracy. The subread depth cutoff may be related to the number of genomic regions; for example, the higher the subread depth cutoff, the fewer genomic regions pass it.
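The trade-off between the depth cutoff and the number of retained genomic regions can be sketched as below. The sketch assumes, for illustration only, that each aligned molecule is summarized by a region label and that depth is simply the count of molecules per region; `regions_passing_depth` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List


def regions_passing_depth(molecule_regions: Iterable[str],
                          cutoff: int) -> List[str]:
    """Return the regions covered by more molecules than the cutoff."""
    depth = Counter(molecule_regions)
    return sorted(region for region, n in depth.items() if n > cutoff)


aligned = ["chr1:0-1000"] * 5 + ["chr2:0-1000"] * 2
low_cutoff = regions_passing_depth(aligned, cutoff=1)
high_cutoff = regions_passing_depth(aligned, cutoff=3)
```

Raising the cutoff from 1 to 3 drops the second region, which mirrors the statement that a higher cutoff leaves fewer genomic regions.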

The modification may be determined to be present at one or more nucleotides. A classification of a disorder may be determined using the presence of the modification at the one or more nucleotides. The classification of the disorder may include using the number of modifications, which may be compared to a threshold. Alternatively or additionally, the classification may use the locations of the one or more modifications. The locations of the one or more modifications may be determined by aligning sequence reads of the nucleic acid molecules to a reference genome. The disorder may be determined if modifications are shown at particular locations known to be correlated with the disorder. For example, a pattern of methylated sites may be compared to a reference pattern for the disorder, and the disorder may be determined based on the comparison. A match with the reference pattern, or a substantial match (e.g., 80%, 90%, or 95% or more), may indicate the disorder or a high likelihood of the disorder. The disorder may be cancer or any disorder described herein (e.g., a pregnancy-associated disorder or an autoimmune disease).
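One way to read "substantial match with the reference pattern" is as the fraction of shared sites whose methylation status agrees. The sketch below assumes that reading; the function names, the mapping of site position to a boolean status, and the 80% default are illustrative choices rather than the specification's definition.

```python
from typing import Dict


def pattern_match_fraction(observed: Dict[int, bool],
                           reference: Dict[int, bool]) -> float:
    """Fraction of sites present in both patterns whose status agrees."""
    shared = observed.keys() & reference.keys()
    if not shared:
        return 0.0
    agreeing = sum(observed[site] == reference[site] for site in shared)
    return agreeing / len(shared)


def suggests_disorder(observed: Dict[int, bool],
                      reference: Dict[int, bool],
                      min_match: float = 0.8) -> bool:
    """True if the observed pattern substantially matches the reference."""
    return pattern_match_fraction(observed, reference) >= min_match


reference = {101: True, 150: True, 230: False, 310: True, 402: False}
observed = {101: True, 150: True, 230: False, 310: True, 402: True}
```

Here four of the five shared sites agree, so the pattern passes an 80% threshold but not a 90% one.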

A statistically significant number of nucleic acid molecules may be analyzed to provide an accurate determination of the disorder, tissue of origin, or clinically relevant DNA fraction. In some embodiments, at least 1,000 nucleic acid molecules are analyzed. In other embodiments, at least 10,000, 50,000, 100,000, 500,000, 1,000,000, or 5,000,000 nucleic acid molecules, or more, may be analyzed. As a further example, at least 10,000, 50,000, 100,000, 500,000, 1,000,000, or 5,000,000 sequence reads may be generated.

Classifying the disorder may include determining that the subject has the disorder. The classification may include a level of the disorder, determined using the number of modifications and/or the sites of the modifications.

A clinically relevant DNA fraction, a fetal methylation profile, a maternal methylation profile, the presence of an imprinted gene region, or a tissue of origin (e.g., from a sample containing a mixture of different cell types) may be determined using the presence of the modification at the one or more nucleotides. Clinically relevant DNA fractions include, but are not limited to, a fetal DNA fraction, a tumor DNA fraction (e.g., from a sample containing a mixture of tumor and non-tumor cells), and a transplant DNA fraction (e.g., from a sample containing a mixture of donor and recipient cells).

The method may further include treating the disorder. Treatment may be provided according to the determined level of the disorder, the identified modifications, and/or the tissue of origin (e.g., of tumor cells isolated from the circulation of a cancer patient). For example, an identified modification may be targeted with a particular drug or chemotherapy. The tissue of origin may be used to guide surgery or any other form of treatment. The level of the disorder may be used to determine how aggressive to be with any type of treatment.

Embodiments may include treating the disorder in the patient after determining the level of the disorder in the patient. Treatment may include any suitable therapy, drug, chemotherapy, radiation, or surgery, including any treatment described in a reference mentioned herein. Information on treatments in the references is incorporated herein by reference.

VI. Haplotype Analysis
Differences in the methylation profiles of the two haplotypes were found in tumor tissue samples. Methylation imbalance between haplotypes may therefore be used to determine a classification of a level of cancer or another disorder. Haplotype imbalance may also be used to identify which haplotype a fetus has inherited. Fetal disorders may also be identified by analyzing methylation imbalance between haplotypes. Cellular DNA may be used to analyze the methylation levels of the haplotypes.

A. Haplotype-associated methylation analysis
Single-molecule real-time sequencing technology allows individual SNPs to be identified. The long reads (e.g., up to several kilobases) generated from single-molecule real-time sequencing wells allow variants in the genome to be phased by leveraging the haplotype information present in each consensus read (Edge et al. Genome Res. 2017;27:801-812; Wenger et al. Nat Biotechnol. 2019;37:1155-1162). The methylation profile of a haplotype can be analyzed from the methylation levels of the CpG sites that the CCS links to the alleles on the respective haplotype, as shown in FIG. 77. This phased methylation haplotype analysis can be used to address whether the two copies of homologous chromosomes share similar or different methylation patterns in different clinically relevant conditions, such as cancer. In one embodiment, the methylation of a haplotype would be the aggregated methylation level contributed by a number of DNA fragments assigned to that haplotype. A haplotype may be a block of various sizes, including but not limited to 50 nt, 100 nt, 200 nt, 300 nt, 400 nt, 500 nt, 1 knt, 2 knt, 3 knt, 4 knt, 5 knt, 10 knt, 20 knt, 30 knt, 40 knt, 50 knt, 100 knt, 200 knt, 300 knt, 400 knt, 500 knt, 1 Mnt, 2 Mnt, and 3 Mnt.
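The aggregated methylation level of a haplotype can be sketched as a pooled ratio over the fragments assigned to it. The pooling rule (methylated CpG calls divided by all CpG calls across fragments) is an assumption made for illustration; the text states only that the level is aggregated from several fragments. The name `haplotype_methylation_level` is hypothetical.

```python
from typing import Iterable, Tuple


def haplotype_methylation_level(
        fragments: Iterable[Tuple[int, int]]) -> float:
    """Pool per-fragment (methylated CpG count, total CpG count) pairs
    into one methylation level between 0 and 1."""
    pairs = list(fragments)  # allow generators to be passed in
    methylated = sum(m for m, _ in pairs)
    total = sum(t for _, t in pairs)
    if total == 0:
        raise ValueError("no CpG sites covered by the assigned fragments")
    return methylated / total


# Two fragments assigned to one haplotype: 3 of 4 and 1 of 4 CpGs methylated.
level = haplotype_methylation_level([(3, 4), (1, 4)])
```

Pooling counts weights each CpG call equally; averaging per-fragment levels would instead weight each fragment equally.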

B. Relative haplotype-based methylation imbalance analysis
FIG. 104 shows a relative haplotype-based methylation imbalance analysis. The haplotypes (i.e., Hap I and Hap II) were determined by analyzing single-molecule real-time sequencing results. The methylation pattern linked to each haplotype can be determined using the haplotype-associated fragments whose methylation profiles were determined according to the approach described in FIG. 77. The methylation patterns of Hap I and Hap II can thereby be compared.

To quantify the difference in methylation between Hap I and Hap II, the difference in methylation level (ΔF) between Hap I and Hap II was calculated as follows:

ΔF = M_{Hap I} − M_{Hap II}

where ΔF represents the difference in methylation level between Hap I and Hap II, and M_{Hap I} and M_{Hap II} represent the methylation levels of Hap I and Hap II, respectively. A positive value of ΔF suggests a higher DNA methylation level for Hap I than for Hap II.
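As a small worked example of the ΔF definition, the following sketch computes the difference from two methylation levels expressed as fractions. The function name `delta_f` and the example values are illustrative assumptions.

```python
def delta_f(m_hap1: float, m_hap2: float) -> float:
    """Methylation level of Hap I minus that of Hap II.
    A positive result means Hap I is the more methylated haplotype."""
    return m_hap1 - m_hap2


# Example values only: Hap I at 75% and Hap II at 25% methylation.
difference = delta_f(0.75, 0.25)
```

Swapping the arguments flips the sign, so the labeling of Hap I and Hap II must be kept consistent across samples being compared.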

C. Relative haplotype-based methylation imbalance analysis of HCC tumor DNA
In one embodiment, haplotype methylation analysis may be useful for detecting methylation aberrations in cancer genomes. For example, the change in methylation between two haplotypes within a genomic region is analyzed. The haplotypes within a genomic region are defined as a haplotype block. A haplotype block can be regarded as a set of alleles on a chromosome that have been phased. In some embodiments, a haplotype block is extended as long as possible according to the set of sequence information supporting that two alleles are physically linked on a chromosome. For case 3033, 97,475 haplotype blocks were obtained from the sequencing results of the adjacent normal tissue DNA. The median haplotype block size was 2.8 kb. Of the haplotype blocks, 25% were larger than 8.2 kb. The maximum haplotype block size was 282.2 kb. The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 1.0.

For illustration, several criteria were used to identify potential haplotype blocks that exhibited differential methylation between Hap I and Hap II in the tumor DNA compared with the adjacent non-tumor tissue DNA. The criteria were: (1) the haplotype block being analyzed contained at least three CCS sequences, generated from three respective sequencing wells; (2) the absolute difference in methylation level between Hap I and Hap II in the adjacent non-tumor tissue DNA was less than 5%; and (3) the absolute difference in methylation level between Hap I and Hap II in the tumor tissue DNA was greater than 30%. Seventy-three haplotype blocks fulfilling these criteria were identified.
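The three criteria can be expressed as a single predicate over a haplotype block. This is a sketch under assumptions: the `HaplotypeBlock` record and its field names are hypothetical, methylation levels are given in percent, and criterion (1) is simplified to a count of CCS sequences without tracking which well each came from.

```python
from dataclasses import dataclass


@dataclass
class HaplotypeBlock:
    block_id: str
    n_ccs: int             # CCS sequences covering the block
    nontumor_hap1: float   # methylation level, percent
    nontumor_hap2: float
    tumor_hap1: float
    tumor_hap2: float


def is_tumor_imbalanced(block: HaplotypeBlock,
                        min_ccs: int = 3,
                        max_nontumor_diff: float = 5.0,
                        min_tumor_diff: float = 30.0) -> bool:
    """Apply criteria (1) to (3) using absolute Hap I / Hap II differences."""
    nontumor_diff = abs(block.nontumor_hap1 - block.nontumor_hap2)
    tumor_diff = abs(block.tumor_hap1 - block.tumor_hap2)
    return (block.n_ccs >= min_ccs
            and nontumor_diff < max_nontumor_diff
            and tumor_diff > min_tumor_diff)


example = HaplotypeBlock("block_1", 4, 72.0, 70.0, 60.0, 20.0)
```

The example block has a 2% difference in non-tumor tissue and a 40% difference in tumor tissue, so it passes; the values are invented for illustration and are not taken from FIGS. 105A and 105B.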

FIGS. 105A and 105B are tables of the 73 haplotype blocks showing differential methylation levels between Hap I and Hap II in the HCC tumor DNA compared with the adjacent non-tumor tissue DNA for case TBR3033. In order, the columns give: the chromosome associated with the haplotype block; the start coordinate of the haplotype block within the chromosome; the end coordinate of the haplotype block; the length of the haplotype block; the haplotype block ID; the methylation level of Hap I in the non-tumor tissue adjacent to the tumor tissue; the methylation level of Hap II in the non-tumor tissue; the methylation level of Hap I in the tumor tissue; and the methylation level of Hap II in the tumor tissue.

In contrast to the 73 haplotype blocks showing a difference of more than 30% in methylation level between haplotypes in the tumor tissue DNA, only one haplotype block showed a difference of more than 30% in the non-tumor tissue DNA but less than 5% in the tumor tissue DNA. In some embodiments, another set of criteria may be used to identify haplotype blocks exhibiting differential methylation. Other maximum and minimum threshold differences may be used. For example, the minimum threshold difference may be 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more. As examples, the maximum threshold difference may be 1%, 5%, 10%, 15%, 20%, or 30%. These results suggest that variation in the methylation difference between haplotypes may serve as a new biomarker for cancer diagnosis, detection, monitoring, prognostication, and treatment guidance.

In some embodiments, when studying methylation patterns, long haplotype blocks are split in silico into smaller blocks.
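In silico splitting of a long haplotype block can be sketched as cutting its coordinate range into consecutive pieces of bounded length. The half-open coordinate convention, the function name `split_block`, and the fixed maximum length are assumptions; the text does not say how split points are chosen.

```python
from typing import List, Tuple


def split_block(start: int, end: int, max_len: int) -> List[Tuple[int, int]]:
    """Split the half-open interval [start, end) into consecutive
    sub-blocks no longer than max_len; the last one may be shorter."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [(s, min(s + max_len, end)) for s in range(start, end, max_len)]


sub_blocks = split_block(0, 2500, 1000)
```

A real pipeline would likely also require each sub-block to keep enough CpG sites and phased reads for a stable methylation estimate.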

For case 3032, 61,958 haplotype blocks were obtained from the sequencing results of the adjacent non-tumor tissue DNA. The median haplotype block size was 9.3 kb. Of the haplotype blocks, 25% were larger than 27.6 kb. The maximum haplotype block size was 717.8 kb. As an example, the same three criteria described above were used to identify potential haplotype blocks that exhibited differential methylation between Hap I and Hap II in the tumor DNA compared with the adjacent normal tissue DNA. Twenty haplotype blocks fulfilling these criteria were identified. The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 1.0.

FIG. 106 is a table of the 20 haplotype blocks showing differential methylation levels between Hap I and Hap II in the tumor DNA compared with the adjacent normal tissue DNA for case TBR3032. In order, the columns give: the chromosome associated with the haplotype block; the start coordinate of the haplotype block within the chromosome; the end coordinate of the haplotype block; the length of the haplotype block; the haplotype block ID; the methylation level of Hap I in the non-tumor tissue adjacent to the tumor tissue; the methylation level of Hap II in the non-tumor tissue; the methylation level of Hap I in the tumor tissue; and the methylation level of Hap II in the tumor tissue.

In contrast to the 20 haplotype blocks in FIG. 106 showing a difference in the HCC tumor tissue, only one haplotype block showed a difference of more than 30% in the non-tumor tissue but less than 5% in the tumor tissue. These results further suggest that variation in the methylation difference between haplotypes may serve as a new biomarker for cancer diagnosis, detection, monitoring, prognostication, and treatment guidance. In other embodiments, other criteria may be used to identify haplotype blocks exhibiting differential methylation.

D. Relative haplotype-based methylation imbalance analysis of DNA from other tumor types
As described above, analysis of methylation levels between haplotypes revealed that the HCC tumor tissues had more haplotype blocks exhibiting methylation imbalance than the paired adjacent non-tumor tissues. As one example, the criteria for a haplotype block showing methylation imbalance in tumor tissue were: (1) the haplotype block being analyzed contained at least three CCS sequences generated from three sequencing wells; (2) the absolute difference in methylation level between Hap I and Hap II in the adjacent non-tumor tissue DNA, or in normal tissue DNA based on historical data, was less than 5%; and (3) the absolute difference in methylation level between Hap I and Hap II in the tumor tissue DNA was greater than 30%. Criterion (2) was included because non-tumor or normal tissue showing haplotype imbalance in methylation level may indicate an imprinted region rather than a tumor-associated region.

The criteria for a haplotype block showing methylation imbalance in non-tumor tissue were: (1) the haplotype block being analyzed contained at least three CCS sequences generated from three sequencing wells; (2) the absolute difference in methylation level between Hap I and Hap II in the adjacent non-tumor tissue DNA, or in normal tissue DNA based on historical data, was greater than 30%; and (3) the absolute difference in methylation level between Hap I and Hap II in the tumor tissue DNA was less than 5%.

In other embodiments, other criteria may be used. For example, to identify imbalanced haplotypes in the cancer genome, the difference in methylation level between Hap I and Hap II in the non-tumor tissue may be less than 1%, 5%, 10%, 20%, 40%, 50%, or 60%, etc., while the difference in methylation level between Hap I and Hap II in the tumor tissue may be greater than 1%, 5%, 10%, 20%, 40%, 50%, or 60%, etc. To identify imbalanced haplotypes in the non-cancer genome, the difference in methylation level between Hap I and Hap II in the non-tumor tissue may be greater than 1%, 5%, 10%, 20%, 40%, 50%, or 60%, etc., while the difference in methylation level between Hap I and Hap II in the tumor tissue may be less than 1%, 5%, 10%, 20%, 40%, 50%, or 60%, etc.
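Because the thresholds are adjustable, the two sets of criteria can be sketched as one two-sided classifier followed by a tally per category. The function names, the category labels, and the input format (pairs of absolute Hap I/Hap II differences in percent for non-tumor and tumor tissue) are assumptions for illustration; the defaults of 5% and 30% follow the example criteria in the text.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_block(nontumor_diff: float, tumor_diff: float,
                   low: float = 5.0, high: float = 30.0) -> str:
    """Label a block by where its haplotype methylation imbalance appears."""
    if nontumor_diff < low and tumor_diff > high:
        return "tumor-imbalanced"
    if nontumor_diff > high and tumor_diff < low:
        return "non-tumor-imbalanced"
    return "neither"


def tally_blocks(diffs: Iterable[Tuple[float, float]],
                 low: float = 5.0, high: float = 30.0) -> Counter:
    """Count blocks per label for a list of (non-tumor, tumor) differences."""
    return Counter(classify_block(n, t, low, high) for n, t in diffs)


# Invented example differences, in percent.
counts = tally_blocks([(2.0, 45.0), (40.0, 1.0), (10.0, 20.0), (3.0, 31.0)])
```

Comparing the "tumor-imbalanced" and "non-tumor-imbalanced" counts gives the kind of per-sample summary that the tables in this section report.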

FIG. 107A is a table summarizing the number of haplotype blocks showing methylation imbalance between the two haplotypes in tumors and in adjacent non-tumor tissues, based on data generated with the Sequel II Sequencing Kit 2.0. The first column lists the tissue type. The second column lists the number of haplotype blocks showing methylation imbalance between the two haplotypes in the tumor tissue. The third column lists the number of haplotype blocks showing methylation imbalance between the two haplotypes in the paired adjacent non-tumor tissue. The rows show that more haplotype blocks exhibit methylation imbalance between the two haplotypes in the tumor tissue than in the paired adjacent non-tumor tissue.

The median length of the haplotype blocks included in this analysis was 15.7 kb (IQR: 10.3 to 26.1 kb). Together with the HCC results for liver, these data show that across seven tissue types the tumor tissue has more haplotype blocks with methylation imbalance. In addition to liver, the tissues include colon, breast, kidney, lung, prostate, and stomach. Thus, in some embodiments, the number of haplotype blocks with methylation imbalance may be used to detect whether a patient has a tumor or cancer.

FIG. 107B is a table summarizing the number of haplotype blocks showing methylation imbalance between the two haplotypes in tumor tissues of different tumor stages, based on data generated with the Sequel II Sequencing Kit 2.0. The first column shows the tissue type with the tumor. The second column shows the number of haplotype blocks with methylation imbalance between the two haplotypes in the tumor tissue. The third column lists tumor staging information using the TNM classification of malignant tumors. T3 and T3a tumors are larger than T2 tumors.

The table shows that, for both breast and kidney, larger tumors have more haplotype blocks exhibiting methylation imbalance. For example, for breast tissue, the tissue classified as tumor grade T3 (TNM staging), ER positive, and exhibiting ERBB2 amplification had more haplotype blocks showing methylation imbalance (57) than the tissue classified as tumor grade T2 (TNM staging), PR (progesterone receptor)/ER (estrogen receptor) positive, and without ERBB2 amplification (18). For kidney tissue, the tissue classified as tumor grade T3a had more haplotype blocks showing methylation imbalance (68) than the tissue classified as tumor grade T2 (0).

In some embodiments, haplotype blocks exhibiting methylation imbalance may be used to classify tumors and to correlate with their clinical behavior (e.g., progression, prognosis, or treatment response). These data suggest that the degree of haplotype-based methylation imbalance may serve as a classifier of tumors and may be incorporated into clinical studies, trials, or eventual clinical services. The classification of a tumor may include its size and severity.

E. Haplotype-based methylation analysis of maternal plasma cell-free DNA
The haplotypes of both parents, or of either parent, may be determined. Haplotyping methods include long-read single-molecule sequencing, linked short-read sequencing (e.g., 10x Genomics), long-range single-molecule PCR, and population inference. If the paternal haplotypes are known, the cell-free fetal DNA methylome may be assembled by linking the methylation profiles of multiple cell-free DNA molecules, each containing at least one paternal-specific SNP allele present along the paternal haplotype. In other words, the paternal haplotype is used as a scaffold to link fetal-specific read sequences.

FIG. 108 shows an analysis of haplotypes for relative methylation imbalance. If the maternal haplotypes are known, the methylation imbalance between the two haplotypes (i.e., Hap I and Hap II) may be used to determine which maternal haplotype the fetus has inherited. As shown in FIG. 108, plasma DNA molecules from a pregnant woman are sequenced using single-molecule real-time sequencing technology. Methylation and allelic information may be determined according to the disclosure herein. In one embodiment, the SNPs linked to a disease-causing gene are assigned to Hap I. If the fetus has inherited Hap I, more fragments carrying Hap I alleles than fragments carrying Hap II alleles would be present in the maternal plasma. The hypomethylation of fetal-derived DNA fragments would lower the methylation level of Hap I relative to that of Hap II. As a result, if Hap I shows a lower methylation level than Hap II, the fetus is more likely to have inherited maternal Hap I. Otherwise, the fetus is more likely to have inherited maternal Hap II. In clinical testing, haplotype-based methylation imbalance analysis may be used to determine whether a fetus has inherited a maternal haplotype linked to a genetic disorder, such as, but not limited to, fragile X syndrome, muscular dystrophy, Huntington's disease, or beta-thalassemia.
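The decision rule described for FIG. 108 can be sketched as a comparison of the two maternal haplotype methylation levels measured in plasma. The function name and the handling of an exact tie (no call) are assumptions; a practical test would presumably also require the difference to be statistically significant, which this sketch does not model.

```python
from typing import Optional


def infer_inherited_maternal_haplotype(m_hap1: float,
                                       m_hap2: float) -> Optional[str]:
    """Return the maternal haplotype more likely inherited by the fetus.
    Fetal-derived DNA is hypomethylated, so the inherited haplotype is
    expected to show the lower methylation level in maternal plasma."""
    if m_hap1 < m_hap2:
        return "Hap I"
    if m_hap2 < m_hap1:
        return "Hap II"
    return None  # assumed behavior: no call when the levels are equal


call = infer_inherited_maternal_haplotype(0.68, 0.74)
```

With Hap I less methylated than Hap II, the sketch reports Hap I; the input levels are invented example values.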

F. Example Method for Classifying Disorders
FIG. 109 shows an exemplary method 1090 of classifying a disorder in an organism having a first haplotype and a second haplotype. Method 1090 includes comparing relative methylation levels between the two haplotypes.

At block 1091, DNA molecules from a biological sample are analyzed to identify their locations in a reference genome corresponding to the organism. The DNA molecules can be cellular DNA molecules. For example, the DNA molecules can be sequenced to obtain sequence reads, and the sequence reads can be mapped (aligned) to the reference genome. If the organism is a human, the reference genome is a reference human genome, potentially from a particular subpopulation. As another example, the DNA molecules can be analyzed with different probes (e.g., after PCR or other amplification), where each probe corresponds to a genomic location that may cover a heterozygous locus and one or more CpG sites, as described below.

Further, the DNA molecules can be analyzed to determine a respective allele of each DNA molecule. For example, the allele of a DNA molecule can be determined from a sequence read obtained by sequencing or from a particular probe that hybridizes to the DNA molecule; both techniques can provide a sequence read (e.g., a probe that hybridizes can be treated as the sequence read). For each DNA molecule, the methylation status at each of one or more sites (e.g., CpG sites) can be determined.

At block 1092, one or more heterozygous loci of a first portion of a first chromosomal region are identified. Each heterozygous locus can include a corresponding first allele in the first haplotype and a corresponding second allele in the second haplotype. The one or more heterozygous loci may be a first plurality of heterozygous loci, and a second plurality of heterozygous loci may correspond to a different chromosomal region.

At block 1093, a first set of the plurality of DNA molecules is identified. Each DNA molecule of the first set is located at any one of the heterozygous loci from block 1092 and includes the corresponding first allele, so that the DNA molecule can be identified as corresponding to the first haplotype. A DNA molecule may be located at more than one heterozygous locus, but typically a read includes only one heterozygous locus. Each DNA molecule of the first set also includes at least one of N genomic sites, where the genomic sites are used to measure methylation levels. N is an integer, for example greater than or equal to 1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1,000, 2,000, or 5,000. Thus, a read of a DNA molecule can indicate coverage of 1 site, 2 sites, and so on. A genomic site can include a site at which a CpG dinucleotide is present.

At block 1094, a first methylation level of the first portion of the first haplotype is determined using the first set of the plurality of DNA molecules. The first methylation level can be determined by any method described herein. The first portion can correspond to a single site or include many sites. The first portion of the first haplotype can be 1 kb or longer. For example, the first portion of the first haplotype may be 1 kb, 5 kb, 10 kb, 15 kb, or 20 kb or longer. The methylation data may be data from cellular DNA.

In some embodiments, a plurality of first methylation levels may be determined for a plurality of portions of the first haplotype. Each portion can have a length of 5 kb or more, or any size disclosed herein for the first portion of the first haplotype.

At block 1095, a second set of the plurality of DNA molecules is identified. Each DNA molecule of the second set is located at any one of the heterozygous loci from block 1092 and includes the corresponding second allele, so that the DNA molecule can be identified as corresponding to the second haplotype. Each DNA molecule of the second set also includes at least one of the N genomic sites, where the genomic sites are used to measure methylation levels.
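Blocks 1092, 1093, and 1095 amount to sorting molecules by the allele they carry at a heterozygous locus. The sketch below illustrates one way to do that, assuming each molecule has already been reduced to a mapping from genomic position to observed base; the type aliases, the function name, and the handling of conflicting alleles are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Dict, List, Tuple

# Hypothetical representations:
HetLoci = Dict[int, Tuple[str, str]]  # position -> (first-haplotype allele, second-haplotype allele)
Molecule = Dict[int, str]             # position -> base observed on the molecule


def split_by_haplotype(
    molecules: List[Molecule], het_loci: HetLoci
) -> Tuple[List[Molecule], List[Molecule]]:
    """Return (first set, second set) of molecules assigned by their alleles at heterozygous loci."""
    first_set: List[Molecule] = []
    second_set: List[Molecule] = []
    for mol in molecules:
        haplotypes = set()
        for pos, base in mol.items():
            if pos not in het_loci:
                continue
            first_allele, second_allele = het_loci[pos]
            if base == first_allele:
                haplotypes.add(1)
            elif base == second_allele:
                haplotypes.add(2)
        if haplotypes == {1}:
            first_set.append(mol)
        elif haplotypes == {2}:
            second_set.append(mol)
        # Molecules with no informative allele, or with conflicting alleles, stay unassigned.
    return first_set, second_set
```

Leaving conflicting molecules unassigned is a conservative choice; they could instead be resolved by majority vote across loci when long reads span several heterozygous loci.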

At block 1096, a second methylation level of the first portion of the second haplotype is determined using the second set of the plurality of DNA molecules. The second methylation level can be determined by any method described herein. The first portion of the second haplotype can be 1 kb or longer, or any size given for the first portion of the first haplotype. The first portion of the first haplotype can be complementary to the first portion of the second haplotype. The first portion of the first haplotype and the first portion of the second haplotype can form a circular DNA molecule. The first methylation level of the first portion of the first haplotype can be determined using data from the circular DNA molecule. For example, analysis of the circular DNA can include the analysis described with FIG. 1, FIG. 2, FIG. 4, FIG. 5, FIG. 6, FIG. 7, FIG. 8, FIG. 50, or FIG. 61.

The circular DNA molecule can be formed by cutting a double-stranded DNA molecule using a Cas9 complex to form a cut double-stranded DNA molecule. Hairpin adapters can be ligated onto the ends of the cut double-stranded DNA molecule. In embodiments, both ends of the double-stranded DNA molecule can be cut and ligated. For example, cutting, ligation, and subsequent analysis may proceed as described with FIG. 91.

In some embodiments, a plurality of second methylation levels may be determined for a plurality of portions of the second haplotype. Each portion of the plurality of portions of the second haplotype can be complementary to a portion of the plurality of portions of the first haplotype.

At block 1097, a value of a parameter is calculated using the first methylation level and the second methylation level. The parameter may be a separation value. The separation value may be a difference between the two methylation levels or a ratio of the two methylation levels.
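The two forms of separation value named at block 1097 can be written out directly. The following is a minimal sketch, assuming the methylation level of a portion is the fraction of covered sites called methylated; the function names and the `mode` argument are hypothetical.

```python
def methylation_level(methylated_sites: int, covered_sites: int) -> float:
    """Fraction of covered sites that were called methylated (one possible definition of a methylation level)."""
    if covered_sites <= 0:
        raise ValueError("at least one covered site is required")
    return methylated_sites / covered_sites


def separation_value(first_level: float, second_level: float, mode: str = "difference") -> float:
    """Separation between two haplotype methylation levels, as a difference or as a ratio."""
    if mode == "difference":
        return first_level - second_level
    if mode == "ratio":
        if second_level == 0:
            raise ZeroDivisionError("second methylation level is zero; ratio is undefined")
        return first_level / second_level
    raise ValueError(f"unknown mode: {mode}")
```

With balanced haplotypes the difference is near 0 and the ratio is near 1, which is why those values are natural reference points for the comparison at block 1098.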

When a plurality of portions of the second haplotype is used, a separation value can be calculated for each portion of the second haplotype using the second methylation level of that portion and the first methylation level of the complementary portion of the first haplotype. Each separation value can be compared to a cutoff value.

The cutoff value can be determined from tissues not having the disorder. The parameter may be the number of portions of the second haplotype whose separation value exceeds the cutoff value. For example, the number of portions of the second haplotype whose separation value exceeds the cutoff value can be analogous to the number of regions shown as having a difference of greater than 30% in FIG. 105A, FIG. 105B, and FIG. 106. In FIG. 105A, FIG. 105B, and FIG. 106, the separation value is a ratio and the cutoff value is 30%. In some embodiments, the cutoff value can be determined from tissues having the disorder.

In another example, the separation values of the portions can be aggregated (e.g., summed), which may be done as a weighted sum or as a sum of functions of the respective separation values. Such an aggregation can provide the value of the parameter.
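Both ways of turning per-portion separation values into a single parameter (counting portions above a cutoff, or aggregating the values) are short to express. The sketch below is illustrative only; the function names, the default unit weights, and the `transform` hook standing in for "a function of the respective separation values" are assumptions.

```python
from typing import Callable, Optional, Sequence


def count_above_cutoff(separations: Sequence[float], cutoff: float) -> int:
    """Number of portions whose separation value exceeds the cutoff value."""
    return sum(1 for s in separations if s > cutoff)


def aggregate(
    separations: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    transform: Callable[[float], float] = lambda s: s,
) -> float:
    """Weighted sum of (optionally transformed) separation values; unit weights by default."""
    if weights is None:
        weights = [1.0] * len(separations)
    if len(weights) != len(separations):
        raise ValueError("weights and separations must have the same length")
    return sum(w * transform(s) for w, s in zip(weights, separations))
```

Passing `transform=abs` makes the aggregate insensitive to which haplotype is the more methylated one, which may or may not be desirable depending on the disorder being classified.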

At block 1098, the value of the parameter is compared to a reference value. The reference value can be determined using reference tissue without the disorder. The reference value may be a separation value. For example, the reference value may represent that there should be no significant difference between the methylation levels of the two haplotypes. For example, the reference value can be a statistical difference of 0 or a ratio of about 1. When a plurality of portions is used, the reference value can be the number of portions in a healthy organism for which the two haplotypes show a separation value exceeding the cutoff value. In some embodiments, the reference value can be determined using reference tissue with the disorder.

At block 1099, a classification of the disorder in the organism is determined using the comparison of the value of the parameter to the reference value. If the value of the parameter exceeds the reference value, the disorder may be determined to be present or more likely. The disorder can include cancer. The cancer can be any cancer described herein. The classification of the disorder can be a likelihood of the disorder. The classification of the disorder can include a severity of the disorder. For example, a larger parameter value, indicating a greater number of portions with haplotype imbalance, may indicate a more severe form of cancer.
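The comparison at blocks 1098 and 1099 can be illustrated with a small sketch. It assumes the parameter is a count of imbalanced portions and that the reference value is taken as the largest count seen in reference tissue without the disorder; that choice of reference, the function names, and the returned fields are assumptions rather than details given in the text.

```python
from typing import Dict, Sequence, Union


def reference_from_controls(control_values: Sequence[float]) -> float:
    # Assumption: use the largest value observed in reference tissue without the disorder.
    if not control_values:
        raise ValueError("at least one control value is required")
    return max(control_values)


def classify(parameter_value: float, reference_value: float) -> Dict[str, Union[bool, float]]:
    """Flag the disorder as more likely when the parameter exceeds the reference value."""
    excess = parameter_value - reference_value
    return {
        "disorder_more_likely": excess > 0,
        # A larger excess may point to a more severe form, per the description of block 1099.
        "excess_over_reference": excess,
    }
```

For example, `classify(7, reference_from_controls([1, 3, 2]))` flags the disorder as more likely with an excess of 4 portions over the reference.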

Although the method described with FIG. 109 involves classifying a disorder, similar methods can be used to determine any condition or characteristic that may result from an imbalance in methylation levels between haplotypes. For example, the methylation level of a haplotype from fetal DNA may be lower than the methylation level of a haplotype from maternal DNA. Methylation levels can therefore be used to classify nucleic acids as maternal or fetal.

When the disorder is cancer, different chromosomal regions of a tumor may exhibit such differences in methylation. Different treatments may be provided depending on which regions are affected. Further, subjects having different regions that exhibit such differences in methylation may have different prognoses.

Chromosomal regions (portions) having sufficient separation (e.g., greater than the cutoff value) can be identified as aberrant (or as having aberrant separation). The pattern of aberrant regions (potentially accounting for which haplotype is higher than the other) can be compared to a reference pattern (e.g., as determined from a subject having cancer, potentially a particular type of cancer, or from a healthy subject). If the two patterns are the same within a threshold (e.g., fewer than a specified number of regions/portions differ), the subject can be identified as having the classification of the disorder associated with that reference pattern. Such classifications can include, for example, imprinting disorders, as described herein.
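The pattern comparison described above can be sketched as a count of regions at which two patterns disagree. In this illustration a pattern maps a region name to +1 (first haplotype higher), -1 (second haplotype higher), or 0 (no aberrant separation); the encoding, the function names, and the rule of returning the closest reference are all assumptions.

```python
from typing import Dict, Optional

Pattern = Dict[str, int]  # region name -> +1, -1, or 0 (hypothetical encoding)


def differing_regions(pattern: Pattern, reference: Pattern) -> int:
    """Number of regions at which the two patterns disagree; missing regions count as 0."""
    regions = set(pattern) | set(reference)
    return sum(1 for r in regions if pattern.get(r, 0) != reference.get(r, 0))


def matching_classification(
    pattern: Pattern, references: Dict[str, Pattern], max_differences: int
) -> Optional[str]:
    """Label of the closest reference pattern that differs at fewer than `max_differences` regions."""
    best_label, best_diff = None, None
    for label, ref in references.items():
        diff = differing_regions(pattern, ref)
        if diff < max_differences and (best_diff is None or diff < best_diff):
            best_label, best_diff = label, diff
    return best_label
```

Returning `None` when no reference is close enough keeps an unmatched subject from being forced into a classification.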

VII. Single-Molecule Methylation Analysis of Hybrid Molecules
To further evaluate the performance and utility of the embodiments disclosed herein for determining base modifications of nucleic acids, human and mouse hybrid DNA fragments were artificially created in which the human part was methylated and the mouse part was unmethylated, or vice versa. Determining the junctions of hybrid or chimeric DNA molecules may allow detection of gene fusions in various disorders or diseases, including cancer.

A. Methods for Creating Human and Mouse Hybrid DNA Fragments
This section describes procedures for creating hybrid DNA fragments and then determining the methylation profiles of the fragments.

In one embodiment, human DNA was amplified by whole genome amplification, which eliminates the original methylation signature of the human genome because whole genome amplification does not preserve methylation states. Whole genome amplification can be performed using exonuclease-resistant thiophosphate-modified degenerate hexamers as primers, which can bind randomly across the genome and allow a polymerase (e.g., Phi29 DNA polymerase) to amplify the DNA without thermal cycling. The amplified DNA product is unmethylated. The amplified human DNA molecules were further treated with the CpG methyltransferase M.SssI, which in theory fully methylates all cytosines in the CpG context in double-stranded, unmethylated or hemimethylated DNA. Such amplified human DNA treated with M.SssI therefore becomes methylated DNA molecules.

In contrast, mouse DNA was subjected to whole genome amplification so that unmethylated mouse DNA fragments were produced.

FIG. 110 shows the creation of human-mouse hybrid DNA fragments in which the mouse part is unmethylated and the human part is methylated. Filled lollipops represent methylated CpG sites. Unfilled lollipops represent unmethylated CpG sites. The thick bar 11010 with diagonal stripes represents the methylated human part. The thick bar 11020 with vertical stripes represents the unmethylated mouse part.

To generate hybrid human-mouse DNA molecules, in one embodiment, the whole-genome-amplified and M.SssI-treated DNA molecules were further digested with HindIII and NcoI to generate sticky ends that facilitate downstream ligation. In one embodiment, the methylated human DNA fragments were then mixed with the unmethylated mouse DNA fragments in an equimolar ratio. In one embodiment, this human-mouse DNA mixture was subjected to a ligation process mediated by DNA ligase at 20 °C for 15 minutes. As shown in FIG. 110, this ligation reaction produces three types of resulting molecules: human-mouse hybrid DNA molecules (a: human-mouse hybrid fragments), human-only DNA molecules (b: human-human ligation, and c: unligated human DNA), and mouse-only DNA molecules (d: mouse-mouse ligation, and e: unligated mouse DNA). The DNA products after ligation were subjected to single-molecule real-time sequencing. The sequencing results were analyzed according to the disclosure provided herein to determine methylation states.

FIG. 111 shows the creation of human-mouse hybrid DNA fragments in which the human part is unmethylated and the mouse part is methylated. Filled lollipops represent methylated CpG sites. Unfilled lollipops represent unmethylated CpG sites. The thick bar 11110 with diagonal stripes represents the methylated mouse part. The thick bar 11120 with vertical stripes represents the unmethylated human part.

In the embodiment of FIG. 111, mouse DNA molecules were amplified by whole genome amplification so that the original methylation of the mouse genome was eliminated. The amplified DNA product is unmethylated. The amplified mouse DNA was further treated with M.SssI. Such amplified mouse DNA treated with M.SssI therefore becomes methylated DNA molecules. In contrast, human DNA fragments were subjected to whole genome amplification so that unmethylated human fragments were obtained. In one embodiment, the methylated mouse fragments were then mixed with the unmethylated human fragments in an equimolar ratio. This human-mouse DNA mixture was subjected to a ligation process mediated by DNA ligase. As shown in FIG. 111, this ligation reaction produces three types of resulting molecules: human-mouse hybrid DNA molecules (a: human-mouse hybrid fragments), human-only DNA molecules (b: human-human ligation, and c: unligated human DNA), and mouse-only DNA molecules (d: mouse-mouse ligation, and e: unligated mouse DNA). The DNA products after ligation were subjected to single-molecule real-time sequencing. The sequencing results were analyzed according to the disclosure provided herein to determine methylation states.

According to the embodiment shown in FIG. 110, the inventors prepared an artificial DNA mixture (designated sample MIX01) that contained human-mouse hybrid DNA molecules, human-only DNA, and mouse-only DNA, in which the human-associated DNA molecules were methylated and the mouse DNA molecules were unmethylated. For sample MIX01, 166 million subreads were obtained that could be aligned to the human reference genome, to the mouse reference genome, or partly to the human genome and partly to the mouse genome. These subreads were generated from approximately 5 million Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing wells. Each molecule in a SMRT sequencing well was sequenced 32 times on average (range: 1 to 881 times).

To determine the human DNA part and the mouse DNA part of a hybrid fragment, a consensus sequence was first constructed by combining the nucleotide information from all relevant subreads in a well. In total, 3,435,657 consensus sequences were obtained for sample MIX01. The data set was generated from DNA prepared with the Sequel II Sequencing Kit 1.0.

The consensus sequences were aligned to a reference genome comprising both the human and mouse references. A total of 3.2 million aligned consensus sequences were obtained. Of these, 39.6% were classified as human-only DNA, 26.5% as mouse-only DNA, and 30.2% as human-mouse hybrid DNA.

FIG. 112 shows the length distribution of DNA molecules in the DNA mixture after ligation (sample MIX01). The x-axis indicates the length of a DNA molecule. The y-axis indicates the frequency associated with that length. As shown in FIG. 112, the human-mouse hybrid DNA molecules had a longer length distribution, consistent with the fact that they are combinations of at least two types of molecules.

FIG. 113 shows a junction region where a first DNA (A) and a second DNA (B) are joined together. DNA (A) and DNA (B) can be digested with restriction enzymes. In one embodiment, to improve the efficiency of ligation by using sticky ends, the human and mouse DNA were digested before the ligation step with the restriction enzymes HindIII and NcoI, which recognize the A^AGCTT and C^CATGG sites, respectively. DNA (A) and DNA (B) can then be ligated. Among 698,492 human-mouse hybrid DNA molecules having junction regions, 88% were found to carry the enzyme recognition sites A^AGCTT and C^CATGG, further suggesting that ligation between human and mouse DNA fragments had occurred. The junction region is defined as the region or site where the first DNA fragment and the second DNA fragment are physically joined together. Because the junction contains sequence common to both DNA (A) and DNA (B), the portion of a strand corresponding to the junction cannot be assigned to DNA (A) or DNA (B) from sequence alone. Analyzing the methylation pattern or density of the portion of a strand corresponding to the junction can be used to determine whether that portion is from DNA (A) or DNA (B). As an example, DNA (A) can be viral DNA and DNA (B) can be human DNA. Determining the precise junction can indicate whether and how such integrated DNA disrupts the structure of a protein.
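Checking whether a hybrid molecule carries a HindIII or NcoI recognition site at its junction can be sketched as a substring search around the junction coordinate. This is a minimal illustration: the function name, the `window` size, and the idea of representing the junction as a single index on the consensus sequence are assumptions. Both recognition sequences are palindromic, so searching one strand is sufficient.

```python
# Recognition sequences as given in the text (cut positions omitted).
RECOGNITION_SITES = {"HindIII": "AAGCTT", "NcoI": "CCATGG"}


def enzymes_at_junction(sequence: str, junction: int, window: int = 10) -> list:
    """Return enzymes whose recognition site lies wholly within `window` bases on either side of `junction`."""
    start = max(0, junction - window)
    end = min(len(sequence), junction + window)
    flank = sequence[start:end].upper()
    return [name for name, site in RECOGNITION_SITES.items() if site in flank]
```

A window is used rather than an exact position because, as noted for hybrid molecules, sequencing and alignment errors mean the junction may only be located approximately.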

FIG. 114 shows methylation analysis of the DNA mixture. The bar 11410 with diagonal stripes indicates the junction region observed in the alignment analysis, which would be introduced by the restriction enzyme treatment before ligation. "RE site" denotes a restriction enzyme (RE) recognition site.

As shown in FIG. 114, in one embodiment, the aligned consensus sequences were grouped into three categories, as follows.

(1) The sequenced DNA aligned only to the human reference genome and not to the mouse reference genome, with reference to one or more alignment criteria. In one embodiment, one alignment criterion can be defined as, but is not limited to, 100%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, or 20% of the contiguous nucleotides of the sequenced DNA being alignable to the human reference. In one embodiment, one alignment criterion is that the remaining part of the sequenced fragment that did not align to the human reference cannot be aligned to the mouse reference genome. In one embodiment, one alignment criterion was that the sequenced DNA could be aligned to a single region of the reference human genome. In one embodiment, the alignment can be perfect. In yet other embodiments, the alignment can accommodate nucleotide discrepancies, including insertions, mismatches, and deletions, provided that such discrepancies are below a certain threshold, such as, but not limited to, 1%, 2%, 3%, 4%, 5%, 10%, 20%, or 30% of the length of the aligned sequence. In another embodiment, the alignment can be to more than one location in the reference genome. In yet other embodiments, alignment to one or more sites in the reference genome is described in a probabilistic manner (e.g., indicating the likelihood of an erroneous alignment), and the probability measure can be used in subsequent processing.

(2) The sequenced DNA aligned only to the mouse reference genome and not to the human reference genome, with reference to one or more alignment criteria. In one embodiment, one alignment criterion can be defined as, but is not limited to, 100%, 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, or 20% of the contiguous nucleotides of the sequenced DNA being alignable to the mouse reference. In one embodiment, one alignment criterion is that the remaining part cannot be aligned to the human reference genome. In one embodiment, one alignment criterion was that the sequenced DNA could be aligned to a single region of the reference mouse genome. In one embodiment, the alignment can be perfect. In yet other embodiments, the alignment can accommodate nucleotide discrepancies, including insertions, mismatches, and deletions, provided that such discrepancies are below a certain threshold, such as, but not limited to, 1%, 2%, 3%, 4%, 5%, 10%, 20%, or 30% of the length of the aligned sequence. In another embodiment, the alignment can be to more than one location in the reference genome. In yet other embodiments, alignment to one or more sites in the reference genome is described in a probabilistic manner (e.g., indicating the likelihood of an erroneous alignment), and the probability measure can be used in subsequent processing.

(3) One part of the sequenced DNA aligned uniquely to the human reference genome, while another part aligned uniquely to the mouse reference genome. In one embodiment, if restriction enzymes were used before ligation, a junction region corresponding to a restriction enzyme cleavage site would be observed in the alignment analysis. In some embodiments, the junction region between the human and mouse DNA parts could only be determined approximately, within a certain region, because of sequencing and alignment errors. In some embodiments, if molecules were ligated without restriction enzyme cleavage (e.g., by blunt-end ligation), no restriction enzyme recognition site would be observed at the junction region of the human-mouse hybrid DNA fragment.
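The three categories above can be sketched as a small classifier over the aligned segments of one consensus sequence. The sketch assumes each segment is a unique, non-overlapping alignment given as (species, start, end) in read coordinates; the function name, the `Segment` alias, the 80% default for the aligned-fraction criterion (one of the example values listed), and the extra "unclassified" outcome are assumptions.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (species, start on read, end on read), end exclusive; hypothetical layout


def classify_consensus(read_length: int, segments: List[Segment], min_fraction: float = 0.8) -> str:
    """Assign a consensus sequence to human-only, mouse-only, or human-mouse hybrid."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    aligned = {"human": 0, "mouse": 0}
    for species, start, end in segments:
        aligned[species] += end - start
    if aligned["human"] > 0 and aligned["mouse"] > 0:
        return "human-mouse hybrid"  # category (3): parts align to each genome
    if aligned["human"] / read_length >= min_fraction:
        return "human-only"          # category (1)
    if aligned["mouse"] / read_length >= min_fraction:
        return "mouse-only"          # category (2)
    return "unclassified"            # fails the aligned-fraction criterion
```

In the hybrid case, the gap or boundary between the human and mouse segments gives an approximate location for the junction region.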

The interpulse durations (IPDs), pulse widths (PWs), and sequence context surrounding CpG sites were obtained from the subreads corresponding to the consensus sequences. The methylation of each DNA molecule, including human-only DNA, mouse-only DNA, and human-mouse hybrid DNA, could thereby be determined according to the embodiments in this disclosure.

B. Methylation Results
This section describes the methylation results for the hybrid DNA fragments. Methylation densities can be used to identify the origin of various parts of a hybrid DNA fragment.

FIG. 115 shows a box plot of the probability of CpG sites being methylated in sample MIX01. The x-axis indicates the three different types of molecules present in sample MIX01: human-only DNA, mouse-only DNA, and human-mouse hybrid DNA (containing both a human part and a mouse part). The y-axis indicates the probability that a CpG site of a particular single DNA molecule is methylated. This assay was performed in such a way that the human DNA was more methylated and the mouse DNA was more unmethylated.

As shown in FIG. 115, the probability of CpG sites being methylated in human-only DNA (median: 0.66; range: 0-1) was significantly higher than that in mouse-only DNA (median: 0.06; range: 0-1) (P value < 0.0001). These results were consistent with the assay design: the human DNA was more methylated because of treatment with the CpG methyltransferase M.SssI, whereas the mouse DNA was more unmethylated because methylation is not maintained during whole genome amplification. Furthermore, CpG sites within the human DNA part of the human-mouse hybrid DNA molecules (median: 0.06; range: 0-1) showed a higher probability of being methylated than CpG sites within the mouse DNA part (median: 0.69; range: 0-1) (P value < 0.0001). These data indicate that the disclosed methods can accurately determine the methylation status of DNA molecules as well as of segments within a DNA molecule.

The probability of methylation refers to the estimated probability that a particular CpG site within a single molecule is methylated, based on the statistical model used. A probability of 1 indicates that, according to the statistical model and the measured parameters (including IPD, PW, and sequence context), 100% of the CpG sites are methylated. A probability of 0 indicates that, according to the statistical model and the measured parameters (including IPD, PW, and sequence context), 0% of the CpG sites are methylated; in other words, all CpG sites are unmethylated given the measured parameters. FIG. 115 shows the distributions of the methylation probability; the distributions for the human-only DNA and for the human part are broader than those of their mouse counterparts. Bisulfite sequencing is used to measure the methylation of similar samples and confirms that methylation is not complete. The results are shown below. FIG. 115 shows a significant difference in methylation between human DNA and mouse DNA.
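As a minimal sketch, the per-group summary shown in a box plot such as FIG. 115 (median and range of per-site methylation probabilities) might be computed as follows. The function name `summarize` and the toy probabilities are hypothetical and serve only as an example; they are not data from the disclosure.

```python
from statistics import median
from typing import Dict, List


def summarize(probs_by_group: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Return the median and range of methylation probabilities per DNA type."""
    summary = {}
    for group, probs in probs_by_group.items():
        if not probs:
            continue  # skip groups with no CpG sites
        summary[group] = {
            "median": median(probs),
            "min": min(probs),
            "max": max(probs),
        }
    return summary


# Toy values for illustration only (not measurements from any sample).
toy = {
    "human-only": [0.9, 0.7, 0.1],
    "mouse-only": [0.0, 0.1, 0.05],
}
result = summarize(toy)
```

A formal comparison between groups would additionally require a statistical test, which is outside the scope of this sketch.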

According to the embodiment shown in FIG. 111, the inventors prepared an artificial DNA mixture (named sample MIX02) comprising human-mouse hybrid DNA molecules, human-only DNA, and mouse-only DNA, in which the human parts were unmethylated and the mouse parts were methylated. For sample MIX02, 140 million subreads were obtained that could be aligned to the human reference genome, to the mouse reference genome, or partly to the human genome and partly to the mouse genome. These subreads were generated from approximately 5 million Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing wells. Each molecule in a SMRT sequencing well was sequenced 27 times on average (range: 1-1028).

The inventors also constructed consensus sequences by combining the nucleotide information from all relevant subreads in a well. In total, 3,265,487 consensus sequences were obtained for sample MIX02. The consensus sequences were aligned, using BWA (Li H et al., Bioinformatics. 2010;26(5):589-595), to a reference genome comprising both the human and the mouse references. Three million aligned consensus sequences were obtained. Among them, 30.5% were classified as human-only DNA, 32.2% as mouse-only DNA, and 33.8% as human-mouse hybrid DNA. The dataset was generated from DNA prepared with the Sequel II Sequencing Kit 1.0.
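The classification into human-only, mouse-only, and hybrid DNA might be sketched as below, purely as an example. The sketch assumes that each consensus sequence has already been reduced to the list of species to which its aligned segments map uniquely; the function names and the input format are hypothetical and do not reflect the output of BWA or any other aligner.

```python
from collections import Counter
from typing import Dict, Iterable, List


def classify_molecule(segment_species: Iterable[str]) -> str:
    """Classify one consensus sequence from the species of its aligned segments."""
    species = set(segment_species)
    if species == {"human"}:
        return "human-only"
    if species == {"mouse"}:
        return "mouse-only"
    if species == {"human", "mouse"}:
        return "hybrid"
    return "unclassified"  # e.g. no unique alignment


def class_fractions(molecules: List[List[str]]) -> Dict[str, float]:
    """Fraction of molecules in each class."""
    counts = Counter(classify_molecule(m) for m in molecules)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy input: four molecules, two of which contain both human and mouse segments.
toy = [["human"], ["mouse"], ["human", "mouse"], ["mouse", "human", "mouse"]]
fractions = class_fractions(toy)
```

An "unclassified" label is kept so that the reported fractions need not sum to 100% across the three named classes.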

FIG. 116 shows the length distribution of DNA molecules in the DNA mixture after cross-ligation for sample MIX02. The x-axis shows the length of the DNA molecules. The y-axis shows the frequency associated with each DNA molecule length. As shown in FIG. 116, the human-mouse hybrid DNA molecules had a longer length distribution, consistent with the fact that they were produced by ligating two or more molecules.

FIG. 117 shows box plots of the probability of CpG sites being methylated in sample MIX02. The methylation status was determined according to methods described herein. The x-axis shows the three types of molecule present in sample MIX02: human-only DNA, mouse-only DNA, and human-mouse hybrid DNA (comprising both a human part and a mouse part). The y-axis shows the probability that a CpG site is methylated. The assay was designed so that the human DNA was unmethylated and the mouse DNA was methylated.

As shown in FIG. 117, the probability of CpG sites being methylated in human-only DNA (median: 0.06; range: 0-1) was significantly lower than that in mouse-only DNA (median: 0.93; range: 0-1) (P value < 0.0001). These results were consistent with the assay design: the human DNA was more unmethylated because methylation cannot be maintained during whole genome amplification, whereas the mouse DNA was more methylated because of treatment with the CpG methyltransferase M.SssI. Furthermore, CpG sites within the human DNA part of the human-mouse hybrid DNA molecules (median: 0.93; range: 0-1) showed a lower probability of being methylated than CpG sites within the mouse DNA part (median: 0.07; range: 0-1) (P value < 0.0001). These data indicate that the disclosed methods can accurately determine the methylation status of DNA molecules as well as of segments within a DNA molecule.

Bisulfite sequencing was used to measure the methylation of the human-mouse hybrid fragments whose methylation patterns had been determined by single-molecule real-time sequencing according to embodiments of the present disclosure. Sample MIX01 (human DNA methylated, mouse DNA unmethylated) and sample MIX02 (human DNA unmethylated, mouse DNA methylated) were sheared by sonication, giving a mixture of DNA fragments with a median size of 196 bp (interquartile range: 161-268). Paired-end bisulfite sequencing (BS-Seq) was then performed on the MiSeq platform (Illumina) with a read length of 300 bp × 2. For MIX01 and MIX02, 3.7 million and 2.9 million sequenced fragments, respectively, were obtained that aligned to the human reference genome, to the mouse reference genome, or partly to the human genome and partly to the mouse genome. For MIX01, 41.6% of the aligned fragments were classified as human-only DNA, 56.6% as mouse-only DNA, and 1.8% as human-mouse hybrid DNA. For MIX02, 61.8% of the aligned fragments were classified as human-only DNA, 36.3% as mouse-only DNA, and 1.9% as human-mouse hybrid DNA. The percentage of sequenced fragments determined by BS-Seq to be human-mouse hybrid DNA (<2%) was much lower than the percentage observed in the Pacific Biosciences sequencing results (>30%). Notably, long fragments (median of approximately 2 kb) were sequenced by Pacific Biosciences sequencing, whereas for BS-Seq the long fragments were sheared into short fragments (median of approximately 196 bp) suitable for MiSeq. Such a shearing process greatly dilutes the human-mouse hybrid fragments, because only the short fragments that happen to span a ligation junction still contain both human and mouse sequence.

FIG. 118 shows a table comparing the methylation determined by bisulfite sequencing and by Pacific Biosciences sequencing for MIX01. The leftmost section of the table shows the type of DNA: 1) human-only, 2) mouse-only, and 3) human-mouse hybrid (divided into the human part and the mouse part). The middle section of the table shows details from bisulfite sequencing, including the number of CG sites and the methylation density. The rightmost section of the table shows details from Pacific Biosciences sequencing, including the number of CG sites and the methylation density.

As shown in FIG. 118, in the results of both bisulfite sequencing and Pacific Biosciences sequencing, the human-only DNA of MIX01 consistently showed a higher methylation density than the mouse-only DNA. For the human-mouse hybrid fragments, bisulfite sequencing determined the methylation levels of the human part and the mouse part to be 46.8% and 2.3%, respectively. These results confirmed that, as determined by Pacific Biosciences sequencing according to the present disclosure, the methylation density was higher in the human part than in the mouse part. With Pacific Biosciences sequencing, a methylation density of 57.4% was observed in the human part and a lower methylation density of 12.1% in the mouse part. These results suggest that determining methylation by Pacific Biosciences sequencing according to the present disclosure is feasible. In particular, Pacific Biosciences sequencing can be used to determine differing methylation densities, including in DNA having one section with a higher methylation density than another section. The methylation densities determined by Pacific Biosciences sequencing according to the present disclosure were observed to be higher than those determined by bisulfite sequencing. Such estimates can be adjusted using the difference between the results determined by the two techniques so that results can be compared across techniques.
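One possible form of such an adjustment is sketched below as an example only. It assumes that methylation density is the percentage of analyzed CpG sites called methylated and that the difference between techniques can be approximated by a single additive offset; both the additive model and the function names are assumptions of this sketch, not the adjustment specified by the disclosure. The paired values are the hybrid-fragment densities for MIX01 quoted above.

```python
from typing import List, Tuple


def methylation_density(methylated_sites: int, total_sites: int) -> float:
    """Percentage of analyzed CpG sites that are called methylated."""
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    return 100.0 * methylated_sites / total_sites


def mean_offset(pairs: List[Tuple[float, float]]) -> float:
    """Mean of (SMRT density - bisulfite density) over paired measurements."""
    diffs = [smrt - bisulfite for bisulfite, smrt in pairs]
    return sum(diffs) / len(diffs)


def adjust(smrt_density: float, offset: float) -> float:
    """Shift a SMRT-derived density onto the bisulfite scale, clipped to 0-100."""
    return min(100.0, max(0.0, smrt_density - offset))


# (bisulfite %, SMRT %) for the human and mouse parts of MIX01 hybrid fragments.
mix01_hybrid = [(46.8, 57.4), (2.3, 12.1)]
offset = mean_offset(mix01_hybrid)
adjusted_human = adjust(57.4, offset)
```

A single additive offset is the simplest choice; a regression fitted over many regions would be an alternative if the difference depends on the methylation level.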

FIG. 119 shows a table comparing the methylation determined by bisulfite sequencing and by Pacific Biosciences sequencing for MIX02. The leftmost section of the table shows the type of DNA: 1) human-only, 2) mouse-only, and 3) human-mouse hybrid (divided into the human part and the mouse part). The middle section of the table shows details from bisulfite sequencing, including the number of CG sites and the methylation density. The rightmost section of the table shows details from Pacific Biosciences sequencing, including the number of CG sites and the methylation density.

As shown in FIG. 119, in the results of both bisulfite sequencing and Pacific Biosciences sequencing, the human-only DNA of MIX02 consistently showed a lower methylation density than the mouse-only DNA. For the human-mouse hybrid fragments, bisulfite sequencing determined the methylation levels of the human part and the mouse part to be 1.8% and 67.4%, respectively. These results further confirmed that, as determined by Pacific Biosciences sequencing according to the present disclosure, the methylation density was lower in the human part than in the mouse part. With Pacific Biosciences sequencing according to the present disclosure, a methylation density of 13.1% was observed in the human part and a higher methylation density of 72.2% in the mouse part. These results also suggest that determining methylation by Pacific Biosciences sequencing according to the present disclosure is feasible. In particular, Pacific Biosciences sequencing can be used to determine differing methylation densities, including in DNA having one section with a lower methylation density than another section. The methylation densities determined by Pacific Biosciences sequencing according to the present disclosure were again observed to be higher than those determined by bisulfite sequencing. Such estimates can be adjusted using the difference between the results determined by the two techniques so that results can be compared across techniques.

FIG. 120A shows the methylation levels in 5-Mb bins for human-only DNA and mouse-only DNA for MIX01. FIG. 120B shows the methylation levels in 5-Mb bins for human-only DNA and mouse-only DNA for MIX02. In both figures, the y-axis shows the methylation level in percent. The x-axis shows bisulfite sequencing and Pacific Biosciences sequencing for each of the human-only DNA and the mouse-only DNA.

In FIGS. 120A and 120B, the methylation levels determined by Pacific Biosciences sequencing according to the present disclosure were found to be generally higher across bins, for both sample MIX01 and sample MIX02, than those determined by bisulfite sequencing.
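The calculation of methylation levels in 5-Mb bins might be sketched as follows, as an example only. The input format (chromosome, position, methylation call) and the function name `binned_levels` are hypothetical choices for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_SIZE = 5_000_000  # 5 Mb


def binned_levels(
    sites: Iterable[Tuple[str, int, bool]], bin_size: int = BIN_SIZE
) -> Dict[Tuple[str, int], float]:
    """Methylation level (%) per (chromosome, bin index).

    Each site is (chromosome, 0-based position, is_methylated).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [methylated, total]
    for chrom, pos, is_methylated in sites:
        key = (chrom, pos // bin_size)
        counts[key][0] += int(is_methylated)
        counts[key][1] += 1
    return {key: 100.0 * m / n for key, (m, n) in counts.items()}


# Toy CpG calls for illustration only.
toy_sites = [
    ("chr1", 100, True),
    ("chr1", 4_999_999, False),
    ("chr1", 5_000_000, True),
]
levels = binned_levels(toy_sites)
```

Bins containing few CpG sites yield noisier levels, so in practice a minimum site count per bin may be worth enforcing.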

FIG. 121A shows the methylation levels in 5-Mb bins for the human part and the mouse part of the human-mouse hybrid DNA fragments for MIX01. FIG. 121B shows the methylation levels in 5-Mb bins for the human part and the mouse part of the human-mouse hybrid DNA fragments for MIX02. In both figures, the y-axis shows the methylation level in percent. The x-axis shows bisulfite sequencing and Pacific Biosciences sequencing for each of the human-part DNA and the mouse-part DNA.

Both FIG. 121A and FIG. 121B show higher methylation levels with Pacific Biosciences sequencing than with bisulfite sequencing. This increase is similar to the increase in methylation levels with Pacific Biosciences sequencing seen for the human-only DNA and mouse-only DNA in FIGS. 120A and 120B. The greater variability in methylation levels across 5-Mb bins in the bisulfite sequencing results for the hybrid fragments is likely due to the smaller number of CpG sites available for analysis.

FIGS. 122A and 122B are representative graphs showing the methylation status of single human-mouse hybrid molecules. FIG. 122A shows a human-mouse hybrid fragment in sample MIX01. FIG. 122B shows a human-mouse hybrid fragment in sample MIX02. Filled circles indicate methylated sites, and unfilled circles indicate unmethylated sites. The methylation status of these fragments was determined according to embodiments described herein.

As shown in FIG. 122A, the human part of the hybrid molecule from sample MIX01 was determined to be more methylated, whereas the mouse DNA part was determined to be more hypomethylated. Conversely, FIG. 122B shows that the human part of the hybrid molecule from sample MIX02 was determined to be more hypomethylated, whereas the mouse DNA part was determined to be more methylated.

These results demonstrate that embodiments of the present disclosure make it possible to determine changes in methylation within a single DNA molecule that has different methylation patterns in different parts of the molecule. In one embodiment, the methylation status of a gene or other genomic region can be measured where different parts of the gene or genomic region exhibit different methylation states (e.g., promoter versus gene body). In another embodiment, the methods presented herein, which can detect human-mouse hybrid fragments, provide a general approach for detecting DNA molecules containing fragments that are not contiguous with respect to a reference genome (i.e., chimeric molecules) and for analyzing their methylation status. For example, this approach can be used to analyze, without limitation, gene fusions, genomic rearrangements, translocations, inversions, duplications, structural variations, viral DNA integration, meiotic recombination, and the like.

In some embodiments, these hybrid fragments may be enriched before sequencing using probe-based hybridization methods, or using CRISPR-Cas systems or variants thereof for target DNA enrichment. It was recently reported that a CRISPR-associated transposase from the cyanobacterium Scytonema hofmanni can insert DNA segments into a region near a target site of interest (Strecker et al. Science. 2019;365:48-53). The CRISPR-associated transposase may function in a manner similar to Tn7-mediated transposition. In one embodiment, this CRISPR-associated transposase can be adapted to insert a comment sequence, labeled for example with biotin, into one or more genomic regions of interest as guided by gRNA. The comment sequences can then be captured using, for example, streptavidin-coated magnetic beads, thereby simultaneously pulling down the target DNA sequences for sequencing and methylation analysis according to embodiments of the present disclosure.

In some embodiments, the fragments may be enriched using a restriction enzyme, which may include any restriction enzyme disclosed herein.

C. Example Methods for Detecting Chimeric Molecules
FIG. 123 shows a method 1230 of detecting chimeric molecules in a biological sample. A chimeric molecule may include sequences from two different genes, chromosomes, organelles (e.g., mitochondria, nucleus, chloroplast), organisms (mammals, bacteria, viruses, etc.), and/or species. Method 1230 may be applied to each of a plurality of DNA molecules from the biological sample. In some embodiments, the plurality of DNA molecules may be cellular DNA. In other embodiments, the plurality of DNA molecules may be cell-free DNA molecules from the plasma of a pregnant woman.

At block 1232, single-molecule sequencing of a DNA molecule may be performed to obtain a sequence read that provides a methylation status at each of N sites. N is 5 or more, including 5 to 10, 10 to 15, 15 to 20, or more than 20. The methylation statuses of the sequence read may form a methylation pattern. The DNA molecule may be one of a plurality of DNA molecules, and method 1230 may be performed on the plurality of DNA molecules. The methylation pattern can take various forms. For example, the pattern may be N (e.g., 2, 3, 4, etc.) methylated sites followed by N unmethylated sites, or vice versa. Such a change in methylation may indicate a junction. The number of consecutive methylated sites may differ from the number of consecutive unmethylated sites.

At block 1234, the methylation pattern may be slid over one or more reference patterns corresponding to chimeric molecules that have two portions from two parts of a reference human genome. A reference pattern can act as a filter for identifying a matching pattern that indicates a junction. The number of sites matching the reference pattern (i.e., the number of sites whose methylation status matches the reference pattern) can be tracked so as to find the matching position that corresponds to the maximum number of matching sites. The two parts of the reference human genome may be non-contiguous parts of the reference human genome. The two parts of the reference human genome may be separated by 1 kb, 5 kb, 10 kb, 100 kb, 1 Mb, 5 Mb, 10 Mb, or more. The two parts may be from two different chromosome arms or chromosomes. The one or more reference patterns may include a change between a methylated state and an unmethylated state.

At block 1236, a matching position may be identified between the methylation pattern and a first reference pattern of the one or more reference patterns. The matching position can identify the junction between the two parts of the reference human genome in the sequence read. The matching position may correspond to the maximum of an overlap function between the reference pattern and the methylation pattern. The overlap function may use multiple reference patterns. The output may be the maximum of an aggregate function (i.e., each reference pattern contributes to the output value) or a single maximum identified across the reference patterns.
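The sliding comparison of blocks 1234 and 1236 might be sketched as follows, as an example only. The sketch assumes methylation statuses are encoded as 1 (methylated) and 0 (unmethylated), uses a simple count of matching sites as the overlap function, and returns a single maximum identified across the reference patterns; the function names are hypothetical and the aggregate-function variant is not shown.

```python
from typing import List, Optional, Tuple


def overlap(pattern: List[int], ref: List[int], offset: int) -> int:
    """Number of sites at which the read pattern matches the reference pattern."""
    return sum(1 for i, state in enumerate(ref) if pattern[offset + i] == state)


def best_match(
    pattern: List[int], refs: List[List[int]]
) -> Optional[Tuple[int, int, int]]:
    """Return (score, offset, reference index) with the highest overlap.

    Returns None if every reference pattern is longer than the read pattern.
    Ties keep the first position found.
    """
    best = None
    for ref_id, ref in enumerate(refs):
        for offset in range(len(pattern) - len(ref) + 1):
            score = overlap(pattern, ref, offset)
            if best is None or score > best[0]:
                best = (score, offset, ref_id)
    return best


def junction_site(ref: List[int], offset: int) -> int:
    """Index (in the read pattern) of the first site after the state change."""
    for i in range(1, len(ref)):
        if ref[i] != ref[i - 1]:
            return offset + i
    raise ValueError("reference pattern contains no state change")


# Toy read: four methylated sites followed by five unmethylated sites.
read_pattern = [1, 1, 1, 1, 0, 0, 0, 0, 0]
references = [[1, 1, 0, 0], [0, 0, 1, 1]]
score, offset, ref_id = best_match(read_pattern, references)
junction = junction_site(references[ref_id], offset)
```

Here the junction is reported as a CpG-site index; converting it to a base coordinate in the read would use the known positions of the CpG sites.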

At block 1238, the junction may be output as the location of a gene fusion in the chimeric molecule. The location of the gene fusion can be compared with reference locations of gene fusions for various disorders or diseases, including cancer. The organism from which the biological sample was obtained can be treated for the disorder or disease.

The matching position can be output to an alignment function. The location of the gene fusion can then be refined. Refining the location of the gene fusion may include aligning a first portion of the sequence read to a first part of the reference human genome. The first portion may be before the junction. Refining the location of the gene fusion may include aligning a second portion of the sequence read to a second part of the reference human genome. The second portion may be after the junction. The first part of the reference human genome may be at least 1 kb away from the second part of the reference human genome. For example, the first part and the second part of the reference human genome may be 1.0 to 1.5 kb, 1.5 to 2.0 kb, 2.0 to 2.5 kb, 2.5 to 3.0 kb, 3 to 5 kb, or 5 kb or more apart.
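The refinement step might be sketched as follows, as an example only. The sketch splits a read at a candidate junction and checks whether the two resulting alignments are non-contiguous (different chromosomes, or at least 1 kb apart). The alignment itself is assumed to be performed by an external aligner; the `Alignment` record and the function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Alignment(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def split_at_junction(read: str, junction: int) -> Tuple[str, str]:
    """Split a read into the portion before and the portion after the junction."""
    return read[:junction], read[junction:]


def is_noncontiguous(first: Alignment, second: Alignment, min_gap: int = 1000) -> bool:
    """True if the two aligned parts lie on different chromosomes or >= min_gap apart."""
    if first.chrom != second.chrom:
        return True
    # Distance between the closest ends of the two intervals (negative if they overlap).
    gap = max(first.start, second.start) - min(first.end, second.end)
    return gap >= min_gap


# Toy example with made-up coordinates.
before, after = split_at_junction("AAAACCCC", 4)
same_chrom_far = is_noncontiguous(Alignment("chr1", 100, 200), Alignment("chr1", 1500, 1600))
same_chrom_near = is_noncontiguous(Alignment("chr1", 100, 200), Alignment("chr1", 250, 350))
```

The `min_gap` default of 1000 mirrors the 1-kb separation mentioned above; other thresholds listed in the text could be passed instead.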

The junctions of multiple chimeric molecules can be compared with one another to confirm the location of a gene fusion.
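One simple way to compare junctions across molecules is sketched below as an example only. Junction coordinates are grouped into fixed-width bins and a junction is retained when at least `min_support` molecules agree; the binning scheme, the parameter names, and the coordinates are assumptions of this sketch (the coordinates are made up). Fixed-width binning can separate two nearby junctions that fall on either side of a bin boundary, so a distance-based clustering may be preferable in practice.

```python
from collections import Counter
from typing import Dict, List, Tuple

Junction = Tuple[str, int, str, int]  # (chrom A, position A, chrom B, position B)


def supported_junctions(
    junctions: List[Junction], tolerance: int = 10, min_support: int = 2
) -> Dict[Tuple[str, int, str, int], int]:
    """Group junctions into bins of width `tolerance` and keep well-supported bins."""
    counts = Counter(
        (chrom_a, pos_a // tolerance, chrom_b, pos_b // tolerance)
        for chrom_a, pos_a, chrom_b, pos_b in junctions
    )
    return {key: n for key, n in counts.items() if n >= min_support}


# Made-up junction calls from three chimeric molecules.
calls = [
    ("chr9", 133_000_005, "chr22", 23_000_002),
    ("chr9", 133_000_008, "chr22", 23_000_001),
    ("chr1", 500, "chr2", 900),
]
confirmed = supported_junctions(calls)
```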

VIII. Conclusion
The inventors have developed an efficient approach for predicting the level of base modifications (e.g., methylation) of nucleic acids at single-base resolution. This new approach implements a new scheme that simultaneously captures the polymerase kinetics surrounding the base under investigation, the sequence context, and the strand information. This new transformation of the kinetics makes it possible to identify and model subtle interruptions occurring in the kinetic pulses. Compared with previous methods that used IPD alone, the new approach presented in this patent application greatly improves the resolution and accuracy of methylation analysis. The new scheme can readily be extended to other purposes, for example the detection of 5hmC (5-hydroxymethylcytosine), 5fC (5-formylcytosine), 5caC (5-carboxylcytosine), 4mC (4-methylcytosine), 6mA (N6-methyladenine), 8oxoG (7,8-dihydro-8-oxoguanine), 8oxoA (7,8-dihydro-8-oxoadenine), and other forms of base modification as well as DNA damage. In another embodiment, the new scheme (e.g., a kinetic transformation similar to the 2D digital matrix presented in this application) can be used for base modification analysis with nanopore sequencing systems.

This implementation of methylation detection can be used on nucleic acid samples from different sources, for example cellular nucleic acids, nucleic acids from environmental sampling (e.g., cellular contaminants), nucleic acids from pathogens (e.g., bacteria and fungi), and cfDNA in the plasma of pregnant women. This would open up many new possibilities for genomic research and molecular diagnostics, such as non-invasive prenatal testing, cancer detection, and transplantation monitoring. For cfDNA-based non-invasive prenatal diagnosis, this new invention makes it possible to use the copy number aberrations, sizes, mutations, fragment ends, and base modifications of each molecule simultaneously in diagnosis, without PCR or experimental conversion prior to sequencing, thereby improving sensitivity. An imbalance in methylation levels between haplotypes can be detected using the methods described herein. Such an imbalance may indicate the origin of a DNA molecule (e.g., extracted from a disorder, such as cancer cells isolated from the blood of a cancer patient) or a disorder.

IX. Example Systems
FIG. 124 shows a measurement system 12400 according to an embodiment of the present invention. The system as shown includes a sample 12405, such as DNA molecules, within a sample holder 12410, where the sample 12405 can be contacted with an assay 12408 to provide a signal of a physical characteristic 12415. An example of a sample holder is a flow cell that includes probes and/or primers of an assay, or a tube through which a droplet moves (with the droplet including the assay). The physical characteristic 12415 (e.g., a fluorescence intensity, a voltage, or a current) from the sample is detected by a detector 12420. The detector 12420 can take measurements at intervals (e.g., periodic intervals) to obtain the data points that make up a data signal. In one embodiment, an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times. The sample holder 12410 and the detector 12420 can form an assay device, e.g., a sequencing device that performs sequencing according to embodiments described herein. A data signal 12425 is sent from the detector 12420 to a logic system 12403. The data signal 12425 may be stored in a local memory 12435, an external memory 12404, or a storage device 12445.

The logic system 12403 may be, or may include, a computer system, an ASIC, a microprocessor, or the like. It may also include or be coupled with a display (e.g., a monitor or an LED display) and a user input device (e.g., a mouse, keyboard, or buttons). The logic system 12403 and the other components may be part of a stand-alone or network-connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes the detector 12420 and/or the sample holder 12410. The logic system 12403 may also include software that executes on a processor 12405. The logic system 12403 may include a computer-readable medium storing instructions for controlling the system 12400 to perform any of the methods described herein. For example, the logic system 12403 can provide commands to a system that includes the sample holder 12410 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order. Such physical operations may be performed by a robotic system, e.g., one including a robotic arm, as may be used to obtain a sample and perform an assay.

Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG. 125 in computer system 10. In some embodiments, a computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system can include multiple computer apparatuses, each being a subsystem, with internal components. A computer system can include desktop and laptop computers, tablets, mobile phones, and cloud-based systems.

The subsystems shown in FIG. 125 are interconnected via a system bus 75. Additional subsystems are shown, such as a printer 74, a keyboard 78, storage device(s) 79, and a monitor 76 (e.g., a display screen, such as an LED display) coupled to a display adapter 82. Peripherals and input/output (I/O) devices, which couple to an I/O controller 71, can be connected to the computer system by any number of means known in the art, such as an input/output (I/O) port 77 (e.g., USB, FireWire®). For example, the I/O port 77 or an external interface 81 (e.g., Ethernet, Wi-Fi) can be used to connect the computer system 10 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via the system bus 75 allows the central processor 73 to communicate with each subsystem and to control the execution of a plurality of instructions from the system memory 72 or the storage device(s) 79 (e.g., a fixed disk, such as a hard drive, or an optical disk), as well as the exchange of information between subsystems. The system memory 72 and/or the storage device(s) 79 may embody a computer-readable medium. Another subsystem is a data collection device 85, such as a camera, a microphone, an accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.

A computer system can include a plurality of the same components or subsystems, e.g., connected together by the external interface 81, by an internal interface, or via removable storage devices that can be connected to one component and moved to another. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of the same computer system. A client and a server can each include multiple systems, subsystems, or components.

Aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application-specific integrated circuit or a field-programmable gate array) and/or using computer software with a general-purpose programmable processor in a modular or integrated manner. As used herein, a processor can include a single-core processor, a multi-core processor on the same integrated chip, or multiple processing units on a single circuit board or networked together, as well as dedicated hardware. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.

Any of the software components or functions described in this application may be implemented as software code to be executed by a processing device using any suitable computer language, such as Java, C, C++, C#, Objective-C, or Swift, or a scripting language such as Perl or Python, using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer-readable medium for storage and/or transmission. A suitable non-transitory computer-readable medium can include random access memory (RAM), read-only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, an optical medium such as a compact disc (CD), DVD (digital versatile disc), or Blu-ray disc, flash memory, and the like. The computer-readable medium may be any combination of such storage or transmission devices.

Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer-readable medium may be created using a data signal encoded with such programs. Computer-readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer-readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), or on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.

Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, the steps of the methods herein can be performed at the same time, at different times, or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.

The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the invention. However, other embodiments of the invention may be directed to specific embodiments relating to each individual aspect, or to specific combinations of these individual aspects.

The above description of example embodiments of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form described, and many modifications and variations are possible in light of the teaching above.

A recitation of "a," "an," or "the" is intended to mean "one or more" unless specifically indicated to the contrary. The use of "or" is intended to mean an "inclusive or," and not an "exclusive or," unless specifically indicated to the contrary. Reference to a "first" component does not necessarily require that a second component be provided. Moreover, reference to a "first" or a "second" component does not limit the referenced component to a particular location unless expressly stated. The term "based on" is intended to mean "based at least in part on."

All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes. None is admitted to be prior art.
Reference Albert, T.; J. et al. (2007) Direct selection of human genomic loci by microarray hybridization. Nat. Methods, 4, 903-905.
Beckmann et al. (2014) Detecting epigenetic motifs in low coverage and metagenomics settings. BMC Bioinformatics, 15 (Suppl 9): S16.
Beaulaurier, J.; et al. (2019) Deciphering bacterial epigenomes using modern sequencing technologies. Nature Reviews Genetics, 20:157-172.
Blow, M. J. et al. (2016) The Epigenomic Landscape of Prokaryotes. PLOS Genet. , 12, e1005854.
Breiman, L.; (2001) Random Forests. Mach. Learn. , 45, 5-32.
Chan, K. C. A. et al. (2013) Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proc. Natl. Acad. Sci. U.S.A. S. A. , 110, 18761-8.
Clark, T. A. et al. (2013) Enhanced 5-methylcytosine detection in single-molecule, real-time sequencing via Tet1 oxidation. BMC Biol. , 11, 4.
Clark, T. A. et al. (2012) Characterization of DNA methyltransferase specifications using single-molecule, real-time DNA sequencing. Nucleic Acids Res. , 40:e29.
Eid,J. et al. (2009) Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133-138.
Feinberg, A.; P. and Irizarry, R.I. A. (2010) Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc. Natl. Acad. Sci. , 107, 1757-1764.
Feng, Z.; et al. (2013) Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetics. PLoS Comput Biol. , 9: e1002935.
Flusberg, B.; A. et al. (2010) Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods, 7, 461-465.
Frommer, M.; et al. (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc. Natl. Acad. Sci. , 89, 1827-1831.
Gai, W.; et al. (2018) Liver- and colon-specific DNA methylation markers in plasma for investigation of color cancers with or without liver metastases. Clin. Chem. , 64, 1239-1249.
Gouil, Q. et al. (2019) Latest techniques to study DNA methylation. Essays Biochem. 63(6):639-648.
Grunau, C.; (2001) Bisulfite genomic sequencing: systematic investigation of critical experimental parameters. Nucleic Acids Res. , 29, 65e-65.
Herman, J.; G. et al. (1996) Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc. Natl. Acad. Sci. U.S.A. S. A. , 93, 9821-9826.
Jiang, P.; et al. (2014) Methy-Pipe: An Integrated Bioinformatics Pipeline for Whole Genome Bisulfite Sequencing Data Analysis. PLoS One, 9, e100360.
LeCun, Y.; et al. (1989) Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. , 1, 541-551.
Lee, E. -J. et al. (2011) Targeted bisulfite sequencing by solution hybrid selection and massively parallel sequencing. Nucleic Acids Res. , 39, e127-e127.
Lehmann-Werman, R.; et al. (2016) Identification of tissue-specific cell death using methylation patterns of circulating DNA. Proc. Natl. Acad. Sci. , 113, E1826-E1834.
Lister, R. et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462, 315-322.
Liu, Q. et al. (2019) Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nature Commun. , 10, 2449.
Liu, Y.; et al. (2019) Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. , 37, 424-429.
Lun, F. M. F. et al. (2013) Noninvasive prenatal methylomic analysis by genomewide bisulfite sequencing of maternal plasma DNA. Clin. Chem. , 59, 1583-1594.
Nattestad, M.; et al. (2018) Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. , 28, 1126-1135.
Ng, A. Y. (2004) Feature selection, _L1 vs. L ₂ regularization, and rotational invariance. In, Twenty-first International Conference on Machine Learning-ICML '04. ACM Press, New York, New York, USA, p. 78.
Ni, P. et al. (2019) Deep Signal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics, 35, 4586-4595
Okou, D. T. et al. (2007) Microarray-based genomic selection for high-throughput resequencing. Nat. Methods, 4, 907-909.
Olova, N.; et al. (2018) Comparison of whole-genome bisulfite sequencing library preparation strategies identifying sources of biases affecting DNA methylation data. Genome Biol. , 19, 33.
Robertson, K.; D. (2005) DNA methylation and human disease. Nat. Rev. Genet. , 6, 597-610.
Smith, Z.; D. and Meissner, A.; (2013) DNA methylation: roles in mammalian development. Nat. Rev. Genet. , 14, 204-20.
Schadt, E. E. et al. (2013) Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases. Genome Res. , 23(1):129-41.
Sun, K. et al. (2015) Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl. Acad. Sci. , 112, E5503-E5512.
Suzuki, Y.; et al. (2016) AgIn: Measuring the landscape of CpG methylation of individual repetitive elements. Bioinformatics, 32, 2911-2919.
Watson, C.E. M. et al. (2019) Cas9-based enrichment and single-molecule sequencing for precision characterization of genomic duplications. Lab. Investig, 100, 135-146.
Zhang, W.; et al. (2015) Predicting genome-wide DNA methylation using methylation marks, genomic positions, and DNA regulatory elements. Genome Biol. , 16, 14.

Claims

1. A method for detecting a modification of a nucleotide in a nucleic acid molecule, the method comprising:
(a) receiving data obtained by measuring pulses of an optical signal corresponding to nucleotides sequenced in a sample nucleic acid molecule, and obtaining from the data, for each nucleotide, values for the following properties:
an identity of the nucleotide;
a position of the nucleotide within the sample nucleic acid molecule;
a width of the pulse corresponding to the nucleotide; and
a pulse interval representing the time between the pulse corresponding to the nucleotide and a pulse corresponding to a neighboring nucleotide;
(b) creating an input data structure, the input data structure comprising a window of the nucleotides sequenced in the sample nucleic acid molecule, wherein the input data structure includes, for each nucleotide within the window, the following properties:
the identity of the nucleotide;
a position of the nucleotide relative to a target position within the window;
the width of the pulse corresponding to the nucleotide; and
the pulse interval;
(c) inputting the input data structure into a model, the model trained by:
receiving a first plurality of first data structures, each first data structure of the first plurality of first data structures corresponding to a respective window of nucleotides sequenced in a respective first nucleic acid molecule of a plurality of first nucleic acid molecules, wherein each of the first nucleic acid molecules is sequenced by measuring pulses of the optical signal corresponding to the nucleotides, wherein the modification has a known first state at the nucleotide at the target position in each window of each first nucleic acid molecule, and wherein each first data structure includes values for the same properties as the input data structure;
storing a plurality of first training samples, each including one of the first plurality of first data structures and a first label indicating the first state of the nucleotide at the target position; and
optimizing, using the plurality of first training samples, parameters of the model based on outputs of the model matching or not matching the corresponding first labels when the first plurality of first data structures is input to the model, wherein an output of the model specifies whether the nucleotide at the target position in the respective window has the modification; and
(d) determining, using the model, whether the modification is present at the nucleotide at the target position within the window of the input data structure.
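
For illustration only, the input data structure of steps (a) and (b) above can be sketched in Python as a list of per-nucleotide rows centered on a target position. The class and function names, the `flank` parameter, and the toy values are hypothetical and not taken from the specification; the pulse interval is the quantity often called the interpulse duration in single-molecule, real-time sequencing.

```python
from dataclasses import dataclass


@dataclass
class Nucleotide:
    base: str              # identity of the nucleotide: "A", "C", "G", or "T"
    position: int          # position within the sample nucleic acid molecule
    pulse_width: float     # width of the pulse for this nucleotide
    pulse_interval: float  # time between this pulse and the neighboring pulse


def build_window(nucleotides, target_index, flank):
    """Return one row per nucleotide in the window centered on target_index.

    Each row is (base, position relative to the target, pulse width,
    pulse interval). Returns None when the window would run past either
    end of the molecule.
    """
    start, stop = target_index - flank, target_index + flank + 1
    if start < 0 or stop > len(nucleotides):
        return None
    target_pos = nucleotides[target_index].position
    return [
        (n.base, n.position - target_pos, n.pulse_width, n.pulse_interval)
        for n in nucleotides[start:stop]
    ]


# Toy read with made-up kinetic values (hypothetical data).
widths = [0.10, 0.12, 0.30, 0.11, 0.09]
intervals = [0.5, 0.4, 1.2, 0.6, 0.5]
read = [Nucleotide(b, i, w, t) for i, (b, w, t) in enumerate(zip("ACGTA", widths, intervals))]
window = build_window(read, target_index=2, flank=1)
```

A window built this way can be flattened into a numeric vector or kept as a matrix, depending on the model that consumes it.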

2. The method of claim 1, wherein:
the input data structure is one of a plurality of input data structures,
the sample nucleic acid molecule is one of a plurality of sample nucleic acid molecules,
the plurality of sample nucleic acid molecules is obtained from a biological sample of a subject, and
each input data structure corresponds to a respective window of nucleotides sequenced in a respective sample nucleic acid molecule of the plurality of sample nucleic acid molecules,
the method further comprising:
receiving the plurality of input data structures;
inputting the plurality of input data structures into the model; and
determining, using the model, whether the modification is present at the nucleotide at the target position in the respective window of each input data structure.

3. The method of claim 2, further comprising:
determining that the modification is present at one or more nucleotides; and
determining a classification of a disorder using the presence of the modification at the one or more nucleotides.

4. The method of claim 3, wherein the disorder comprises cancer.

5. The method of claim 3 or 4, further comprising determining that the classification of the disorder is that the subject has the disorder.

6. The method of any one of claims 3-5, wherein the classification of the disorder is determined using the number of modifications or the sites of modification.

7. The method of any one of claims 1-6, wherein the modification is methylation.

8. The methylation is 4mC (N4-methylcytosine), 5mC (5-methylcytosine), 5hmC (5-hydroxymethylcytosine), 5fC (5-formylcytosine), 5caC (5-carboxylcytosine), 1mA (N1-methyladenine), 3mA (N3-methyladenine), 6mA (N6-methyladenine), 7mA (N7-methyladenine), 3mC (N3-methylcytosine), 2mG (N2-methylguanine), 6mG (O6-methylguanine), 7mG (N7-methylguanine), 3mT (N3-methylthymine), or 4mT (O4-methylthymine).

9. The method of claim 7, wherein the methylation is 5mC.

10. The method of claim 7, wherein the methylation is 6mA.

11. The method of claim 2, wherein the modification is methylation, the method further comprising:
determining a methylation status of one or more nucleotides, the methylation status indicating whether the modification is present at the one or more nucleotides; and
determining a clinically relevant DNA fraction, a fetal methylation profile, a maternal methylation profile, the presence of an imprinted gene region, or a tissue of origin using the methylation status of the one or more nucleotides.

12. The method of claim 11, wherein:
the method comprises determining the tissue of origin, and
determining the tissue of origin comprises determining whether the sample nucleic acid molecule is of fetal or maternal origin.

13. The method of claim 12, wherein determining whether the sample nucleic acid molecule is of fetal or maternal origin comprises:
determining a methylation level of the sample nucleic acid molecule using the methylation status of the one or more nucleotides; and
comparing the methylation level of the sample nucleic acid molecule to a reference value.

14. The method of claim 13, wherein the reference value is determined from methylation levels of one or more maternal nucleic acid molecules.

15. The method of claim 13, wherein:
comparing the methylation level of the sample nucleic acid molecule to the reference value comprises determining that the methylation level of the sample nucleic acid molecule is lower than the reference value, and
determining whether the sample nucleic acid molecule is of fetal or maternal origin comprises using the comparison to determine whether the sample nucleic acid molecule is of fetal origin.
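
The comparison of a single molecule's methylation level with a reference value can be pictured with a short Python sketch. It assumes that each site on the molecule already has a binary call from the model (1 for methylated, 0 for unmethylated) and that a level below the reference points to fetal origin, as in the case recited above. The function names and the reference value of 0.6 are hypothetical.

```python
def methylation_level(site_calls):
    """Fraction of sites on one molecule that were called methylated."""
    if not site_calls:
        raise ValueError("molecule has no callable sites")
    return sum(site_calls) / len(site_calls)


def classify_origin(site_calls, reference_value):
    """Label a molecule 'fetal' when its methylation level is below the reference."""
    level = methylation_level(site_calls)
    return "fetal" if level < reference_value else "maternal"


# Hypothetical reference; in practice it could be derived from the
# methylation levels of molecules known to be maternal.
REFERENCE = 0.6
low_molecule = [1, 0, 0, 0]   # level 0.25
high_molecule = [1, 1, 1, 0]  # level 0.75
origins = [classify_origin(m, REFERENCE) for m in (low_molecule, high_molecule)]
```

A single hard threshold is the simplest possible rule; the claims do not restrict how the reference value is chosen beyond what is stated.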

16. The method of claim 2, wherein the modification is methylation, the method further comprising:
identifying each sample nucleic acid molecule of the plurality of sample nucleic acid molecules as aligning to a region of a genome;
determining, using the model, a methylation status of one or more nucleotides of each sample nucleic acid molecule of the plurality of sample nucleic acid molecules, the methylation status indicating whether the modification is present;
determining a methylation level of the region of the genome using the methylation statuses of the one or more nucleotides of the plurality of sample nucleic acid molecules; and
determining whether a copy number aberration is present in the region of the genome using the methylation level.

17. The method of claim 16, further comprising comparing the methylation level of the region to a reference level, wherein determining whether a copy number aberration is present in the region of the genome comprises using the comparison.

18. The method of claim 17, wherein the reference level is determined using regions without copy number aberrations of the same type.
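
As a rough sketch of the region-level comparison recited above, the Python below pools per-site methylation calls from all molecules aligned to a region and flags the region when its level departs from a reference level. The function names and the tolerance of 0.05 are assumptions for illustration; the claims state only that the comparison is used, not how a deviation is scored.

```python
def region_methylation_level(molecules):
    """Pool binary per-site calls from every molecule aligned to one region."""
    methylated = sum(sum(calls) for calls in molecules)
    total = sum(len(calls) for calls in molecules)
    if total == 0:
        raise ValueError("no callable sites in region")
    return methylated / total


def deviates_from_reference(test_level, reference_level, tolerance=0.05):
    """True when the region's level differs from the reference by more than tolerance."""
    return abs(test_level - reference_level) > tolerance


# Hypothetical calls for two regions (1 = methylated, 0 = unmethylated).
region_a = [[1, 1, 0, 1], [1, 0, 1, 1]]  # level 0.75
region_b = [[1, 0, 0, 0], [0, 1, 0, 0]]  # level 0.25
reference_level = 0.75                   # e.g., from regions without aberrations
flags = [
    deviates_from_reference(region_methylation_level(r), reference_level)
    for r in (region_a, region_b)
]
```

In practice a statistical test over many regions would likely replace the fixed tolerance.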

19. The method of any one of claims 16-18, wherein the region is a chromosome and the subject is a female subject pregnant with a fetus, the method further comprising:
determining that a copy number aberration is present; and
determining that the fetus has a chromosomal aneuploidy.

20. The method of any one of claims 2-19, wherein each sample nucleic acid molecule of the plurality of sample nucleic acid molecules has a size greater than a cutoff size.

21. The method of any one of claims 1-12, wherein the nucleotides within the window are determined using a circular consensus sequence and without aligning the sequenced nucleotides to a reference genome.

22. The method of any one of claims 1-12, wherein the nucleotides within the window are determined without using a circular consensus sequence and without aligning the sequenced nucleotides to a reference genome.

23. The method of any one of claims 2-12, wherein:
the plurality of sample nucleic acid molecules aligns to a plurality of genomic regions,
for each genomic region of the plurality of genomic regions, a number of sample nucleic acid molecules aligns to the genomic region, and
the number of sample nucleic acid molecules is greater than a cutoff number.
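
The size cutoff and the per-region molecule-count cutoff recited above amount to two simple filters. The following Python sketch shows one way to apply them; the dictionary keys, region labels, and cutoff values are hypothetical.

```python
from collections import Counter


def filter_by_size(molecules, cutoff_size):
    """Keep molecules whose size is greater than cutoff_size."""
    return [m for m in molecules if m["size"] > cutoff_size]


def regions_above_cutoff(molecules, cutoff_number):
    """Return the regions to which more than cutoff_number molecules align."""
    counts = Counter(m["region"] for m in molecules)
    return {region for region, n in counts.items() if n > cutoff_number}


# Toy alignment summary (hypothetical sizes in bases and region labels).
molecules = [
    {"region": "region_1", "size": 600},
    {"region": "region_1", "size": 150},
    {"region": "region_1", "size": 900},
    {"region": "region_2", "size": 700},
]
long_molecules = filter_by_size(molecules, cutoff_size=500)
covered = regions_above_cutoff(long_molecules, cutoff_number=1)
```

Here the size filter runs first, so only long molecules count toward a region's coverage; the claims do not prescribe that ordering.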

24. The method of any one of claims 1-23, wherein the model comprises a machine learning model, a principal component analysis, a convolutional neural network, or a logistic regression.
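
Of the model types listed above, logistic regression is the simplest to show end to end. The sketch below, written in plain Python, optimizes the parameters on labeled training samples by batch gradient descent and then calls the state of a target position. The two-feature vectors (pulse width and pulse interval at the target), the learning rate, the epoch count, and all values are made up for illustration and are not the training setup of the specification.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=1000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, labels):
            # error is the model output minus the known label
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict(weights, bias, x):
    """Return True when the output calls the target position modified."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Hypothetical training samples: label 1 marks a known modified target.
X = [[0.10, 0.40], [0.12, 0.50], [0.11, 0.45],
     [0.30, 1.20], [0.28, 1.10], [0.31, 1.30]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(X, y)
calls = [predict(weights, bias, x) for x in X]
```

A convolutional neural network would take the full window matrix instead of a flattened vector, but the training loop has the same shape: compare outputs with labels and adjust parameters.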

25. The method of any one of claims 1 to 24, wherein:
said window of nucleotides corresponding to said input data structure comprises nucleotides on a first strand of said sample nucleic acid molecule and nucleotides on a second strand of said sample nucleic acid molecule; and
the input data structure further includes, for each nucleotide within the window, a strand characteristic value indicating whether the nucleotide is present on the first strand or the second strand.
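
One way to picture the per-nucleotide strand value described in the claim above is the sketch below. The field names, the 0/1 strand encoding, and the kinetic fields (pulse width and interpulse duration) are assumptions made for illustration. The claim itself only requires that each nucleotide in the window carry a value identifying its strand.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

FIRST_STRAND = 0   # assumed encoding
SECOND_STRAND = 1  # assumed encoding

# (base, pulse width, interpulse duration): hypothetical measurement tuple
Measurement = Tuple[str, float, float]


@dataclass(frozen=True)
class NucleotideRecord:
    base: str                   # nucleotide identity
    offset: int                 # position relative to the target position
    pulse_width: float          # assumed kinetic feature
    interpulse_duration: float  # assumed kinetic feature
    strand: int                 # strand characteristic value


def build_input_structure(
    first: Sequence[Measurement],
    second: Sequence[Measurement],
    target_index: int,
) -> List[NucleotideRecord]:
    """Flatten both strands of one window into a single list of records."""
    records = []
    for strand, measurements in ((FIRST_STRAND, first), (SECOND_STRAND, second)):
        for i, (base, pw, ipd) in enumerate(measurements):
            records.append(NucleotideRecord(base, i - target_index, pw, ipd, strand))
    return records


# Made-up values for a three-nucleotide window centred on index 1.
first_strand = [("A", 1.0, 2.0), ("C", 1.5, 0.5), ("G", 0.8, 1.2)]
second_strand = [("T", 0.9, 1.1), ("G", 1.2, 0.7), ("C", 1.0, 1.0)]
window = build_input_structure(first_strand, second_strand, target_index=1)
```

Keeping the strand as an explicit field lets a downstream model see the same window positions from both strands without confusing their signals.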

26. The method of claim 25, wherein the sample nucleic acid molecule is a circular DNA molecule formed by:
cleaving a double-stranded DNA molecule using a Cas9 complex to form a cleaved double-stranded DNA molecule; and
ligating hairpin adapters to the ends of the cleaved double-stranded DNA molecule.

27. The method of any one of claims 1 to 26, wherein each nucleotide within said window is enriched or filtered.

28. Each nucleotide in said window is:
enriched by cleaving a double-stranded DNA molecule using a Cas9 complex to form a cleaved double-stranded DNA molecule and ligating hairpin adapters to the ends of said cleaved double-stranded DNA molecule; or
filtered by selecting double-stranded DNA molecules having sizes in a size range.

29. The method of any one of claims 1-28, wherein the optical signal is a fluorescent signal from a dye-labeled nucleotide.

30. The method of any one of claims 1-29, wherein each window associated with said first plurality of first data structures comprises at least 4 contiguous nucleotides on the first strand of each first nucleic acid molecule.

31. The method of claim 1, further comprising using the presence of methylation to detect the tissue origin of said sample nucleic acid molecules or to identify chimeric and hybrid DNA, wherein said sample nucleic acid molecules are obtained from said subject.

32. The method of any one of claims 1-31, wherein at least some of the plurality of first nucleic acid molecules each comprise a first location corresponding to a first reference sequence and a second location corresponding to a second reference sequence different from the first reference sequence.

33. The method of any one of claims 1-32, further comprising validating the model using a plurality of chimeric nucleic acid molecules, each having a first location corresponding to the first reference sequence and a second location corresponding to the second reference sequence, said first location having a first methylation pattern and said second location having a second methylation pattern.

34. The method of claim 32 or claim 33, wherein said first location is treated with a methylase.

35. The method of claim 34, wherein said second location corresponds to an unmethylated location of said second reference sequence.

36. The method of claim 32 or claim 33, wherein the first reference sequence is human and the second reference sequence is from a different animal.

37. The method of any one of claims 1-36, wherein said window comprises at least 3 nucleotides upstream of the target position within said window.

38. The method of any one of claims 1-30, further comprising sequencing said sample nucleic acid molecules.

39. The method of claim 38, wherein sequencing sample nucleic acid molecules comprises measuring pulses of light signals corresponding to nucleotides in said sample nucleic acid molecules.

40. A computer-readable medium storing a plurality of instructions that, when executed by a computer system, cause the computer system to perform the method of any one of claims 1-30.

41. A computer system, comprising:
at least one storage device;
a plurality of instructions stored in the at least one storage device; and
at least one processor programmed with at least some of the plurality of instructions to perform the method of any one of claims 1-30.