JP7046007B2

JP7046007B2 - Methods for adjusting molecular label counts

Info

Publication number: JP7046007B2
Application number: JP2018561218A
Authority: JP
Inventors: ジェエファン，; ジェニファーツァイ，; エリーンシャム，; リシャデン，; グレンケー．フー，
Original assignee: Becton Dickinson and Co
Current assignee: Becton Dickinson and Co
Priority date: 2016-05-26
Filing date: 2017-05-25
Publication date: 2022-04-01
Anticipated expiration: 2037-05-25
Also published as: JP2019522268A; US20170344866A1; WO2017205691A1; US11397882B2; EP3465502B1; EP3465502A1; US20230065324A1; CN109074430A; CN109074430B

Description

Related Applications
This application claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application No. 62/342137, filed May 26, 2016; U.S. Provisional Patent Application No. 62/381945, filed August 31, 2016; and U.S. Provisional Patent Application No. 62/401720, filed September 29, 2016. The content of each of these applications is hereby expressly incorporated by reference in its entirety.

The present disclosure relates generally to the field of nucleic acid barcoding and, more specifically, to the correction of PCR and sequencing errors using molecular labels.

Description of the Related Art
Methods and techniques such as stochastic barcoding are useful in cell analysis, in particular for deciphering gene expression profiles to determine the state of cells using, for example, reverse transcription, polymerase chain reaction (PCR) amplification, and next-generation sequencing (NGS). However, these methods and techniques can introduce errors, such as substitution errors (involving one or more bases) and non-substitution errors, which, if left uncorrected, can result in overestimated molecular counts. There is therefore a need for methods and techniques that can correct various errors in order to obtain accurate molecular counts estimated using stochastic barcoding.

Disclosed herein are methods for determining the number of targets. In some embodiments, the method comprises: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining a quality status of the target in the sequencing data obtained in (b); (iii) determining one or more sequencing data errors in the sequencing data obtained in (b), wherein determining the one or more sequencing data errors comprises determining one or more of the following: the number of molecular labels with distinct sequences associated with the target in the sequencing data, the quality status of the target in the sequencing data, and the number of molecular labels with distinct sequences in the plurality of stochastic barcodes; and (iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the one or more sequencing data errors determined in (iii). Steps (i), (ii), (iii), and (iv) can be performed for each of the plurality of targets. The method can be multiplexed.
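Counting step (i) can be illustrated with a minimal Python sketch. It assumes the sequencing reads have already been reduced to (target, molecular label) pairs; the function name `count_molecular_labels`, the gene names, and the 8-base labels are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict


def count_molecular_labels(reads):
    """Tally molecular labels per target.

    `reads` is an iterable of (target, molecular_label) pairs, one per read
    (an assumed, simplified input format).
    Returns (occurrences, distinct): occurrences[target][label] is the number
    of reads carrying that label, and distinct[target] is the number of
    molecular labels with distinct sequences seen for that target.
    """
    occurrences = defaultdict(Counter)
    for target, label in reads:
        occurrences[target][label] += 1
    distinct = {target: len(labels) for target, labels in occurrences.items()}
    return occurrences, distinct


# Hypothetical reads: two targets, 8-base molecular labels.
reads = [
    ("GENE_A", "ACGTACGT"),
    ("GENE_A", "ACGTACGT"),
    ("GENE_A", "TTGCAAGC"),
    ("GENE_B", "GGGTCCAA"),
]
occurrences, distinct = count_molecular_labels(reads)
```

The per-label read counts in `occurrences` correspond to what the text calls the occurrence of a molecular label; the later adjustment steps operate on these counts.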

In some embodiments, the method further comprises collapsing the sequencing data obtained in (b) before determining the one or more sequencing data errors. Collapsing the sequencing data obtained in (b) comprises attributing copies of a target that have similar molecular labels and an occurrence smaller than a predetermined collapsing occurrence threshold as having the same molecular label for the plurality of targets, wherein two copies of a target have similar molecular labels if the sequences of the molecular labels of the two copies differ by at least one base.

In some embodiments, the predetermined collapsing occurrence threshold can be 7 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined collapsing occurrence threshold can be 17 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences. Two copies of a target have similar molecular labels if the sequences of the molecular labels of the two copies differ by at least one base. In some embodiments, the molecular label comprises 5-20 nucleotides. The molecular labels of different stochastic barcodes can differ from one another. The plurality of stochastic barcodes comprises about 6561 molecular labels with distinct sequences. The plurality of stochastic barcodes comprises about 65536 molecular labels with distinct sequences.

In some embodiments, the sequencing data comprises sequences of the plurality of targets with read lengths of 50 nucleotides or more. The sequencing data comprises sequences of the plurality of targets with read lengths of 75 nucleotides or more. The sequencing data comprises sequences of the plurality of targets with read lengths of 100 nucleotides or more. The sequencing data obtained in (b) can be generated by performing polymerase chain reaction (PCR) amplification on the plurality of stochastically barcoded targets.

In some embodiments, the one or more sequencing data errors can be PCR-introduced errors, sequencing-introduced errors, errors attributable to barcode contamination, library preparation errors, or any combination thereof. PCR-introduced errors can be the result of PCR amplification errors, PCR amplification bias, insufficient PCR amplification, or any combination thereof. Sequencing-introduced errors can be the result of inaccurate base calling, insufficient sequencing, or any combination thereof.

In some embodiments, the quality status of the target in the sequencing data can be complete sequencing, incomplete sequencing, or saturated sequencing. The quality status of the target in the sequencing data can be determined from the number of molecular labels with distinct sequences in the plurality of stochastic barcodes and the counted number of molecular labels with distinct sequences associated with the target in the sequencing data. The quality status of the target in the sequencing data obtained in (b) can be classified as incomplete sequencing if it is neither complete sequencing nor saturated sequencing.

In some embodiments, the complete sequencing quality status is determined by a dispersion index, relative to a Poisson distribution, that is greater than or equal to a predetermined complete sequencing dispersion threshold, wherein the predetermined complete sequencing dispersion threshold can be 0.9, 1, or 4. The complete sequencing quality status can further be determined by a molecular label having an occurrence greater than or equal to a predetermined complete sequencing occurrence threshold in the sequencing data obtained in (b), wherein the predetermined complete sequencing occurrence threshold can be 10 or 18.
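The dispersion criterion can be sketched as follows. The sketch assumes the dispersion index is the variance-to-mean ratio of the per-label occurrences (which equals 1 for a Poisson distribution) and that the two criteria are combined with a logical AND; neither detail is stated in this passage, and the function names are hypothetical.

```python
import statistics


def dispersion_index(counts):
    """Variance-to-mean ratio of label occurrences (1.0 for a Poisson distribution).

    Uses the sample variance; returns 0.0 when fewer than two counts are given
    or when the mean is zero (a convention assumed here).
    """
    if len(counts) < 2:
        return 0.0
    mean = statistics.mean(counts)
    if mean == 0:
        return 0.0
    return statistics.variance(counts) / mean


def meets_complete_sequencing_criteria(counts, dispersion_threshold=1.0,
                                       occurrence_threshold=10):
    """Assumed combination: dispersion index at or above the dispersion
    threshold AND at least one label at or above the occurrence threshold."""
    return (dispersion_index(counts) >= dispersion_threshold
            and max(counts, default=0) >= occurrence_threshold)


# Hypothetical occurrences for one target: a few abundant labels plus
# low-count labels that may be errors.
deep_target = [12, 1, 1, 14, 2, 11]
shallow_target = [2, 2, 2, 2]
```

With the default thresholds (1 and 10, both taken from the values listed above), `deep_target` satisfies the criteria and `shallow_target` does not.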

In some embodiments, the saturated sequencing quality status can be determined by the target having a number of molecular labels with distinct sequences greater than a predetermined saturation threshold. The saturated sequencing quality status can further be determined by another target of the plurality of targets having a number of molecular labels with distinct sequences greater than the predetermined saturation threshold. The predetermined saturation threshold can be 6557 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined saturation threshold can be 65532 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences.
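The three-way classification can be sketched in a few lines. The saturation thresholds are the values quoted above; checking saturation first, and passing the complete-sequencing decision in as a precomputed flag, are assumptions of this sketch, and the names are hypothetical.

```python
# Saturation thresholds quoted in the text, keyed by the number of distinct
# molecular labels in the stochastic barcode pool.
SATURATION_THRESHOLDS = {6561: 6557, 65536: 65532}


def quality_status(n_distinct_labels, pool_size, is_complete):
    """Classify a target as 'saturated', 'complete', or 'incomplete'.

    `is_complete` is the outcome of the complete-sequencing criteria, computed
    elsewhere. A target that is neither saturated nor complete is incomplete.
    """
    if n_distinct_labels > SATURATION_THRESHOLDS[pool_size]:
        return "saturated"
    return "complete" if is_complete else "incomplete"
```

For example, a target with 6560 distinct labels drawn from a pool of 6561 is classified as saturated, while a target with exactly 6557 is not, because the comparison is strict.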

In some embodiments, if the target has the complete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) is adjusted in (iv) by: determining all child molecular labels for one or more parent molecular labels; performing a first statistical analysis on at least one child molecular label and its parent molecular label; and attributing the occurrence of the child molecular label to the parent molecular label if the null hypothesis of the first statistical analysis is accepted.

In some embodiments, the one or more parent molecular labels comprise molecular labels with occurrences greater than or equal to a predetermined complete sequencing parent threshold, wherein the predetermined complete sequencing parent threshold equals the predetermined complete sequencing occurrence threshold. The child molecular labels comprise molecular labels that differ from the parent molecular label by one base and have occurrences less than or equal to a predetermined complete sequencing child threshold, wherein the predetermined complete sequencing child threshold can be 3 or 5. The null hypothesis of the first statistical analysis can be accepted if the probability of the null hypothesis being true is below a false discovery rate, wherein the false discovery rate is 5% or 10%. The first statistical analysis can be a multiple binomial test.
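The parent and child definitions above can be sketched as a candidate search. The sketch covers only the selection of candidate pairs, using a parent threshold of 10 and a child threshold of 3 from the listed values; the binomial test that decides whether a child is merged into its parent is not shown, because its exact form is not given in this passage. Function names are hypothetical, and labels are assumed to have equal length.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length labels."""
    if len(a) != len(b):
        raise ValueError("molecular labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def candidate_children(occurrences, parent_threshold=10, child_threshold=3):
    """Map each parent label to its candidate child labels.

    A parent has an occurrence >= parent_threshold. A child differs from the
    parent by exactly one base and has an occurrence <= child_threshold.
    """
    parents = [label for label, n in occurrences.items() if n >= parent_threshold]
    return {
        parent: [
            label for label, n in occurrences.items()
            if n <= child_threshold and hamming(parent, label) == 1
        ]
        for parent in parents
    }


# Hypothetical occurrences for one target with 4-base labels.
occ = {"ACGT": 12, "ACGA": 2, "TCGT": 1, "GGGG": 1, "ACGG": 7}
pairs = candidate_children(occ)
```

Here only `ACGT` qualifies as a parent; `ACGG` is one base away but occurs too often to be a child, and `GGGG` is rare but differs by more than one base.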

In some embodiments, if the target has the complete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) is adjusted in (iv) by thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a second statistical analysis on the molecular labels of the target.

In some embodiments, performing the second statistical analysis comprises: fitting the distribution of the molecular labels of the target and their occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data obtained in (b), wherein the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the n-th most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the n-th most abundant molecular label. The two Poisson distributions comprise a first Poisson distribution corresponding to the true molecular labels and a second Poisson distribution corresponding to the false molecular labels.
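One way to realize this thresholding is a two-component Poisson mixture fitted by expectation-maximization (EM). The fitting procedure is not specified in this passage, so EM, the initial values, and the rule "n is the number of labels more likely to belong to the high-rate component" are all assumptions of this sketch; function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k events at rate lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_two_poissons(counts, iterations=200):
    """EM fit of a two-Poisson mixture.

    Returns, for each count, the posterior probability of belonging to the
    high-rate ("true label") component.
    """
    lam_hi = float(max(counts))            # assumed starting rates
    lam_lo = max(float(min(counts)), 0.5)
    weight_hi = 0.5
    resp = [0.5] * len(counts)
    for _ in range(iterations):
        resp = []
        for k in counts:
            p_hi = weight_hi * poisson_pmf(k, lam_hi)
            p_lo = (1.0 - weight_hi) * poisson_pmf(k, lam_lo)
            total = p_hi + p_lo
            resp.append(p_hi / total if total > 0 else 0.5)
        mass_hi = sum(resp)
        mass_lo = len(counts) - mass_hi
        if mass_hi <= 0 or mass_lo <= 0:
            break                          # one component is empty; stop
        weight_hi = mass_hi / len(counts)
        lam_hi = max(sum(r * k for r, k in zip(resp, counts)) / mass_hi, 1e-9)
        lam_lo = max(sum((1 - r) * k for r, k in zip(resp, counts)) / mass_lo, 1e-9)
    return resp


def threshold_labels(occurrences):
    """Keep labels at least as abundant as the n-th most abundant label."""
    labels = sorted(occurrences, key=occurrences.get, reverse=True)
    counts = [occurrences[label] for label in labels]
    n = sum(1 for r in fit_two_poissons(counts) if r > 0.5)
    if n == 0:
        return {}, 0
    cutoff = counts[n - 1]
    return {label: c for label, c in occurrences.items() if c >= cutoff}, n


# Hypothetical target: four abundant labels and four low-count labels.
occ = {"AAAA": 20, "AAAC": 22, "AAAG": 19, "AAAT": 21,
       "CCCC": 1, "CCCA": 1, "CCCG": 2, "CCCT": 1}
true_labels, n = threshold_labels(occ)
```

On this toy input the two components separate cleanly, so the four abundant labels are retained and the four low-count labels are treated as false.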

In some embodiments, if the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) can be adjusted in (iv) by: determining whether the target is noisy in the sequencing data obtained in (b); and removing noisy targets from the sequencing data obtained in (b). A target can be noisy if the occurrence of its molecular labels is less than or equal to an incomplete sequencing noisy target threshold, wherein the incomplete sequencing noisy target threshold is 5. The incomplete sequencing noisy target threshold can equal the median or mean occurrence of the molecular labels of the plurality of targets that have the complete sequencing quality status.

In some embodiments, if the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) can be adjusted in (iv) by thresholding the molecular labels of the target to determine true molecular labels and false molecular labels in the sequencing data obtained in (b).

In some embodiments, thresholding the molecular labels of the target comprises performing a third statistical analysis on the molecular labels. Performing the third statistical analysis on the molecular labels comprises: determining the number n of true molecular labels using a zero-truncated Poisson model; and removing the false molecular labels from the sequencing data obtained in (b), wherein the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the n-th most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the n-th most abundant molecular label.
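A zero-truncated Poisson model describes counts that are observed only when they are at least 1, which is the situation for molecular labels that appear in sequencing data. The passage does not say how n is derived from the model, so the following sketch shows only the maximum-likelihood fit of the rate: it solves mean = λ / (1 − e^(−λ)) by bisection. The function name and the example counts are hypothetical.

```python
import math


def fit_zero_truncated_poisson(counts):
    """Maximum-likelihood rate of a zero-truncated Poisson distribution.

    Solves mean = lam / (1 - exp(-lam)) for lam by bisection. The left side
    of that equation increases with lam and is always greater than lam, so
    the root lies between 0 and the sample mean.
    """
    if not counts or any(k < 1 for k in counts):
        raise ValueError("counts must be non-empty and all >= 1")
    mean = sum(counts) / len(counts)
    if mean <= 1.0:
        return 0.0  # every label was seen exactly once; the rate tends to 0
    lo, hi = 1e-12, mean
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical occurrences of the labels observed for one target.
observed = [1, 2, 3, 3, 2, 4, 1, 2]
lam = fit_zero_truncated_poisson(observed)
```

Under the fitted model, `math.exp(-lam)` is the probability that a label goes unobserved, which is the quantity a truncated model adds over a plain Poisson fit.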

In some embodiments, after the sequencing data counted in (i) is adjusted according to the one or more sequencing data errors determined in (iii), at least 50% or 80% of the molecular labels in the sequencing data obtained in (b) can be retained.

In some embodiments, stochastically barcoding the plurality of targets comprises hybridizing the plurality of stochastic barcodes with the plurality of targets to generate the stochastically barcoded targets. Stochastically barcoding the plurality of targets comprises generating an indexed library of the stochastically barcoded targets. Generating the indexed library of the stochastically barcoded targets can be performed with a solid support comprising the plurality of stochastic barcodes. The solid support comprises a plurality of synthetic particles associated with the plurality of stochastic barcodes. The solid support comprises the plurality of stochastic barcodes in two dimensions or three dimensions. The solid support comprises a polymer, a matrix, a hydrogel, a needle array device, an antibody, or any combination thereof.

In some embodiments, each of the plurality of stochastic barcodes comprises one or more of a sample label, a universal label, and a cell label, wherein the sample label can be the same for the plurality of stochastic barcodes on the solid support, the universal label can be the same for the plurality of stochastic barcodes on the solid support, and the cell label can be the same for the plurality of stochastic barcodes on the solid support. The sample label comprises 5-20 nucleotides. The universal label comprises 5-20 nucleotides. The cell label comprises 5-20 nucleotides.

In some embodiments, the synthetic particles can be beads. The beads can be silica gel beads, controlled pore glass beads, magnetic beads, Dynabeads, Sephadex/Sepharose beads, cellulose beads, polystyrene beads, or any combination thereof.

In some embodiments, the plurality of targets can be contained in a sample. The sample comprises one or more cells. The sample can be a single cell. The one or more cells comprise one or more cell types. At least one of the one or more cell types is a brain cell, a heart cell, a cancer cell, a circulating tumor cell, an organ cell, an epithelial cell, a metastatic cell, a benign cell, a primary cell, a circulating cell, or any combination thereof.

In some embodiments, the plurality of targets comprises ribonucleic acids (RNAs), messenger RNAs (mRNAs), microRNAs, small interfering RNAs (siRNAs), RNA degradation products, RNAs each comprising a poly(A) tail, or any combination thereof.

In some embodiments, the method can further comprise lysing the one or more cells. Lysing the one or more cells comprises heating the sample, contacting the sample with a detergent, changing the pH of the sample, or any combination thereof.

Disclosed herein are methods for determining the number of targets. In some embodiments, the method comprises: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii); and (iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after the collapsing of the sequencing data in (ii). The plurality of targets comprises targets of the whole transcriptome of a cell.

In some embodiments, the molecular labels of the target within a cluster are within a predetermined directional adjacency threshold of one another. The directional adjacency threshold is a Hamming distance of one. The molecular labels of the target within the cluster comprise one or more parent molecular labels and child molecular labels of the one or more parent molecular labels, wherein the occurrence of a parent molecular label is greater than or equal to a predetermined directional adjacency occurrence threshold. The predetermined directional adjacency occurrence threshold can be 2 × (the occurrence of the child molecular label) - 1.
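The directional adjacency rule can be sketched as a greedy clustering. The Hamming distance of one and the occurrence rule (parent occurrence ≥ 2 × child occurrence − 1) come from the text; visiting labels from most to least abundant, following chains of children breadth-first, and crediting every cluster to its most abundant label are assumptions of this sketch. Function names are hypothetical, and labels are assumed to have equal length.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatched positions between two equal-length labels."""
    if len(a) != len(b):
        raise ValueError("molecular labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def directional_clusters(occurrences):
    """Group labels into clusters; each cluster lists its root label first."""
    # Most abundant first; ties broken alphabetically for reproducibility.
    order = sorted(occurrences, key=lambda label: (-occurrences[label], label))
    assigned = set()
    clusters = []
    for root in order:
        if root in assigned:
            continue
        assigned.add(root)
        cluster = [root]
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for label in order:
                if label in assigned:
                    continue
                is_adjacent = hamming(parent, label) == 1
                is_directional = occurrences[parent] >= 2 * occurrences[label] - 1
                if is_adjacent and is_directional:
                    assigned.add(label)
                    cluster.append(label)
                    queue.append(label)
        clusters.append(cluster)
    return clusters


def collapse(occurrences):
    """Attribute the occurrences of every cluster member to the cluster root."""
    return {
        cluster[0]: sum(occurrences[label] for label in cluster)
        for cluster in directional_clusters(occurrences)
    }


# Hypothetical occurrences for one target with 4-base labels.
occ = {"ACGT": 10, "ACGA": 3, "ACCA": 1, "TTTT": 4, "TTTA": 4}
collapsed = collapse(occ)
```

In this example `ACGA` and `ACCA` collapse into `ACGT`, while `TTTT` and `TTTA` stay separate: they are one base apart, but neither reaches the 2 × 4 − 1 = 7 occurrences needed to act as the other's parent. The adjusted molecular label count for the target is therefore 3.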

In some embodiments, collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii) comprises attributing the occurrence of a child molecular label to its parent molecular label.

In some embodiments, the method can further comprise determining a sequencing depth of the target. Estimating the number of the target comprises adjusting the sequencing data counted in (i) if the sequencing depth of the target exceeds a predetermined sequencing depth threshold. The predetermined sequencing depth threshold can be between 15 and 20. Adjusting the sequencing data counted in (i) comprises thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a statistical analysis on the molecular labels of the target. Performing the statistical analysis comprises: fitting the distribution of the molecular labels of the target and their occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data obtained in (b), wherein the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the n-th most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the n-th most abundant molecular label.

Disclosed herein are computer systems for determining the number of targets. In some embodiments, the computer system comprises: a computer-readable memory storing executable instructions; and one or more computer processors in communication with the computer-readable memory, wherein the one or more computer processors are programmed by the executable instructions to perform: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining a quality status of the target in the sequencing data obtained in (b); (iii) determining one or more sequencing data errors in the sequencing data obtained in (b), wherein determining the one or more sequencing data errors comprises determining one or more of the following: the number of molecular labels with distinct sequences associated with the target in the sequencing data, the quality status of the target in the sequencing data, and the number of molecular labels with distinct sequences in the plurality of stochastic barcodes; and (iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the one or more sequencing data errors determined in (iii). Steps (i), (ii), (iii), and (iv) can be performed for each of the plurality of targets. Steps (a), (b), (c), (i), (ii), (iii), and (iv) can be multiplexed.

In some embodiments, the executable instructions can further program the one or more computer processors to perform collapsing the sequencing data obtained in (b) before determining the one or more sequencing data errors. Collapsing the sequencing data obtained in (b) comprises attributing copies of a target that have similar molecular labels and an occurrence smaller than a predetermined collapsing occurrence threshold as having the same molecular label for the plurality of targets, wherein two copies of a target have similar molecular labels if the sequences of the molecular labels of the two copies differ by at least one base.

In some embodiments, the predetermined collapsing occurrence threshold can be 7 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined collapsing occurrence threshold can be 17 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences. Two copies of a target have similar molecular labels if the sequences of the molecular labels of the two copies differ by at least one base. In some embodiments, the molecular label comprises 5-20 nucleotides. The molecular labels of different stochastic barcodes can differ from one another. The plurality of stochastic barcodes comprises about 6561 molecular labels with distinct sequences. The plurality of stochastic barcodes comprises about 65536 molecular labels with distinct sequences.

In some embodiments, the executable instructions can further program the one or more computer processors to determine that the quality status of the target in the sequencing data is complete sequencing, incomplete sequencing, or saturated sequencing. The quality status of the target in the sequencing data can be determined by the number of molecular labels with distinct sequences in the plurality of stochastic barcodes and the number of molecular labels with distinct sequences associated with the target in the sequencing data counted. The quality status of the target in the sequencing data can be classified as incomplete sequencing if the quality status of the target in the sequencing data obtained in (b) is neither complete sequencing nor saturated sequencing.

In some embodiments, the executable instructions can further program the one or more computer processors to determine the complete sequencing quality status by a dispersion index, relative to a Poisson distribution, that is greater than or equal to a predetermined complete sequencing dispersion threshold, where the predetermined complete sequencing dispersion threshold can be 0.9, 1, or 4. The complete sequencing quality status can further be determined by a molecular label having an occurrence greater than or equal to a predetermined complete sequencing occurrence threshold in the sequencing data obtained in (b), where the predetermined complete sequencing occurrence threshold can be 10 or 18.
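As a non-limiting illustration, the dispersion test described above can be sketched with the usual index of dispersion (variance divided by mean), which equals 1 for a Poisson distribution. The function names, the use of the population variance, and the decision to require both criteria at once are assumptions of this sketch.

```python
from statistics import mean, pvariance


def dispersion_index(occurrences):
    """Variance-to-mean ratio of the molecular label occurrences."""
    return pvariance(occurrences) / mean(occurrences)


def looks_complete(occurrences, dispersion_threshold=1.0, occurrence_threshold=10):
    """Assumed combination of the two criteria: the dispersion index reaches
    the dispersion threshold and at least one label reaches the occurrence
    threshold."""
    return (
        dispersion_index(occurrences) >= dispersion_threshold
        and max(occurrences) >= occurrence_threshold
    )


# Hypothetical occurrences of the molecular labels of two targets.
deep_target = [12, 1, 1, 14, 2, 13]
shallow_target = [3, 3, 3, 3]
```

The default values 1.0 and 10 are taken from the example thresholds stated in the text; the other stated values (0.9 or 4, and 18) could be passed instead.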

In some embodiments, the executable instructions can further program the one or more computer processors to determine the saturated sequencing quality status by the target having a number of molecular labels with distinct sequences that is greater than a predetermined saturation threshold. The saturated sequencing quality status can further be determined by another target of the plurality of targets having a number of molecular labels with distinct sequences that is greater than the predetermined saturation threshold. The predetermined saturation threshold can be 6557 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined saturation threshold can be 65532 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences.
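For illustration, the saturation check and the three-way quality status can be sketched as below. The lookup table holds the two threshold pairs stated above; the function names and the order in which the statuses are tested are assumptions of this sketch.

```python
# Saturation thresholds stated in the text, keyed by the size of the
# molecular label pool.
SATURATION_THRESHOLDS = {6561: 6557, 65536: 65532}


def is_saturated(distinct_labels_observed, label_pool_size):
    """True when the target shows more distinct labels than the threshold."""
    return distinct_labels_observed > SATURATION_THRESHOLDS[label_pool_size]


def quality_status(distinct_labels_observed, label_pool_size, looks_complete):
    """Assumed ordering: test saturation first, then completeness; anything
    that is neither is classified as incomplete."""
    if is_saturated(distinct_labels_observed, label_pool_size):
        return "saturated"
    if looks_complete:
        return "complete"
    return "incomplete"
```

Here `looks_complete` is a hypothetical flag supplied by whatever completeness test is in use.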

In some embodiments, the executable instructions can further program the one or more computer processors to adjust, in (iv), the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) by: if the target has the complete sequencing quality status, determining all children molecular labels for one or more parent molecular labels; performing a first statistical analysis on at least one child molecular label and the parent molecular label; and, if the null hypothesis of the first statistical analysis is accepted, attributing the occurrence of the child molecular label to the parent molecular label.

In some embodiments, the executable instructions can further program the one or more computer processors to adjust, in (iv), the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) by: if the target has the complete sequencing quality status, thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a second statistical analysis on the molecular labels of the target.

In some embodiments, the executable instructions can further program the one or more computer processors to perform the second statistical analysis by: fitting the distribution of the molecular labels of the target and their occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the nth most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the nth most abundant molecular label. The two Poisson distributions comprise a first Poisson distribution corresponding to the true molecular labels and a second Poisson distribution corresponding to the false molecular labels.
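One possible way to realize the two-Poisson fit described above is an expectation-maximization (EM) fit of a two-component Poisson mixture. The sketch below is offered only as an example under stated assumptions: the disclosure does not say how the fit is performed, and the function names, the initial guesses, and the rule of taking n as the number of labels more likely to belong to the high-rate component are all hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of observing k under a Poisson with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_two_poissons(occurrences, iterations=200):
    """EM fit of a two-component Poisson mixture. Returns the two rates and,
    for each label, the posterior probability of the high-rate component."""
    n_obs = len(occurrences)
    rate_true = float(max(occurrences))             # crude initial guesses
    rate_false = max(float(min(occurrences)), 0.5)
    weight_true = 0.5
    resp = [0.5] * n_obs
    for _ in range(iterations):
        # E-step: posterior probability that each label is "true".
        resp = []
        for k in occurrences:
            a = math.log(weight_true) + poisson_logpmf(k, rate_true)
            b = math.log(1.0 - weight_true) + poisson_logpmf(k, rate_false)
            top = max(a, b)
            pa, pb = math.exp(a - top), math.exp(b - top)
            resp.append(pa / (pa + pb))
        # M-step: re-estimate the mixing weight and both rates.
        total = min(max(sum(resp), 1e-9), n_obs - 1e-9)
        weight_true = total / n_obs
        rate_true = max(sum(r * k for r, k in zip(resp, occurrences)) / total, 1e-9)
        rate_false = max(
            sum((1 - r) * k for r, k in zip(resp, occurrences)) / (n_obs - total),
            1e-9,
        )
    return rate_true, rate_false, resp


def split_true_false(label_counts):
    """Return n and the labels kept as true: those whose occurrence is at
    least that of the nth most abundant label."""
    occurrences = sorted(label_counts.values(), reverse=True)
    _, _, resp = fit_two_poissons(occurrences)
    n = sum(r > 0.5 for r in resp)
    if n == 0:
        return 0, {}
    cutoff = occurrences[n - 1]
    return n, {l: c for l, c in label_counts.items() if c >= cutoff}


# Hypothetical occurrences: five abundant labels and six low-count labels.
example = {"L1": 50, "L2": 48, "L3": 52, "L4": 47, "L5": 51,
           "E1": 1, "E2": 2, "E3": 1, "E4": 1, "E5": 2, "E6": 1}
n_true, true_labels = split_true_false(example)
```

With clearly separated occurrences such as these, the cut falls between the two groups; for overlapping distributions the result depends on the initial guesses and should be treated with caution.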

In some embodiments, the executable instructions can further program the one or more computer processors to adjust, in (iv), the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) by: if the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, determining whether the target is noisy in the sequencing data obtained in (b); and removing the noisy target from the sequencing data obtained in (b). A target can be noisy if the occurrence of the molecular labels of the target is less than or equal to an incomplete sequencing noisy target threshold, where the incomplete sequencing noisy gene threshold is 5. The incomplete sequencing noisy target threshold can be equal to the median or mean occurrence of the molecular labels of a plurality of targets having the complete sequencing quality status.
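Purely as an illustration, the noisy-target filter described above might look like the following. The text does not state which per-target statistic is compared with the threshold; this sketch assumes the mean occurrence of the target's molecular labels, and all function and variable names are hypothetical.

```python
from statistics import mean, median


def threshold_from_complete_targets(complete_targets, use_median=True):
    """Median (or mean) occurrence over the molecular labels of all targets
    that have the complete sequencing quality status."""
    pooled = [c for occurrences in complete_targets.values() for c in occurrences]
    return median(pooled) if use_median else mean(pooled)


def remove_noisy_targets(incomplete_targets, threshold=5):
    """Keep only targets whose mean label occurrence exceeds the threshold
    (assumed statistic); targets at or below the threshold are dropped."""
    return {
        target: occurrences
        for target, occurrences in incomplete_targets.items()
        if mean(occurrences) > threshold
    }


# Hypothetical data: label occurrences per target.
incomplete = {"GENE_A": [1, 2, 1], "GENE_B": [9, 12, 7]}
complete = {"GENE_X": [10, 12, 14], "GENE_Y": [8, 20]}
kept = remove_noisy_targets(incomplete, threshold=5)
data_driven = threshold_from_complete_targets(complete)
```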

In some embodiments, the executable instructions can further program the one or more computer processors to adjust, in (iv), the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) by: if the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, thresholding the molecular labels of the target to determine true molecular labels and false molecular labels in the sequencing data obtained in (b).

In some embodiments, the executable instructions can further program the one or more computer processors to threshold the molecular labels of the target by performing a third statistical analysis on the molecular labels. Performing the third statistical analysis on the molecular labels comprises: determining the number n of true molecular labels using a zero-truncated Poisson model; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the nth most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the nth most abundant molecular label.
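The passage above names a zero-truncated Poisson model without saying how it is fitted or how n is derived from it. As a hedged sketch of the fitting part only, the maximum-likelihood rate of a zero-truncated Poisson satisfies mean = λ / (1 − e^(−λ)), which can be solved numerically. The function names and the bisection solver are choices made for this example.

```python
import math


def ztp_mean(lam):
    """Mean of a zero-truncated Poisson distribution with rate lam."""
    return lam / -math.expm1(-lam)


def fit_ztp_rate(occurrences, tol=1e-10):
    """Maximum-likelihood rate of a zero-truncated Poisson, found by
    bisection on mean = lam / (1 - exp(-lam))."""
    sample_mean = sum(occurrences) / len(occurrences)
    if sample_mean <= 1:
        raise ValueError("sample mean must exceed 1 for a positive rate")
    lo, hi = 1e-12, sample_mean   # the rate is always below the sample mean
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ztp_mean(mid) < sample_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical occurrences of the molecular labels of one target.
occurrences = [1, 2, 3, 2, 4, 1, 3]
rate = fit_ztp_rate(occurrences)
unseen_fraction = math.exp(-rate)  # share of labels expected at zero reads
```

The fitted rate also gives exp(−λ), the share of molecules expected to go unobserved at this depth, which is one quantity such a model makes available.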

In some embodiments, at least 50% or 80% of the molecular labels in the sequencing data obtained in (b) can be retained after the sequencing data counted in (i) is adjusted according to the one or more sequencing data errors determined in (iii).

Disclosed herein are computer systems for determining the number of targets. In some embodiments, the computer system comprises: a computer-readable memory storing executable instructions; and one or more computer processors in communication with the computer-readable memory, where the one or more computer processors are programmed by the executable instructions to perform: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, where each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of the molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (b) using the clusters of the molecular labels of the target identified in (ii); and (iv) estimating the number of the target, where the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii). The plurality of targets comprises targets of the whole transcriptome of a cell.
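Steps (ii) and (iii) above can be sketched, as an example only, with a directional adjacency rule of the kind used in common molecular label deduplication tools: a more abundant label absorbs a label one mismatch away when its occurrence is at least twice the other's occurrence minus one. That specific count rule, the Hamming distance of 1, and every name in the sketch are assumptions and are not stated in this passage.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length labels."""
    return sum(x != y for x, y in zip(a, b))


def directional_clusters(counts):
    """Group labels into clusters, each rooted at its most abundant label.
    An edge parent -> child exists when the labels differ at one position
    and count(parent) >= 2 * count(child) - 1 (assumed rule)."""
    order = sorted(counts, key=lambda l: (-counts[l], l))
    assigned = set()
    clusters = {}
    for root in order:
        if root in assigned:
            continue
        assigned.add(root)
        members, frontier = [root], [root]
        while frontier:
            node = frontier.pop()
            for other in order:
                if other in assigned or len(other) != len(node):
                    continue
                if hamming(node, other) == 1 and counts[node] >= 2 * counts[other] - 1:
                    assigned.add(other)
                    members.append(other)
                    frontier.append(other)
        clusters[root] = members
    return clusters


def collapse_by_cluster(counts):
    """Attribute every occurrence in a cluster to the cluster's root label."""
    return {
        root: sum(counts[m] for m in members)
        for root, members in directional_clusters(counts).items()
    }


# Hypothetical label occurrences for one target.
example = {"AAAA": 100, "AAAT": 10, "AATT": 2, "CCCC": 40}
collapsed = collapse_by_cluster(example)
```

After collapsing, the number of remaining root labels is the count that step (iv) would use as the basis for the estimate.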

In some embodiments, the executable instructions can further program the one or more computer processors to determine the sequencing depth of the target. Estimating the number of the target comprises adjusting the sequencing data counted in (i) if the sequencing depth of the target is above a predetermined sequencing depth threshold. The predetermined sequencing depth threshold can be 15 to 20. Adjusting the sequencing data counted in (i) comprises thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a statistical analysis on the molecular labels of the target. Performing the statistical analysis comprises: fitting the distribution of the molecular labels of the target and their occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the nth most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the nth most abundant molecular label.

Disclosed herein are one or more non-transitory computer-readable media comprising executable code that, when executed, causes one or more computing devices to determine the number of targets. In some embodiments, the executable code, when executed, causes the one or more computing devices to perform a process comprising: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, where each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining a quality status of the target in the sequencing data obtained in (b); (iii) determining one or more sequencing data errors in the sequencing data obtained in (b), where determining the one or more sequencing data errors in the sequencing data comprises determining one or more of: the number of molecular labels with distinct sequences associated with the target in the sequencing data, the quality status of the target in the sequencing data, and the number of molecular labels with distinct sequences in the plurality of stochastic barcodes; and (iv) estimating the number of the target, where the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the one or more sequencing data errors determined in (iii). Steps (i), (ii), (iii), and (iv) can be performed for each of the plurality of targets. The method can be multiplexed.

In some embodiments, the process further comprises collapsing the sequencing data obtained in (b) before determining the one or more sequencing data errors. Collapsing the sequencing data obtained in (b) comprises attributing copies of a target that have similar molecular labels, and that have an occurrence smaller than a predetermined collapsing occurrence threshold, as having the same molecular label for the plurality of targets, where two copies of a target have similar molecular labels if the sequences of the molecular labels of the two copies differ by at least one base.

In some embodiments, the sequencing data comprises sequences of the plurality of targets with read lengths of 50 nucleotides or more. The sequencing data comprises sequences of the plurality of targets with read lengths of 75 nucleotides or more. The sequencing data comprises sequences of the plurality of targets with read lengths of 100 nucleotides or more. The sequencing data obtained in (b) can be generated by performing polymerase chain reaction (PCR) amplification on the plurality of stochastically barcoded targets.

In some embodiments, the complete sequencing quality status is determined by a dispersion index, relative to a Poisson distribution, that is greater than or equal to a predetermined complete sequencing dispersion threshold, where the predetermined complete sequencing dispersion threshold can be 0.9, 1, or 4. The complete sequencing quality status can further be determined by a molecular label having an occurrence greater than or equal to a predetermined complete sequencing occurrence threshold in the sequencing data obtained in (b), where the predetermined complete sequencing occurrence threshold can be 10 or 18.

In some embodiments, the saturated sequencing quality status can be determined by the target having a number of molecular labels with distinct sequences that is greater than a predetermined saturation threshold. The saturated sequencing quality status can further be determined by another target of the plurality of targets having a number of molecular labels with distinct sequences that is greater than the predetermined saturation threshold. The predetermined saturation threshold can be 6557 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined saturation threshold can be 65532 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences.

In some embodiments, performing the second statistical analysis comprises: fitting the distribution of the molecular labels of the target and their occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the nth most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the nth most abundant molecular label. The two Poisson distributions comprise a first Poisson distribution corresponding to the true molecular labels and a second Poisson distribution corresponding to the false molecular labels.

In some embodiments, the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) can be adjusted in (iv) by: if the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, determining whether the target is noisy in the sequencing data obtained in (b); and removing the noisy target from the sequencing data obtained in (b). A target can be noisy if the occurrence of the molecular labels of the target is less than or equal to an incomplete sequencing noisy target threshold, where the incomplete sequencing noisy gene threshold is 5. The incomplete sequencing noisy target threshold can be equal to the median or mean occurrence of the molecular labels of a plurality of targets having the complete sequencing quality status.

In some embodiments, the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) can be adjusted in (iv) by: if the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, thresholding the molecular labels of the target to determine true molecular labels and false molecular labels in the sequencing data obtained in (b).

Disclosed herein are one or more non-transitory computer-readable media comprising executable code that, when executed, causes one or more computing devices to determine the number of targets. In some embodiments, the executable code, when executed, causes the one or more computing devices to perform a process comprising: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, where each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of the molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (b) using the clusters of the molecular labels of the target identified in (ii); and (iv) estimating the number of the target, where the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii). The plurality of targets comprises targets of the whole transcriptome of a cell.

In some embodiments, the method can further comprise determining the sequencing depth of the target. Estimating the number of the target comprises adjusting the sequencing data counted in (i) if the sequencing depth of the target is above a predetermined sequencing depth threshold. The predetermined sequencing depth threshold can be 15 to 20. Adjusting the sequencing data counted in (i) comprises thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a statistical analysis on the molecular labels of the target. Performing the statistical analysis comprises: fitting the distribution of the molecular labels of the target and their occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the nth most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the nth most abundant molecular label.

Disclosed herein are methods for correcting PCR or sequencing errors. In some embodiments, the method can comprise: (a) obtaining sequencing data of stochastically barcoded targets; and (b) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of the molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (a) using the clusters of the molecular labels of the target identified in (ii); and (iv) estimating the number of the target, where the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii). The plurality of targets comprises targets of the whole transcriptome of a cell. In some embodiments, the method can be used to determine the number of targets. The method can further comprise: (c) stochastically barcoding the plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets; and (d) sequencing the stochastically barcoded targets to generate the received sequencing data of the stochastically barcoded targets.

In some embodiments, the method further comprises determining the sequencing depth of the target. Estimating the number of the target comprises adjusting the sequencing data counted in (i) if the sequencing depth of the target is above a predetermined sequencing depth threshold. The predetermined sequencing depth threshold can be 15 to 20. Adjusting the sequencing data counted in (i) comprises thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a statistical analysis on the molecular labels of the target. Performing the statistical analysis comprises: fitting the distribution of the molecular labels of the target and their occurrences to two negative binomial distributions; determining the number n of true molecular labels using the two negative binomial distributions; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels comprise molecular labels with occurrences lower than the occurrence of the nth most abundant molecular label, and the true molecular labels comprise molecular labels with occurrences greater than or equal to the occurrence of the nth most abundant molecular label.

Disclosed herein are computer systems for determining the number of targets. In some embodiments, the computer system comprises: a computer-readable memory storing executable instructions; and one or more computer processors in communication with the computer-readable memory, wherein the one or more computer processors are programmed by the executable instructions to perform: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii); and (iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii). The plurality of targets comprises targets of the whole transcriptome of a cell.
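Steps (ii) and (iii) above can be illustrated with a short Python sketch. This section does not spell out the directional adjacency rule, so the sketch assumes one common formulation: a less abundant "child" label is folded into a "parent" label when the two sequences differ at exactly one position and the parent count is at least `ratio * child_count - 1`. The function names, the `ratio` parameter, and the example counts are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def collapse_directional(label_counts, ratio=2):
    """Collapse molecular labels into more abundant neighbours at Hamming distance 1.

    Assumed rule: child -> parent if hamming(parent, child) == 1 and
    count[parent] >= ratio * count[child] - 1.
    """
    # Rank labels from most to least abundant; ties are broken alphabetically.
    ranked = sorted(label_counts, key=lambda ml: (-label_counts[ml], ml))
    parent_of = {}
    for i, child in enumerate(ranked):
        # Only labels ranked above the child can be parents, so no cycles arise.
        for parent in ranked[:i]:
            if (hamming(parent, child) == 1
                    and label_counts[parent] >= ratio * label_counts[child] - 1):
                parent_of[child] = parent
                break
    collapsed = {}
    for ml in ranked:
        root = ml
        while root in parent_of:  # follow the chain up to the cluster's parent label
            root = parent_of[root]
        collapsed[root] = collapsed.get(root, 0) + label_counts[ml]
    return collapsed


# Hypothetical counts: ACGA and ACCA look like substitution errors derived from ACGT.
counts = {"ACGT": 100, "ACGA": 3, "ACCA": 1, "TTTT": 50}
collapsed = collapse_directional(counts)
```

After collapsing, the number of keys in `collapsed` is the number of distinct molecular labels that would feed into the estimate in step (iv).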

In some embodiments, the executable instructions can further program the one or more computer processors to determine the sequencing depth of the target. Estimating the number of the target comprises adjusting the sequencing data counted in (i) if the sequencing depth of the target exceeds a predetermined sequencing depth threshold. The predetermined sequencing depth threshold may be between 15 and 20. Adjusting the sequencing data counted in (i) comprises thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target comprises performing a statistical analysis on the molecular labels of the target. Performing the statistical analysis comprises: fitting the distribution of the molecular labels of the target and their numbers of occurrences to two negative binomial distributions; determining the number n of true molecular labels using the two negative binomial distributions; and removing the false molecular labels from the sequencing data obtained in (b), wherein the false molecular labels comprise molecular labels with numbers of occurrences lower than that of the n-th most abundant molecular label, and the true molecular labels comprise molecular labels with numbers of occurrences greater than or equal to that of the n-th most abundant molecular label.

Disclosed herein are methods for correcting PCR or sequencing errors. In some embodiments, the method comprises: (a) obtaining sequencing data of stochastically barcoded targets; and (b) for one or more of a plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data; and (iii) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the number of noise molecular labels determined in (ii). In some embodiments, the method further comprises determining the sequencing status of the target in the sequencing data. The sequencing status of the target in the sequencing data is saturated sequencing, under-sequencing, or over-sequencing. In some embodiments, the method can be used to determine the number of targets. The method can further comprise: (c) stochastically barcoding the plurality of targets using a plurality of stochastic barcodes to generate a plurality of stochastically barcoded targets; and (d) sequencing the stochastically barcoded targets to generate the sequencing data of the stochastically barcoded targets that is received.

In some embodiments, the saturated sequencing status is determined by the target having a number of molecular labels with distinct sequences greater than a predetermined saturation threshold. The predetermined saturation threshold is about 6557 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined saturation threshold is about 65532 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences. If the sequencing status of the target in the sequencing data is the saturated sequencing status, the number of noise molecular labels determined in (ii) is zero.

In some embodiments, the under-sequencing status can be determined by the target having a depth (for example, an average, minimum, or maximum depth) smaller than a predetermined under-sequencing threshold. The under-sequencing threshold is about 4. The under-sequencing threshold can be independent of the number of molecular labels with distinct sequences. If the sequencing status of the target in the sequencing data is the under-sequencing status, the number of noise molecular labels determined in (ii) is zero.

In some embodiments, the over-sequencing status is determined by the target having a number of molecular labels with distinct sequences greater than a predetermined over-sequencing threshold. For example, the over-sequencing threshold can be about 250 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The method comprises, if the sequencing status of the target in the sequencing data is the over-sequencing status, subsampling the number of molecular labels with distinct sequences associated with the target in the sequencing data to the predetermined over-sequencing threshold.
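The three sequencing statuses described above can be summarized in a small Python sketch. The default threshold values are the examples given in the text for a pool of about 6561 labels. Several things here are assumptions made only for illustration: the order in which the checks are applied, the rule "pool size minus 4" as a generalization of the two saturation examples (6557 of 6561, 65532 of 65536), the "normal" fallback status, uniform random subsampling, and the function names.

```python
import random


def sequencing_status(n_distinct_labels, depth, pool_size=6561,
                      under_threshold=4, over_threshold=250):
    """Classify a target as saturated, under-sequenced, over-sequenced, or normal."""
    # Assumed generalization of the two examples in the text.
    saturation_threshold = pool_size - 4
    if n_distinct_labels > saturation_threshold:
        return "saturated"
    if depth < under_threshold:
        return "under-sequenced"
    if n_distinct_labels > over_threshold:
        return "over-sequenced"
    return "normal"


def subsample_labels(label_counts, threshold, seed=0):
    """Randomly keep at most `threshold` distinct molecular labels."""
    if len(label_counts) <= threshold:
        return dict(label_counts)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = rng.sample(sorted(label_counts), threshold)
    return {ml: label_counts[ml] for ml in kept}


status = sequencing_status(n_distinct_labels=300, depth=10)
subsampled = subsample_labels({f"ML{i}": i + 1 for i in range(300)}, threshold=250)
```

In this sketch, a saturated or under-sequenced target would simply be assigned zero noise molecular labels, as stated in the preceding paragraphs.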

In some embodiments, determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data comprises, if a negative binomial distribution fitting condition is satisfied: (iv) fitting a signal negative binomial distribution to the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), wherein the signal negative binomial distribution corresponds to the number of molecular labels, counted in (i), that are signal molecular labels; (v) fitting a noise negative binomial distribution to the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), wherein the noise negative binomial distribution corresponds to the number of molecular labels, counted in (i), that are noise molecular labels; and (vi) determining the number of noise molecular labels using the signal negative binomial distribution fitted in (v) and the noise negative binomial distribution fitted in (vi).

In some embodiments, the negative binomial distribution fitting condition comprises that the sequencing status of the target in the sequencing data is not the under-sequencing status or the over-sequencing status. Determining the number of noise molecular labels using the signal negative binomial distribution fitted in (v) and the noise negative binomial distribution fitted in (vi) comprises, for each of the molecular labels with distinct sequences associated with the target in the sequencing data: determining a signal probability that the distinct sequence belongs to the signal negative binomial distribution; determining a noise probability that the distinct sequence belongs to the noise negative binomial distribution; and determining that the distinct sequence is a noise molecular label if the signal probability is smaller than the noise probability.
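The per-label comparison described above can be sketched as follows. The sketch assumes both negative binomial distributions have already been fitted (the fitting itself is omitted) and uses the (r, p) parameterization, in which the probability of an occurrence count k is Γ(k + r) / (k! Γ(r)) · p^r · (1 − p)^k. The function names, parameter values, and counts are hypothetical, and the two components are compared without mixture weights.

```python
import math


def nb_log_pmf(k, r, p):
    """Log-probability of count k under a negative binomial with parameters (r, p)."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))


def count_noise_labels(label_counts, signal_params, noise_params):
    """Return (number of noise labels, list of noise labels).

    A label is called noise when its occurrence count is less probable under the
    signal distribution than under the noise distribution.
    """
    noise = [ml for ml, k in label_counts.items()
             if nb_log_pmf(k, *signal_params) < nb_log_pmf(k, *noise_params)]
    return len(noise), noise


# Hypothetical fitted parameters: signal mean about 30 reads, noise mean about 1 read.
signal_params = (20.0, 0.4)
noise_params = (1.0, 0.5)
counts = {"AAAA": 32, "AAAT": 28, "AATT": 1, "ATTT": 2}
n_noise, noise_labels = count_noise_labels(counts, signal_params, noise_params)
adjusted_count = len(counts) - n_noise  # distinct labels after removing noise labels
```

Working in log space avoids underflow for large occurrence counts, which is why `math.lgamma` is used instead of evaluating the factorials directly.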

In some embodiments, determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data comprises: if the sequencing status of the target in the sequencing data is not the under-sequencing status or the over-sequencing status, and the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) is smaller than a pseudo-point threshold, adding pseudo-points to the number of molecular labels with distinct sequences associated with the target in the sequencing data before determining, in (ii), the number of noise molecular labels with distinct sequences associated with the target in the sequencing data. The pseudo-point threshold is 10.

In some embodiments, determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data comprises: if the sequencing status of the target in the sequencing data is not the under-sequencing status or the over-sequencing status, and the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) is greater than or equal to the pseudo-point threshold, removing non-unique molecular labels when determining, in (ii), the number of noise molecular labels with distinct sequences associated with the target in the sequencing data.

In some embodiments, removing the non-unique molecular labels comprises: if the number of molecular labels with distinct sequences associated with the target in the sequencing data is greater than a predetermined reused molecular label threshold, removing the non-unique molecular labels when determining, in (ii), the number of noise molecular labels with distinct sequences associated with the target in the sequencing data. For example, the reused molecular label threshold can be about 650 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences.

In some embodiments, removing the non-unique molecular labels comprises: determining a theoretical number of non-unique molecular labels for the number of molecular labels with distinct sequences associated with the target in the sequencing data; and removing molecular labels with numbers of occurrences greater than that of the n-th most abundant molecular label with a distinct sequence associated with the target in the sequencing data, wherein n is the theoretical number of non-unique molecular labels.
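This section does not give a formula for the theoretical number of non-unique molecular labels. One plausible way to compute it, shown here purely as an assumption, is to treat each molecule as drawing a label uniformly at random from the pool and to take the expected number of labels attached to more than one molecule. The function name `expected_label_usage` is hypothetical.

```python
def expected_label_usage(n_molecules, pool_size=6561):
    """Expected numbers of distinct and of reused labels under uniform random labeling.

    Returns (distinct, reused):
      distinct = expected number of labels attached to at least one molecule
      reused   = expected number of labels attached to two or more molecules
    """
    q = 1.0 - 1.0 / pool_size
    p_unused = q ** n_molecules                                # label never drawn
    p_once = (n_molecules / pool_size) * q ** (n_molecules - 1)  # drawn exactly once
    distinct = pool_size * (1.0 - p_unused)
    reused = pool_size * (1.0 - p_unused - p_once)
    return distinct, reused


# Around the example reuse threshold of about 650 molecules for a pool of 6561 labels.
distinct, reused = expected_label_usage(650)
n = round(reused)  # candidate value for the theoretical number of non-unique labels
```

Under this model a few percent of the labels in use are already shared by more than one molecule at several hundred input molecules, which is consistent with treating label reuse as significant only above a threshold.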

Disclosed herein are computer systems for determining the number of targets. In some embodiments, the computer system comprises: a computer-readable memory storing executable instructions; and one or more computer processors in communication with the computer-readable memory, wherein the one or more computer processors are programmed by the executable instructions to perform: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data; and (iii) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the number of noise molecular labels determined in (ii). In some embodiments, the method further comprises determining the sequencing status of the target in the sequencing data. The sequencing status of the target in the sequencing data is saturated sequencing, under-sequencing, or over-sequencing.

In some embodiments, the saturated sequencing status is determined by the target having a number of molecular labels with distinct sequences greater than a predetermined saturation threshold. For example, the predetermined saturation threshold is about 6557 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The predetermined saturation threshold can be about 65532 if the stochastic barcodes comprise about 65536 molecular labels with distinct sequences. If the sequencing status of the target in the sequencing data is the saturated sequencing status, the number of noise molecular labels determined in (ii) is zero.

In some embodiments, the over-sequencing status is determined by the target having a number of molecular labels with distinct sequences greater than a predetermined over-sequencing threshold. For example, the over-sequencing threshold can be about 250 if the stochastic barcodes comprise about 6561 molecular labels with distinct sequences. The method comprises, if the sequencing status of the target in the sequencing data is the over-sequencing status, subsampling the number of molecular labels with distinct sequences associated with the target in the sequencing data to the predetermined over-sequencing threshold.

In some embodiments, the negative binomial distribution fitting condition comprises that the sequencing status of the target in the sequencing data is not the under-sequencing status or the over-sequencing status. Determining the number of noise molecular labels using the signal negative binomial distribution fitted in (v) and the noise negative binomial distribution fitted in (vi) comprises, for each of the molecular labels with distinct sequences associated with the target in the sequencing data: determining a signal probability that the distinct sequence belongs to the signal negative binomial distribution; determining a noise probability that the distinct sequence belongs to the noise negative binomial distribution; and determining that the distinct sequence is a noise molecular label if the signal probability is smaller than the noise probability.

In some embodiments, determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data comprises: if the sequencing status of the target in the sequencing data is not the under-sequencing status or the over-sequencing status, and the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) is greater than or equal to the pseudo-point threshold, removing non-unique molecular labels when determining, in (ii), the number of noise molecular labels with distinct sequences associated with the target in the sequencing data.

Disclosed herein are one or more non-transitory computer-readable media comprising executable code that, when executed, performs any of the methods disclosed herein.

Shows a non-limiting exemplary stochastic barcode.
Shows non-limiting exemplary stochastic barcoding and digital counting.
A schematic illustration of a non-limiting exemplary process for generating an indexed library of stochastically barcoded targets from a plurality of targets.
A schematic illustration of non-limiting exemplary distributions of molecular label errors, sample label errors, and true molecular label signals.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors using molecular labels.
A schematic illustration of sequencing data obtained by complete sequencing and incomplete sequencing.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors using molecular labels based on directional adjacency.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and the second derivative of the change in molecular label depth.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction.
A flowchart showing a non-limiting exemplary embodiment of error correction using two negative binomial distributions.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction, by subsampling microwell plates and mapping molecular labels.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction, by subsampling genes and mapping molecular labels.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction, by recursion.
A flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction, by using the second highest molecular label for the initial parameter estimates.
Shows a non-limiting exemplary instrument suitable for use in the methods of the present disclosure.
Shows a non-limiting exemplary architecture of a computer system that can be used in connection with embodiments of the present disclosure.
Shows a non-limiting exemplary architecture of a network comprising a plurality of computer systems suitable for use in the methods of the present disclosure.
Shows a non-limiting exemplary architecture of a multiprocessor computer system using a shared virtual address memory space in accordance with the methods of the present disclosure.
Shows non-limiting examples of completely and incompletely sequenced genes.
A non-limiting exemplary plot of sequencing reads versus their rank after correction for single-base sequencing errors, with the threshold for separating true and error barcodes.
A non-limiting exemplary illustration of a zero-truncated Poisson model.
Shows a bar chart of total sequencing reads per well.
Shows a bar chart of the percentage of completely sequenced genes, the percentage of molecular labels (MLs) retained as true barcodes, and the percentage of retained reads that map to those retained MLs, for each well.
Shows a box plot of the percentage of retained reads, which varies by gene, for each well.
Shows a principal component analysis (PCA) using uncorrected ML versus corrected MI after applying the algorithm, from two plates.
An exemplary plot of the theoretical calculation of unique molecular labels used as the number of input molecules increases.
An exemplary plot showing the molecular label coverage of each molecular label across a microwell plate for a highly expressed gene, ATCB, in which distinct distributions are observed for error molecular labels and real molecular labels.
An exemplary plot showing the fitting of two negative binomial distributions to the molecular label coverage of each molecular label across a microwell plate for a highly expressed gene, ATCB. The fit of the two negative binomial distributions demonstrates that molecular label errors, which have lower molecular label depth, can be statistically distinguished from true molecular labels, which have higher molecular label depth. The x-axis is the molecular label depth.
Shows molecular label correction, in which pairwise Hamming distances of 1 accounted for a large proportion. After molecular label correction, molecular labels differing by a Hamming distance of 1 were clustered and collapsed into the same parent molecular label.
Shows a curve of the number of corrected molecular labels versus the corrected read coverage.
Shows a schematic illustration of an example of recursive substitution error correction.
Panels (a)-(e) show exemplary results of correcting PCR and sequencing errors based on the second derivative of the change in molecular label depth.
Panels (a)-(c) show exemplary results of correcting PCR and sequencing errors based on two negative binomial distributions for CD69.
Same as above.
Panels (a)-(c) show exemplary results of correcting PCR and sequencing errors based on two negative binomial distributions for CD3E.
Same as above.
Panels (a)-(c) show exemplary results of correcting PCR and sequencing errors based on two negative binomial distributions for a highly expressed gene.
Same as above.
Shows exemplary results of the reuse of G-rich molecular labels for a highly expressed gene.
Panels (a)-(b) show exemplary results of adjusting the input data for a highly expressed gene before fitting the two negative binomial distributions.
Panels (a)-(j) show a non-limiting exemplary validation of datasets corrected using two negative binomial distributions.
Same as above.
Same as above.
Same as above.
Same as above.
Panels (a)-(d) show exemplary t-distributed stochastic neighbor embedding (t-SNE) visualizations of a Precise™ targeted assay from 96 wells of mixed Jurkat and breast cancer (BrCa) single cells (86 genes tested).
Same as above.
Panels (a)-(b) are non-limiting exemplary plots showing differential expression analysis between cell clusters for genes with > 0 ML in both selected clusters, as computed by DBScan and determined by the levels of the gene markers in each cluster.
Same as above.
Panels (a)-(d) are non-limiting exemplary plots showing t-distributed stochastic neighbor embedding (t-SNE) visualizations of a BD Precise™ targeted assay from a 96-well plate of mixed Jurkat and breast cancer (T47D) single cells, with 86 genes tested.
Same as above.
Non-limiting exemplary heat maps showing differential gene expression by molecular label count between the various cell clusters identified in FIG. 41, before any error correction step (uncorrected ML, shown in FIG. 42, panel (a)) and after RSEC and DBEC correction (adjusted ML, shown in FIG. 42, panel (b)).
Same as above.

In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein and illustrated in the drawings, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein and made part of this disclosure.

All patents, published patent applications, other publications, and sequences from GenBank and other databases referred to herein are incorporated by reference in their entirety with respect to the related technology.

Quantifying small numbers of nucleic acids, for example messenger ribonucleic acid (mRNA) molecules, is clinically important for determining, for example, the genes that are expressed at different stages of development or under different environmental conditions. However, determining the absolute number of nucleic acid molecules (e.g., mRNA molecules) can be very challenging, especially when the number of molecules is very small. One method of determining the absolute number of molecules in a sample is digital polymerase chain reaction (PCR). Ideally, PCR produces an identical copy of a molecule at each cycle. However, PCR can have the disadvantage that each molecule replicates with a stochastic probability, and this probability varies with PCR cycle and gene sequence, resulting in amplification bias and inaccurate gene expression measurements. Stochastic barcodes with unique molecular labels (also referred to as molecular indexes (MIs)) can be used to count the number of molecules and correct for amplification bias. Stochastic barcoding, such as the Precise™ assay (Cellular Research, Inc. (Palo Alto, CA)), can correct for bias induced by PCR and library preparation steps by using molecular labels (MLs) to label mRNAs during reverse transcription (RT).

The Precise™ assay can utilize a non-depleting pool of stochastic barcodes with a large number of unique molecular labels (for example, 6561 to 65536) on poly(T) oligonucleotides to hybridize to all poly(A)-mRNAs in a sample during the RT step. In addition to the molecular labels, sample labels (also referred to as sample indexes (SIs)) of the stochastic barcodes can be used to identify each well of a Precise™ plate. A stochastic barcode can include a universal PCR priming site. During RT, target gene molecules react randomly with stochastic barcodes. Each target molecule can hybridize to a stochastic barcode, generating a stochastically barcoded complementary deoxyribonucleic acid (cDNA) molecule. After labeling, stochastically barcoded cDNA molecules from the microwells of a microwell plate can be pooled into a single tube for PCR amplification and sequencing. Raw sequencing data can be analyzed to obtain the number of reads, the number of stochastic barcodes with unique molecular labels, and the number of mRNA molecules, following Poisson correction or a correction method based on two negative binomial distributions.

Besides bias correction, molecular labels can provide a better understanding of the statistical quality of the results by revealing the number of starting cDNA molecules represented in the observed sequencing reads. For example, a large number of reads may suggest a statistically accurate answer, but if those reads derive from only a few starting mRNA molecules, the measurement precision can be compromised.

Although amplification bias induced by PCR and library preparation steps can be corrected, for example by molecular labels, quantifying the absolute number of molecules can still be difficult because of several other factors. First, the estimate of the number of mRNA molecules can be limited by the overall diversity of the molecular labels. During stochastic barcoding, mRNA molecules can react randomly with the available stochastic barcodes. Thus, each mRNA molecule can hybridize to a stochastic barcode, but its molecular label is not necessarily unique for any given gene. When the number of mRNA molecules is small relative to the number of stochastic barcodes, each mRNA molecule is likely to hybridize to a stochastic barcode with a unique molecular label, and counting molecules can be equivalent to counting molecular labels.

As the number of mRNA molecules increases, multiple mRNA molecules become more likely to hybridize to stochastic barcodes with the same molecular label. Hence, using the count of unique molecular labels can underestimate the number of molecules. In some instances, the number of mRNA molecules can be estimated from the total number of unique molecular labels observed, using Poisson correction or a correction based on two negative binomial distributions. However, in the extreme case in which the entire collection of 6561 stochastic barcodes is observed, Poisson correction or correction based on two negative binomial distributions may no longer be possible. For example, whether there are 65000 or 100000 starting mRNA molecules, the maximum of 6561 saturated stochastic barcodes is expected in either case.
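By way of a non-limiting illustration, the saturation effect described above can be sketched numerically. The passage does not give a formula for Poisson correction, so the sketch assumes a common occupancy model in which each molecule independently receives one of m equally likely molecular labels, so that n molecules are expected to show k = m(1 - exp(-n/m)) distinct labels. The function names are hypothetical and are not taken from the disclosure.

```python
import math


def expected_unique_labels(n_molecules: float, n_labels: int) -> float:
    """Expected number of distinct labels observed (assumed Poisson model)."""
    return n_labels * (1.0 - math.exp(-n_molecules / n_labels))


def poisson_corrected_count(observed_labels: int, n_labels: int) -> float:
    """Invert the model above: estimate molecules from distinct labels seen.

    Returns infinity when every label has been observed, because the
    model can no longer distinguish between larger molecule counts.
    """
    if not 0 <= observed_labels <= n_labels:
        raise ValueError("observed_labels must lie between 0 and n_labels")
    if observed_labels == n_labels:
        return math.inf
    return -n_labels * math.log(1.0 - observed_labels / n_labels)


if __name__ == "__main__":
    diversity = 6561  # label diversity used as an example in the text
    for observed in (10, 1000, 6000, 6561):
        estimate = poisson_corrected_count(observed, diversity)
        print(observed, estimate)
```

Under this assumed model, an observed count far below the label diversity is almost unchanged by the correction, whereas observing all 6561 labels yields no finite estimate, which is consistent with the extreme case noted in the text.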

Second, PCR errors (that is, errors that arise during PCR amplification) can introduce artifactual stochastic barcodes and artificially inflate the molecular label count. Third, PCR amplification bias and inefficient PCR can generate low-copy barcoded molecules that are indistinguishable from errors. Fourth, sequencing errors, that is, inaccurate calling of stochastic barcode sequences, can introduce artifactual stochastic barcodes and inflate the molecular label count. In addition, sequencing depth can be important, especially when the sequencing is too shallow to detect all of the stochastically barcoded mRNAs present in a sample library.

Disclosed herein are methods and systems for determining the number of targets with one or more PCR or sequencing errors corrected or adjusted. In some embodiments, the method comprises: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining a quality status of the target in the sequencing data obtained in (b); (iii) determining one or more sequencing data errors in the sequencing data obtained in (b), wherein determining the one or more sequencing data errors comprises determining one or more of: the number of molecular labels with distinct sequences associated with the target in the sequencing data, the quality status of the target in the sequencing data, and the number of molecular labels with distinct sequences in the plurality of stochastic barcodes; and (iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the one or more sequencing data errors determined in (iii).

Disclosed herein are methods for determining the number of targets with one or more PCR or sequencing errors corrected or adjusted based on directional adjacency. In some embodiments, the method comprises: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii); and (iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after the collapsing of the sequencing data in (iii).
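As a hedged sketch of how steps (ii) and (iii) might look in practice: this passage does not state the adjacency rule, so the code below assumes a rule commonly used for directional clustering of molecular labels, in which a child label joins a parent when the two sequences differ at exactly one position (Hamming distance 1) and count(parent) >= 2 x count(child) - 1. The function names and the threshold are assumptions for illustration only and should not be read as the method of the disclosure.

```python
from collections import deque
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length labels."""
    if len(a) != len(b):
        raise ValueError("labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_directional(counts: Dict[str, int]) -> Dict[str, int]:
    """Collapse molecular labels of one target into parent labels.

    counts maps each observed label sequence to its read count.
    Returns a mapping of parent label -> total reads in its cluster.
    """
    # Visit labels from most to least abundant (ties broken by sequence).
    order = sorted(counts, key=lambda lab: (-counts[lab], lab))
    assigned = set()
    collapsed: Dict[str, int] = {}
    for root in order:
        if root in assigned:
            continue
        assigned.add(root)
        total = counts[root]
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in order:
                if child in assigned:
                    continue
                # Assumed directional rule: one mismatch, and the parent
                # is at least about twice as abundant as the child.
                if (hamming(parent, child) == 1
                        and counts[parent] >= 2 * counts[child] - 1):
                    assigned.add(child)
                    total += counts[child]
                    queue.append(child)
        collapsed[root] = total
    return collapsed


if __name__ == "__main__":
    reads = {"AAAA": 100, "AAAT": 3, "AATT": 1, "CCCC": 50, "CCCG": 40}
    print(collapse_directional(reads))
```

In this toy input, the low-count labels AAAT and AATT are folded into AAAA, while CCCC and CCCG remain separate because their counts are too similar for one to be treated as an error derived from the other. The number of parent labels remaining after collapsing (three here) would then serve as the adjusted molecular label count.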

Disclosed herein are computer systems for determining the number of targets with one or more PCR or sequencing errors corrected or adjusted. Also disclosed are non-transitory computer-readable media comprising executable code that, when executed, causes one or more computing devices to determine the number of targets with one or more PCR or sequencing errors corrected or adjusted.

Definitions
Unless defined otherwise, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. See, e.g., Singleton et al., Dictionary of Microbiology and Molecular Biology, 2nd ed., J. Wiley & Sons (New York, NY 1994); Sambrook et al., Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press (Cold Spring Harbor, NY 1989). For purposes of the present disclosure, the following terms are defined below.

As used herein, the term "adapter" can mean a sequence that facilitates amplification or sequencing of associated nucleic acids. The associated nucleic acids can comprise target nucleic acids. The associated nucleic acids can comprise one or more of spatial labels, target labels, sample labels, indexing labels, barcodes, stochastic barcodes, or molecular labels. An adapter can be linear. An adapter can be pre-adenylated. An adapter can be double-stranded or single-stranded. One or more adapters can be located on the 5' or 3' end of a nucleic acid. When the adapters comprise known sequences on the 5' and 3' ends, the known sequences can be the same or different sequences. An adapter located on the 5' and/or 3' end of a polynucleotide can be capable of hybridizing to one or more oligonucleotides immobilized on a surface. An adapter can, in some embodiments, comprise a universal sequence. A universal sequence can be a region of nucleotide sequence that is common to two or more nucleic acid molecules. The two or more nucleic acid molecules can also have regions of different sequence. Thus, for example, the 5' adapters can comprise identical and/or universal nucleic acid sequences, and the 3' adapters can comprise identical and/or universal sequences. A universal sequence that may be present in different members of a plurality of nucleic acid molecules can allow the replication or amplification of multiple different sequences using a single universal primer that is complementary to the universal sequence. Similarly, at least one, two (e.g., a pair) or more universal sequences that may be present in different members of a collection of nucleic acid molecules can allow the replication or amplification of multiple different sequences using at least one, two (e.g., a pair) or more single universal primers that are complementary to the universal sequences. Thus, a universal primer includes a sequence that can hybridize to such a universal sequence. The target nucleic acid sequence-bearing molecules may be modified to attach universal adapters (e.g., non-target nucleic acid sequences) to one or both ends of the different target nucleic acid sequences. The one or more universal primers attached to the target nucleic acid can provide sites for hybridization of universal primers. The one or more universal primers attached to the target nucleic acid can be the same or different from each other.

As used herein, the terms "associated" or "associated with" can mean that two or more species are identifiable as being co-located at a point in time. An association can mean that two or more species are or were within a similar container. An association can be an informatics association, in which, for example, digital information regarding two or more species is stored and can be used to determine that one or more of the species were co-located at a point in time. An association can also be a physical association. In some embodiments, two or more associated species are "tethered", "attached", or "immobilized" to one another or to a common solid or semi-solid surface. An association may refer to covalent or non-covalent means for attaching labels to solid or semi-solid supports such as beads. An association may be a covalent bond between a target and a label.

As used herein, the term "complementary" can refer to the capacity for precise pairing between two nucleotides. For example, if a nucleotide at a given position of a nucleic acid is capable of hydrogen bonding with a nucleotide of another nucleic acid, then the two nucleic acids are considered to be complementary to one another at that position. Complementarity between two single-stranded nucleic acid molecules may be "partial", in which only some of the nucleotides bind, or it may be complete, when total complementarity exists between the single-stranded molecules. A first nucleotide sequence can be said to be the "complement" of a second sequence if the first nucleotide sequence is complementary to the second nucleotide sequence. A first nucleotide sequence can be said to be the "reverse complement" of a second sequence if the first nucleotide sequence is complementary to a sequence that is the reverse (i.e., the order of the nucleotides is reversed) of the second sequence. As used herein, the terms "complement", "complementary", and "reverse complement" can be used interchangeably. It is understood from the disclosure that if a molecule can hybridize to another molecule, it may be the complement of the molecule to which it hybridizes.
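The distinction between a complement and a reverse complement can be illustrated with a short sketch. It assumes plain DNA sequences written with the letters A, C, G, and T and standard Watson-Crick pairing (A with T, C with G); the function names are illustrative only.

```python
# Translation table for standard Watson-Crick DNA base pairing.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Base-by-base complement, keeping the original order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement of the sequence read in the reverse order."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Applying reverse_complement twice returns the original sequence, which is a quick sanity check when comparing reads sequenced from opposite strands.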

As used herein, the term "digital counting" can refer to a method for estimating the number of target molecules in a sample. Digital counting can include the step of determining the number of unique labels that have been associated with targets in a sample. This stochastic methodology transforms the problem of counting molecules from one of locating and identifying identical molecules to a series of yes/no digital questions regarding detection of a set of predefined labels.
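A minimal sketch of digital counting, assuming that each sequencing read has already been reduced to a (target, label) pair. The input layout, the example gene names, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_unique_labels(reads: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct labels seen for each target.

    Each read answers a yes/no question: has this label been detected
    for this target? Repeated reads of the same pair add nothing.
    """
    labels_by_target: Dict[str, Set[str]] = defaultdict(set)
    for target, label in reads:
        labels_by_target[target].add(label)
    return {target: len(labels) for target, labels in labels_by_target.items()}


if __name__ == "__main__":
    example_reads = [
        ("GAPDH", "AAAA"),
        ("GAPDH", "AAAA"),  # PCR duplicate of the same labeled molecule
        ("GAPDH", "CCCC"),
        ("ACTB", "GGGG"),
    ]
    print(count_unique_labels(example_reads))
```

Because duplicates of the same pair collapse into one set entry, the result depends on how many labels were detected and not on how many times each was read.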

As used herein, the term "label" can refer to a nucleic acid code associated with a target within a sample. A label can be, for example, a nucleic acid label. A label can be an entirely or partially amplifiable label. A label can be an entirely or partially sequenceable label. A label can be a portion of a native nucleic acid that is identifiable as distinct. A label can be a known sequence. A label can comprise a junction of nucleic acid sequences, for example a junction of a native and a non-native sequence. As used herein, the term "label" can be used interchangeably with the terms "index", "tag", or "label-tag". Labels can convey information. For example, in various embodiments, labels can be used to determine the identity of a sample, the source of a sample, the identity of a cell, and/or a target.

As used herein, the term "non-depleting reservoir" can refer to a pool of stochastic barcodes made up of many different labels. A non-depleting reservoir can comprise a large number of different stochastic barcodes such that, when the non-depleting reservoir is associated with a pool of targets, each target is likely to be associated with a unique stochastic barcode. The uniqueness of each labeled target molecule can be determined by the statistics of random choice and depends on the number of copies of identical target molecules in the collection compared with the diversity of labels. The size of the resulting set of labeled target molecules can be determined by the stochastic nature of the barcoding process, and analysis of the number of stochastic barcodes detected then allows calculation of the number of target molecules present in the original collection or sample. When the ratio of the number of copies of a target molecule present to the number of unique stochastic barcodes is low, the labeled target molecules are highly unique (that is, there is a very low probability that more than one target molecule will be labeled with a given label).
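The "statistics of random choice" mentioned above can be made concrete with a small calculation. The sketch assumes that each of n copies of a target independently receives one of m equally likely labels, an idealization not spelled out in the text, and computes the probability that all n copies carry different labels. The function name is hypothetical.

```python
import math


def prob_all_unique(n_molecules: int, n_labels: int) -> float:
    """Probability that n molecules all receive different labels.

    Computes the product of (1 - i / n_labels) for i = 0 .. n - 1,
    summed in log space to stay numerically stable for large n.
    """
    if n_molecules > n_labels:
        return 0.0  # more molecules than labels: a repeat is certain
    log_p = sum(math.log1p(-i / n_labels) for i in range(n_molecules))
    return math.exp(log_p)


if __name__ == "__main__":
    diversity = 6561  # label diversity used as an example in the text
    for copies in (1, 10, 100, 1000):
        print(copies, prob_all_unique(copies, diversity))
```

Under this assumption the probability stays close to 1 while the copy number is small relative to the label diversity and falls off quickly as the copy number grows, which is why a low ratio of copies to labels corresponds to highly unique labeling.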

As used herein, the term "nucleic acid" refers to a polynucleotide sequence, or a fragment thereof. A nucleic acid can comprise nucleotides. A nucleic acid can be exogenous or endogenous to a cell. A nucleic acid can exist in a cell-free environment. A nucleic acid can be a gene or a fragment thereof. A nucleic acid can be DNA. A nucleic acid can be RNA. A nucleic acid can comprise one or more analogs (e.g., an altered backbone, sugar, or nucleobase). Some non-limiting examples of analogs include: 5-bromouracil, peptide nucleic acid, xeno nucleic acid, morpholinos, locked nucleic acids, glycol nucleic acids, threose nucleic acids, dideoxynucleotides, cordycepin, 7-deaza-GTP, fluorophores (e.g., rhodamine or fluorescein linked to the sugar), thiol-containing nucleotides, biotin-linked nucleotides, fluorescent base analogs, CpG islands, methyl-7-guanosine, methylated nucleotides, inosine, thiouridine, pseudouridine, dihydrouridine, queuosine, and wyosine. "Nucleic acid", "polynucleotide", "target polynucleotide", and "target nucleic acid" can be used interchangeably.

A nucleic acid can comprise one or more modifications (e.g., a base modification, a backbone modification) to provide the nucleic acid with a new or enhanced feature (e.g., improved stability). A nucleic acid can comprise a nucleic acid affinity tag. A nucleoside can be a base-sugar combination. The base portion of the nucleoside can be a heterocyclic base. The two most common classes of such heterocyclic bases are the purines and the pyrimidines. Nucleotides can be nucleosides that further include a phosphate group covalently linked to the sugar portion of the nucleoside. For nucleosides that include a pentofuranosyl sugar, the phosphate group can be linked to the 2', the 3', or the 5' hydroxyl moiety of the sugar. In forming nucleic acids, the phosphate groups can covalently link adjacent nucleosides to one another to form a linear polymeric compound. In turn, the respective ends of this linear polymeric compound can be further joined to form a circular compound; however, linear compounds are generally suitable. In addition, linear compounds may have internal nucleotide base complementarity and may therefore fold in a manner that produces a fully or partially double-stranded compound. Within nucleic acids, the phosphate groups are commonly referred to as forming the internucleoside backbone of the nucleic acid. The linkage or backbone can be a 3'→5' phosphodiester linkage.

A nucleic acid can comprise a modified backbone and/or modified internucleoside linkages. Modified backbones can include those that retain a phosphorus atom in the backbone and those that do not have a phosphorus atom in the backbone. Suitable modified nucleic acid backbones containing a phosphorus atom can include, for example, phosphorothioates; chiral phosphorothioates; phosphorodithioates; phosphotriesters; aminoalkyl phosphotriesters; methyl and other alkyl phosphonates such as 3'-alkylene phosphonates and 5'-alkylene phosphonates; chiral phosphonates; phosphinates; phosphoramidates including 3'-amino phosphoramidate and aminoalkyl phosphoramidates; phosphorodiamidates; thionophosphoramidates; thionoalkylphosphonates; thionoalkylphosphotriesters; selenophosphates; boranophosphates having normal 3'-5' linkages or 2'-5' linked analogs; and those having inverted polarity, in which one or more internucleotide linkages is a 3'→3', a 5'→5', or a 2'→2' linkage.

A nucleic acid can comprise polynucleotide backbones that are formed by short-chain alkyl or cycloalkyl internucleoside linkages, mixed heteroatom and alkyl or cycloalkyl internucleoside linkages, or one or more short-chain heteroatomic or heterocyclic internucleoside linkages. These can include those having morpholino linkages (formed in part from the sugar portion of a nucleoside); siloxane backbones; sulfide, sulfoxide, and sulfone backbones; formacetyl and thioformacetyl backbones; methylene formacetyl and thioformacetyl backbones; riboacetyl backbones; alkene-containing backbones; sulfamate backbones; methyleneimino and methylenehydrazino backbones; sulfonate and sulfonamide backbones; amide backbones; and others having mixed N, O, S, and CH₂ component parts.

A nucleic acid can comprise a nucleic acid mimetic. The term "mimetic" can be intended to include polynucleotides in which only the furanose ring, or both the furanose ring and the internucleotide linkage, are replaced with non-furanose groups; replacement of only the furanose ring can also be referred to as a sugar surrogate. The heterocyclic base moiety or a modified heterocyclic base moiety can be maintained for hybridization with an appropriate target nucleic acid. One such nucleic acid can be a peptide nucleic acid (PNA). In a PNA, the sugar backbone of a polynucleotide can be replaced with an amide-containing backbone, in particular an aminoethylglycine backbone. The nucleotides can be retained and are bound directly or indirectly to aza nitrogen atoms of the amide portion of the backbone. The backbone in PNA compounds can comprise two or more linked aminoethylglycine units, which give the PNA an amide-containing backbone. The heterocyclic base moieties can be bound directly or indirectly to aza nitrogen atoms of the amide portion of the backbone.

A nucleic acid can comprise a morpholino backbone structure. For example, a nucleic acid can comprise a 6-membered morpholino ring in place of a ribose ring. In some of these embodiments, a phosphorodiamidate or other non-phosphodiester internucleoside linkage can replace the phosphodiester linkage.

A nucleic acid can comprise linked morpholino units (i.e., a morpholino nucleic acid) having heterocyclic bases attached to the morpholino ring. Linking groups can link the morpholino monomeric units in a morpholino nucleic acid. Non-ionic morpholino-based oligomeric compounds can have fewer undesired interactions with cellular proteins. Morpholino-based polynucleotides can be non-ionic mimics of nucleic acids. A variety of compounds within the morpholino class can be joined using different linking groups. A further class of polynucleotide mimetics can be referred to as cyclohexenyl nucleic acids (CeNA). The furanose ring normally present in a nucleic acid molecule can be replaced with a cyclohexenyl ring. CeNA DMT-protected phosphoramidite monomers can be prepared and used for oligomeric compound synthesis using phosphoramidite chemistry. The incorporation of CeNA monomers into a nucleic acid chain can increase the stability of a DNA/RNA hybrid. CeNA oligoadenylates can form complexes with nucleic acid complements with stability similar to that of the native complexes. A further modification can include locked nucleic acids (LNAs), in which the 2'-hydroxyl group is linked to the 4' carbon atom of the sugar ring, thereby forming a 2'-C,4'-C-oxymethylene linkage and thus a bicyclic sugar moiety. The linkage can be a methylene (-CH2-)n group bridging the 2' oxygen atom and the 4' carbon atom, wherein n is 1 or 2. LNA and LNA analogs can display very high duplex thermal stabilities with complementary nucleic acids (Tm = +3 to +10 °C), stability toward 3'-exonucleolytic degradation, and good solubility properties.

A nucleic acid may also include nucleobase (often referred to simply as "base") modifications or substitutions. As used herein, "unmodified" or "natural" nucleobases can include the purine bases (e.g., adenine (A) and guanine (G)) and the pyrimidine bases (e.g., thymine (T), cytosine (C), and uracil (U)). Modified nucleobases can include other synthetic and natural nucleobases such as 5-methylcytosine (5-me-C), 5-hydroxymethylcytosine, xanthine, hypoxanthine, 2-aminoadenine, 6-methyl and other alkyl derivatives of adenine and guanine, 2-propyl and other alkyl derivatives of adenine and guanine, 2-thiouracil, 2-thiothymine and 2-thiocytosine, 5-halouracil and cytosine, 5-propynyl (-C≡C-CH3) uracil and cytosine and other alkynyl derivatives of pyrimidine bases, 6-azo uracil, cytosine and thymine, 5-uracil (pseudouracil), 4-thiouracil, 8-halo, 8-amino, 8-thiol, 8-thioalkyl, 8-hydroxyl and other 8-substituted adenines and guanines, 5-halo (particularly 5-bromo), 5-trifluoromethyl and other 5-substituted uracils and cytosines, 7-methylguanine and 7-methyladenine, 2-F-adenine, 2-aminoadenine, 8-azaguanine and 8-azaadenine, 7-deazaguanine and 7-deazaadenine, and 3-deazaguanine and 3-deazaadenine. Modified nucleobases can include tricyclic pyrimidines such as phenoxazine cytidine (1H-pyrimido(5,4-b)(1,4)benzoxazin-2(3H)-one), phenothiazine cytidine (1H-pyrimido(5,4-b)(1,4)benzothiazin-2(3H)-one), G-clamps such as a substituted phenoxazine cytidine (e.g., 9-(2-aminoethoxy)-H-pyrimido(5,4-b)(1,4)benzoxazin-2(3H)-one), carbazole cytidine (2H-pyrimido(4,5-b)indol-2-one), and pyridoindole cytidine (H-pyrido(3',2':4,5)pyrrolo[2,3-d]pyrimidin-2-one).

As used herein, the term "sample" can refer to a composition comprising targets. Suitable samples for analysis by the disclosed methods, devices, and systems include cells, tissues, organs, or organisms.

As used herein, the term "sampling device" or "device" can refer to a device that may take a section of a sample and/or place the section on a substrate. A sampling device can refer to, for example, a fluorescence-activated cell sorting (FACS) machine, a cell sorter machine, a biopsy needle, a biopsy device, a tissue sectioning device, a microfluidic device, a blade grid, and/or a microtome.

As used herein, the term "solid support" can refer to a discrete solid or semi-solid surface to which a plurality of stochastic barcodes may be attached. A solid support may encompass any type of solid, porous, or hollow sphere, ball, bearing, cylinder, or other similar configuration composed of plastic, ceramic, metal, or polymeric material (e.g., hydrogel) onto which a nucleic acid may be immobilized (e.g., covalently or non-covalently). A solid support may comprise a discrete particle that may be spherical (e.g., a microsphere) or have a non-spherical or irregular shape, such as cubic, cuboid, pyramidal, cylindrical, conical, oblate, or disc-shaped, and the like. A plurality of solid supports spaced apart in an array may not comprise a substrate. The term "solid support" may be used interchangeably with the term "bead."

A solid support can refer to a "substrate." A substrate can be a type of solid support. A substrate can refer to a continuous solid or semi-solid surface on which the methods of the disclosure may be performed. A substrate can refer to, for example, an array, a cartridge, a chip, a device, or a slide.

As used herein, the term "spatial label" can refer to a label that can be associated with a position in space.

As used herein, the term "stochastic barcode" can refer to a polynucleotide sequence comprising labels. A stochastic barcode can be a polynucleotide sequence that can be used for stochastic barcoding. Stochastic barcodes can be used to quantify targets within a sample. Stochastic barcodes can be used to control for errors that may occur after a label is associated with a target. For example, a stochastic barcode can be used to assess amplification or sequencing errors. A stochastic barcode associated with a target can be called a stochastic barcode-target or a stochastic barcode-tag-target.

As used herein, the term "gene-specific stochastic barcode" can refer to a polynucleotide sequence comprising labels and a target-binding region that is gene-specific. A stochastic barcode can be a polynucleotide sequence that can be used for stochastic barcoding. Stochastic barcodes can be used to quantify targets within a sample. Stochastic barcodes can be used to control for errors that may occur after a label is associated with a target. For example, a stochastic barcode can be used to assess amplification or sequencing errors. A stochastic barcode associated with a target can be called a stochastic barcode-target or a stochastic barcode-tag-target.

As used herein, the term "stochastic barcoding" can refer to the random labeling (e.g., barcoding) of nucleic acids. Stochastic barcoding can utilize a recursive Poisson strategy to associate labels with targets and to quantify the labels associated with the targets. As used herein, the term "stochastic barcoding" can be used interchangeably with "gene-specific stochastic barcoding."
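The "recursive Poisson strategy" is not spelled out in this passage. Purely as an illustration, the Python sketch below uses the standard Poisson approximation for random labeling: if n molecules each pick one of m equally likely labels, about m(1 - e^(-n/m)) distinct labels are expected, and that relation can be inverted to estimate n from an observed number of distinct labels. The function names and the pool size in the usage note are assumptions, not part of the disclosure.

```python
import math


def expected_distinct_labels(n_molecules: float, n_labels: int) -> float:
    """Expected number of distinct labels seen when n_molecules are labeled
    at random from a pool of n_labels equally likely labels (Poisson approx.)."""
    return n_labels * (1.0 - math.exp(-n_molecules / n_labels))


def estimate_molecules(n_distinct: float, n_labels: int) -> float:
    """Invert the relation above: estimate how many molecules were labeled
    from the number of distinct labels observed."""
    if not 0 <= n_distinct < n_labels:
        raise ValueError("n_distinct must be in [0, n_labels)")
    return -n_labels * math.log(1.0 - n_distinct / n_labels)
```

For example, with a hypothetical pool of 6561 labels, `estimate_molecules(expected_distinct_labels(500, 6561), 6561)` recovers 500 up to floating-point error. The correction matters most when the number of molecules approaches the size of the label pool.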

As used herein, the term "target" can refer to a composition that can be associated with a stochastic barcode. Exemplary targets suitable for analysis by the disclosed methods, devices, and systems include oligonucleotides, DNA, RNA, mRNA, microRNA, tRNA, and the like. Targets can be single-stranded or double-stranded. In some embodiments, targets can be proteins. In some embodiments, targets are lipids.

As used herein, the term "reverse transcriptase" can refer to a group of enzymes having reverse transcriptase activity (i.e., that catalyze the synthesis of DNA from an RNA template). In general, such enzymes include, but are not limited to, retroviral reverse transcriptases, retrotransposon reverse transcriptases, retroplasmid reverse transcriptases, retron reverse transcriptases, bacterial reverse transcriptases, group II intron-derived reverse transcriptases, and mutants, variants, or derivatives thereof. Non-retroviral reverse transcriptases include non-LTR retrotransposon reverse transcriptases, retroplasmid reverse transcriptases, retron reverse transcriptases, and group II intron reverse transcriptases. Examples of group II intron reverse transcriptases include the Lactococcus lactis Ll.LtrB intron reverse transcriptase, the Thermosynechococcus elongatus TeI4c intron reverse transcriptase, and the Geobacillus stearothermophilus GsI-IIC intron reverse transcriptase. Other classes of reverse transcriptases can include many classes of non-retroviral reverse transcriptases (i.e., retrons, group II introns, and diversity-generating retroelements, among others).

The terms "universal adapter primer," "universal primer adapter," and "universal adapter sequence" are used interchangeably to refer to a nucleotide sequence that can be used to hybridize stochastic barcodes to generate gene-specific stochastic barcodes. A universal adapter sequence can be, for example, a known sequence that is universal across all stochastic barcodes used in the methods of the disclosure. For example, when multiple targets are labeled using the methods disclosed herein, each of the target-specific sequences may be linked to the same universal adapter sequence. In some embodiments, more than one universal adapter sequence may be used in the methods disclosed herein. For example, when multiple targets are labeled using the methods disclosed herein, at least two of the target-specific sequences are linked to different universal adapter sequences. A universal adapter primer and its complement may be included in two oligonucleotides, one of which comprises a target-specific sequence and the other a stochastic barcode. For example, a universal adapter sequence may be part of an oligonucleotide comprising a target-specific sequence that is used to generate a nucleotide sequence complementary to a target nucleic acid. A second oligonucleotide comprising a stochastic barcode and a sequence complementary to the universal adapter sequence may hybridize with that nucleotide sequence to generate a target-specific stochastic barcode. In some embodiments, a universal adapter primer has a sequence that is different from that of a universal PCR primer used in the methods of this disclosure.

Disclosed herein are methods and systems for detecting and/or correcting errors that occur during PCR and/or sequencing. The types of errors include, but are not limited to, substitution errors (of one or more bases) and non-substitution errors. Among substitution errors, one-base substitution errors can occur much more frequently than errors that differ by two or more bases. The methods and systems can be used, for example, to achieve accurate counting of molecular targets by stochastic barcoding.

Stochastic Barcodes
Stochastic barcoding has been described in, for example, US Patent Application Publication No. 20150299784, International Publication No. WO 2015031691, and Fu et al., Proc Natl Acad Sci U.S.A. 2011 May 31;108(22):9026-31, the contents of which are incorporated herein by reference in their entirety. Briefly, a stochastic barcode can be a polynucleotide sequence that may be used to stochastically label (e.g., barcode, tag) a target. A stochastic barcode can comprise one or more labels. Exemplary labels can include a universal label, a cell label, a molecular label, a sample label, a plate label, a spatial label, and/or a pre-spatial label. FIG. 1 illustrates an exemplary stochastic barcode 104 with a spatial label. The stochastic barcode 104 can comprise a 5' amine that may link the stochastic barcode to a solid support 105. The stochastic barcode can comprise a universal label, a dimension label, a spatial label, a cell label, and/or a molecular label. The order of the different labels in the stochastic barcode (including, but not limited to, the universal label, the dimension label, the spatial label, the cell label, and the molecular label) can vary. For example, as shown in FIG. 1, the universal label may be the 5'-most label, and the molecular label may be the 3'-most label. The spatial label, dimension label, and cell label may be in any order. In some embodiments, the universal label, the spatial label, the dimension label, the cell label, and the molecular label may be in any order.

A label, for example the cell label, can comprise a unique set of nucleic acid sub-sequences of defined length, e.g., seven nucleotides each (equivalent to the number of bits used in some Hamming error correction codes), which can be designed to provide error correction capability. A set of error correction sub-sequences comprising seven-nucleotide sequences can be designed such that any pairwise combination of sequences in the set exhibits a defined "genetic distance" (or number of mismatched bases); for example, a set of error correction sub-sequences can be designed to exhibit a genetic distance of three nucleotides. In this case, review of the error correction sequences in the set of sequencing data for labeled target nucleic acid molecules can allow amplification or sequencing errors to be detected or corrected. In some embodiments, the length of the nucleic acid sub-sequences used for creating error correction codes can be, for example, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 31, 40, or 50 nucleotides, or a number or a range between any two of these values. In some embodiments, nucleic acid sub-sequences of other lengths can be used for creating error correction codes.
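The design constraint described above, a set of seven-nucleotide sub-sequences in which every pair differs at three or more positions, can be sketched in Python as follows. This is a minimal illustration using a simple greedy search, not the construction used in the disclosure; the function names and the set size of 16 are assumptions chosen for the example. With a minimum pairwise distance of three, any read carrying a single substitution is still closer to its original sub-sequence than to any other member of the set, so it can be corrected.

```python
from itertools import product
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatched bases between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_code(length: int = 7, min_dist: int = 3, limit: int = 16) -> List[str]:
    """Greedily collect sequences whose pairwise distance is >= min_dist."""
    code: List[str] = []
    for cand in map("".join, product("ACGT", repeat=length)):
        if all(hamming(cand, kept) >= min_dist for kept in code):
            code.append(cand)
            if len(code) == limit:
                break
    return code


def correct(read: str, code: List[str]) -> Optional[str]:
    """Return the unique code sequence within one mismatch of the read, if any."""
    hits = [c for c in code if hamming(read, c) <= 1]
    return hits[0] if len(hits) == 1 else None
```

A read with two or more substitutions may fall outside every one-mismatch neighborhood, in which case `correct` returns `None` and the error is detected but not corrected.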

The stochastic barcode can comprise a target-binding region. The target-binding region can interact with a target in a sample. The target can be, or can comprise, ribonucleic acids (RNAs), messenger RNAs (mRNAs), microRNAs, small interfering RNAs (siRNAs), RNA degradation products, RNAs each comprising a poly(A) tail, or any combination thereof. In some embodiments, the plurality of targets can include deoxyribonucleic acids (DNAs).

In some embodiments, a target-binding region can comprise an oligo(dT) sequence that can interact with the poly(A) tails of mRNAs. One or more of the labels of the stochastic barcode (e.g., the universal label, the dimension label, the spatial label, the cell label, and the molecular label) can be separated by a spacer from another one or two of the remaining labels of the stochastic barcode. The spacer can be, for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 or more nucleotides. In some embodiments, none of the labels of the stochastic barcode is separated by a spacer.
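To make the idea of an ordered series of labels, spacers, and a target-binding region concrete, the following Python sketch splits a barcode sequence into named segments. The layout, segment names, and segment lengths are hypothetical values chosen for illustration; the disclosure states that label order and lengths can vary.

```python
from typing import Dict, List, Tuple

# Hypothetical layout, 5' to 3': (segment name, length in nucleotides).
# Segment names must be unique because they are used as dictionary keys.
LAYOUT: List[Tuple[str, int]] = [
    ("universal_label", 10),
    ("spacer", 2),
    ("cell_label", 7),
    ("molecular_label", 8),
    ("oligo_dT", 6),
]


def split_barcode(seq: str, layout: List[Tuple[str, int]] = LAYOUT) -> Dict[str, str]:
    """Cut a barcode sequence into its segments according to the layout."""
    expected = sum(length for _, length in layout)
    if len(seq) != expected:
        raise ValueError(f"expected {expected} nt, got {len(seq)}")
    parts: Dict[str, str] = {}
    pos = 0
    for name, length in layout:
        parts[name] = seq[pos:pos + length]
        pos += length
    return parts
```

Keeping the layout as data rather than hard-coding offsets makes it straightforward to describe a different label order or to drop the spacer entirely.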

Universal Labels
A stochastic barcode can comprise one or more universal labels. In some embodiments, the one or more universal labels can be the same for all stochastic barcodes in the set of stochastic barcodes attached to a given solid support. In some embodiments, the one or more universal labels can be the same for all stochastic barcodes attached to a plurality of beads. In some embodiments, a universal label can comprise a nucleic acid sequence that is capable of hybridizing to a sequencing primer. Sequencing primers can be used for sequencing stochastic barcodes comprising a universal label. Sequencing primers (e.g., universal sequencing primers) can comprise sequencing primers associated with high-throughput sequencing platforms. In some embodiments, a universal label can comprise a nucleic acid sequence that is capable of hybridizing to a PCR primer. In some embodiments, the universal label can comprise a nucleic acid sequence that is capable of hybridizing to a sequencing primer and a PCR primer. The nucleic acid sequence of the universal label that is capable of hybridizing to a sequencing primer or a PCR primer can be referred to as a primer binding site. A universal label can comprise a sequence that can be used to initiate transcription of the stochastic barcode. A universal label can comprise a sequence that can be used for extension of the stochastic barcode or of a region within the stochastic barcode. A universal label can be about 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or a number or a range between any two of these values. For example, a universal label can comprise at least about 10 nucleotides. A universal label can be at least, or at most, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 200, or 300 nucleotides in length. In some embodiments, a cleavable linker or modified nucleotide can be part of the universal label sequence to enable the stochastic barcode to be cleaved off from the support.

Dimension Labels
A stochastic barcode can comprise one or more dimension labels. In some embodiments, a dimension label can comprise a nucleic acid sequence that provides information about the dimension in which the stochastic labeling occurred. For example, a dimension label can provide information about the time at which a target was stochastically barcoded. A dimension label can be associated with the time of stochastic barcoding of a sample. A dimension label can be activated at the time of stochastic labeling. Different dimension labels can be activated at different times. The dimension label provides information about the order in which targets, groups of targets, and/or samples were stochastically barcoded. For example, a population of cells can be stochastically barcoded at the G0 phase of the cell cycle. The cells can be pulsed again with stochastic barcodes at the G1 phase of the cell cycle. The cells can be pulsed again with stochastic barcodes at the S phase of the cell cycle, and so on. The stochastic barcodes at each pulse (e.g., each phase of the cell cycle) can comprise different dimension labels. In this way, the dimension label provides information about which phase of the cell cycle the targets were labeled in. Dimension labels can interrogate many different biological time frames. Exemplary biological time frames include, but are not limited to, the cell cycle, transcription (e.g., transcription initiation), and transcript degradation. As another example, a sample (e.g., a cell or a population of cells) can be stochastically labeled before and/or after treatment with a drug and/or therapy. Changes in the number of copies of distinct targets can be indicative of the sample's response to the drug and/or therapy.

A dimension label can be activatable. An activatable dimension label can be activated at a specific time point. The activatable label can be, for example, constitutively activated (e.g., not turned off). The activatable dimension label can be, for example, reversibly activated (e.g., the activatable dimension label can be turned on and off). For example, the dimension label can be reversibly activatable at least 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more times. In some embodiments, the dimension label can be activated by fluorescence, light, a chemical event (e.g., cleavage, ligation of another molecule, or addition of a modification such as pegylation, sumoylation, acetylation, methylation, deacetylation, or demethylation), a photochemical event (e.g., photocaging), or introduction of a non-natural nucleotide.

The dimension label can, in some embodiments, be identical for all stochastic barcodes attached to a given solid support (e.g., bead), but different for different solid supports (e.g., beads). In some embodiments, at least 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, or 100% of the stochastic barcodes on the same solid support can comprise the same dimension label. In some embodiments, at least 60% of the stochastic barcodes on the same solid support can comprise the same dimension label. In some embodiments, at least 95% of the stochastic barcodes on the same solid support can comprise the same dimension label.

There can be as many as 10⁶ or more unique dimension label sequences represented in a plurality of solid supports (e.g., beads). A dimension label can be, or be about, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or a number or a range between any two of these values. A dimension label can be at least, or at most, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 200, or 300 nucleotides in length. A dimension label can comprise between about 5 and about 200 nucleotides. A dimension label can comprise between about 10 and about 150 nucleotides. A dimension label can comprise between about 20 and about 125 nucleotides.

Spatial Labels
A stochastic barcode can comprise one or more spatial labels. In some embodiments, a spatial label can comprise a nucleic acid sequence that provides information about the spatial orientation of a target molecule that is associated with the stochastic barcode. A spatial label can be associated with a coordinate in a sample. The coordinate can be a fixed coordinate. For example, a coordinate can be fixed in reference to a substrate. A spatial label can be in reference to a two- or three-dimensional grid. A coordinate can be fixed in reference to a landmark. The landmark can be identifiable in space. A landmark can be a structure that can be imaged. A landmark can be a biological structure, for example an anatomical landmark. A landmark can be a cellular landmark (e.g., an organelle). A landmark can be a non-natural landmark, such as a structure with an identifiable identifier such as a color code, barcode, magnetic property, fluorescence, radioactivity, or a unique size or shape. A spatial label can be associated with a physical partition (e.g., a well, a container, or a droplet). In some embodiments, multiple spatial labels are used together to encode one or more positions in space.

The spatial label can be identical for all stochastic barcodes attached to a given solid support (e.g., bead), but different for different solid supports (e.g., beads). In some embodiments, the percentage of stochastic barcodes on the same solid support comprising the same spatial label can be, or be about, 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, or 100%, or a number or a range between any two of these values. In some embodiments, the percentage of stochastic barcodes on the same solid support comprising the same spatial label can be at least, or at most, 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, or 100%. In some embodiments, at least 60% of the stochastic barcodes on the same solid support can comprise the same spatial label. In some embodiments, at least 95% of the stochastic barcodes on the same solid support can comprise the same spatial label.

There can be as many as 10⁶ or more unique spatial label sequences represented in a plurality of solid supports (e.g., beads). A spatial label can be, or be about, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or a number or a range between any two of these values. A spatial label can be at least, or at most, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 200, or 300 nucleotides in length. A spatial label can comprise between about 5 and about 200 nucleotides. A spatial label can comprise between about 10 and about 150 nucleotides. A spatial label can comprise between about 20 and about 125 nucleotides.

Cell Labels
A stochastic barcode can comprise one or more cell labels. In some embodiments, a cell label can comprise a nucleic acid sequence that provides information for determining which target nucleic acid originated from which cell. In some embodiments, the cell label is identical for all stochastic barcodes attached to a given solid support (e.g., bead), but different for different solid supports (e.g., beads). In some embodiments, the percentage of stochastic barcodes on the same solid support comprising the same cell label can be, or be about, 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, or 100%, or a number or a range between any two of these values. In some embodiments, the percentage of stochastic barcodes on the same solid support comprising the same cell label can be, or be about, 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, or 100%. For example, at least 60% of the stochastic barcodes on the same solid support can comprise the same cell label. As another example, at least 95% of the stochastic barcodes on the same solid support can comprise the same cell label.
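The percentage criteria above can be checked mechanically. The short Python sketch below, offered only as an illustration, takes the cell label sequences read from the barcodes on one solid support and reports the most common label together with the fraction of barcodes that carry it. The function name and the example sequences are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def dominant_label_fraction(labels: Iterable[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of barcodes carrying it."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels supplied")
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())
```

For instance, 19 barcodes with one cell label and 1 barcode with a different label give a fraction of 0.95, which would meet an "at least 95%" criterion but not a "100%" one.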

There can be as many as 10⁶ or more unique cell label sequences associated with a plurality of solid supports (e.g., beads). A cell label can be, or be about, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or a number or a range between any two of these values. A cell label can be at least, or at most, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 200, or 300 nucleotides in length. For example, a cell label can comprise from about 5 to about 200 nucleotides. As another example, a cell label can comprise from about 10 to about 150 nucleotides. As yet another example, a cell label can comprise from about 20 to about 125 nucleotides.

Molecular Labels
A stochastic barcode can comprise one or more molecular labels. In some embodiments, a molecular label can comprise a nucleic acid sequence that provides identifying information for the specific type of target nucleic acid species hybridized to the stochastic barcode. A molecular label can comprise a nucleic acid sequence that provides a counter for the specific occurrence of the target nucleic acid species hybridized to the stochastic barcode (e.g., to its target-binding region).

In some embodiments, a diverse set of molecular labels is attached to a given solid support (e.g., a bead). In some embodiments, there can be, or be about, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, or 10⁹ unique molecular label sequences, or a number or a range between any two of these values. For example, a plurality of stochastic barcodes can comprise about 6561 molecular labels with distinct sequences. As another example, a plurality of stochastic barcodes can comprise about 65536 molecular labels with distinct sequences. In some embodiments, there can be at least, or at most, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, or 10⁹ unique molecular label sequences. The unique molecular label sequences are attached to a given solid support (e.g., a bead).
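As a small arithmetic illustration of the label counts quoted above: 65536 equals 4⁸ and 6561 equals 3⁸, which would be consistent with 8-position labels drawn from four or from three possible bases per position. The text here does not state that derivation, so the sketch below treats it as an assumption; the function names `label_space` and `min_label_length` are hypothetical helpers introduced only for this example.

```python
def label_space(alphabet_size: int, length: int) -> int:
    """Count distinct sequences of `length` symbols over `alphabet_size` symbols."""
    return alphabet_size ** length


def min_label_length(n_labels: int, alphabet_size: int = 4) -> int:
    """Smallest length whose sequence space holds at least `n_labels` sequences."""
    length, capacity = 0, 1
    while capacity < n_labels:
        length += 1
        capacity *= alphabet_size
    return length


print(label_space(4, 8))        # 65536
print(label_space(3, 8))        # 6561
print(min_label_length(10**6))  # 10, because 4**9 = 262144 < 10**6 <= 4**10
```

Integer multiplication is used instead of logarithms so that the length calculation is exact at the boundaries.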

A molecular label can be, or be about, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or a number or a range between any two of these values. A molecular label can be at least, or at most, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 200, or 300 nucleotides in length.

Target-Binding Regions
A stochastic barcode can comprise one or more target-binding regions. In some embodiments, a target-binding region can hybridize with a target of interest. In some embodiments, a target-binding region can comprise a nucleic acid sequence that hybridizes specifically to a target (e.g., a target nucleic acid or target molecule, such as a cellular nucleic acid to be analyzed), for example to a specific gene sequence. In some embodiments, a target-binding region can comprise a nucleic acid sequence that can attach (e.g., hybridize) to a specific location on a specific target nucleic acid. In some embodiments, a target-binding region can comprise a nucleic acid sequence capable of specific hybridization to a restriction enzyme site overhang (e.g., an EcoRI sticky-end overhang). The stochastic barcode can then ligate to any nucleic acid molecule comprising a sequence complementary to the restriction site overhang.

In some embodiments, a target-binding region can comprise a non-specific target nucleic acid sequence. A non-specific target nucleic acid sequence can refer to a sequence that can bind to multiple target nucleic acids independently of the specific sequence of the target nucleic acid. For example, a target-binding region can comprise a random multimer sequence, or an oligo(dT) sequence that hybridizes to the poly(A) tail of mRNA molecules. A random multimer sequence can be, for example, a random dimer, trimer, tetramer, pentamer, hexamer, heptamer, octamer, nonamer, or decamer, or a higher-order random multimer of any length. In some embodiments, the target-binding region is identical for all stochastic barcodes attached to a given bead. In some embodiments, the target-binding regions of the plurality of stochastic barcodes attached to a given bead comprise two or more different target-binding sequences. A target-binding region can be, or be about, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or a number or a range between any two of these values, or longer. A target-binding region can be at most about 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 nucleotides in length, or longer.

In some embodiments, a target-binding region can comprise an oligo(dT) that can hybridize with mRNAs comprising polyadenylated ends. A target-binding region can be gene-specific. For example, a target-binding region can be configured to hybridize to a specific region of a target. A target-binding region can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides in length, or a number or a range between any two of these values. A target-binding region can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides in length. A target-binding region can be about 5 to 30 nucleotides in length. When a stochastic barcode comprises a gene-specific target-binding region, the stochastic barcode can be referred to as a gene-specific stochastic barcode.
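The difference between an oligo(dT) region and a gene-specific region can be illustrated with a simple complementarity check. The following is a minimal sketch, not the method of this disclosure: it assumes perfect Watson-Crick pairing with no mismatches, writes the mRNA in the DNA alphabet (T in place of U), and uses invented toy sequences. The helper names `reverse_complement` and `binding_site` are hypothetical.

```python
# Translation table for complementing DNA-alphabet bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA-alphabet sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def binding_site(target: str, binding_region: str) -> int:
    """Return the 0-based index in `target` where `binding_region` could
    base-pair perfectly, or -1 if no perfectly complementary site exists."""
    return target.upper().find(reverse_complement(binding_region))


# Toy target: a short "gene body" followed by an 8-base poly(A) tail.
target = "GGCATT" + "A" * 8

print(binding_site(target, "T" * 8))   # 6: oligo(dT) pairs with the poly(A) tail
print(binding_site(target, "AATGCC"))  # 0: gene-specific region pairs with GGCATT
print(binding_site(target, "CCCCCC"))  # -1: no complementary site in this target
```

An oligo(dT) region would match any target carrying a long enough poly(A) tail, whereas the gene-specific region matches only targets containing its complementary sequence.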

Orientation Property
A stochastic barcode can comprise one or more orientation properties that can be used to orient (e.g., align) the stochastic barcodes. A stochastic barcode can comprise a moiety for isoelectric focusing. Different stochastic barcodes can have different isoelectric focusing points. When such stochastic barcodes are introduced into a sample, the sample can undergo isoelectric focusing to orient the stochastic barcodes in a known configuration. In this way, the orientation property can be used to develop a known map of stochastic barcodes in a sample. Exemplary orientation properties can include electrophoretic mobility (e.g., based on the size of the stochastic barcode), isoelectric point, spin, conductivity, and/or self-assembly. For example, stochastic barcodes with a self-assembly orientation property can self-assemble into a specific orientation (e.g., a nucleic acid nanostructure) upon activation.

Affinity Property
A stochastic barcode can comprise one or more affinity properties. For example, a spatial label can comprise an affinity property. An affinity property can include a chemical and/or biological moiety that can facilitate binding of the stochastic barcode to another entity (e.g., a cell receptor). For example, an affinity property can comprise an antibody, such as an antibody specific for a particular moiety (e.g., a receptor) on a sample. In some embodiments, the antibody can guide the stochastic barcode to a specific cell type or molecule. Targets at and/or near the specific cell type or molecule can be stochastically labeled. Because the antibody can guide the stochastic barcode to a specific location, in some embodiments the affinity property can provide spatial information in addition to the nucleotide sequence of the spatial label. The antibody can be a therapeutic antibody, for example a monoclonal antibody or a polyclonal antibody. The antibody can be humanized or chimeric. The antibody can be a naked antibody or a fusion antibody.

An antibody can be a full-length immunoglobulin molecule (i.e., one that is naturally occurring or formed by normal immunoglobulin gene fragment recombination processes), such as an IgG antibody, or an immunologically active (i.e., specifically binding) portion of an immunoglobulin molecule, such as an antibody fragment.

An antibody fragment can be, for example, a portion of an antibody such as F(ab')2, Fab', Fab, Fv, or sFv. In some embodiments, the antibody fragment can bind the same antigen that is recognized by the full-length antibody. Antibody fragments can include isolated fragments consisting of the variable regions of an antibody, such as "Fv" fragments consisting of the variable regions of the heavy and light chains, and recombinant single-chain polypeptide molecules in which light and heavy chain variable regions are connected by a peptide linker ("scFv proteins"). Exemplary antibodies can include, but are not limited to, antibodies against cancer cells, antibodies against viruses, antibodies that bind to cell surface receptors (CD8, CD34, CD45), and therapeutic antibodies.

Universal Adaptor Primer
A stochastic barcode can comprise one or more universal adaptor primers. For example, a gene-specific stochastic barcode can comprise a universal adaptor primer. A universal adaptor primer can refer to a nucleotide sequence that is universal across all stochastic barcodes. A universal adaptor primer can be used for building gene-specific stochastic barcodes. A universal adaptor primer can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides in length, or a number or a range between any two of these values. A universal adaptor primer can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides in length. A universal adaptor primer can be about 5 to 30 nucleotides in length.

Solid Supports
The stochastic barcodes disclosed herein can, in some embodiments, be associated with a solid support. The solid support can be, for example, a synthetic particle. In some embodiments, some or all of the molecular labels (e.g., the first molecular labels) of a plurality of stochastic barcodes (e.g., the first plurality of stochastic barcodes) on a solid support differ by at least one nucleotide. The cell labels of the stochastic barcodes on the same solid support can be the same. The cell labels of the stochastic barcodes on different solid supports can differ by at least one nucleotide. For example, the first cell labels of a first plurality of stochastic barcodes on a first solid support can have the same sequence, and the second cell labels of a second plurality of stochastic barcodes on a second solid support can have the same sequence. The first cell labels of the first plurality of stochastic barcodes on the first solid support and the second cell labels of the second plurality of stochastic barcodes on the second solid support can differ by at least one nucleotide. A cell label can be, for example, about 5 to 20 nucleotides long. A molecular label can be, for example, about 5 to 20 nucleotides long. The synthetic particle can be, for example, a bead.
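The requirement that cell labels on different solid supports differ by at least one nucleotide can be checked with a pairwise Hamming distance. The sketch below is only illustrative: the bead names and 8-nucleotide labels are invented, equal-length labels are assumed, and `hamming` and `min_pairwise_distance` are hypothetical helper names.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length labels differ."""
    if len(a) != len(b):
        raise ValueError("labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(labels) -> int:
    """Smallest Hamming distance over all pairs of labels."""
    return min(hamming(a, b) for a, b in combinations(labels, 2))


# Hypothetical cell labels, one per bead.
bead_cell_labels = {
    "bead_1": "ACGTACGT",
    "bead_2": "ACGTACGA",
    "bead_3": "TTGTACGT",
}

d = min_pairwise_distance(bead_cell_labels.values())
print(d)       # 1
print(d >= 1)  # True: every pair of beads differs by at least one nucleotide
```

A minimum distance of 1 satisfies the stated condition, but a single substitution error could then convert one label into another; the same function can be used to test a label set against a larger minimum distance.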

The bead can be, for example, a silica gel bead, a controlled pore glass bead, a magnetic bead, a Dynabead, a Sephadex/Sepharose bead, a cellulose bead, a polystyrene bead, or any combination thereof. The bead can comprise a material such as polydimethylsiloxane (PDMS), polystyrene, glass, polypropylene, agarose, gelatin, hydrogel, paramagnetic material, ceramic, plastic, methylstyrene, acrylic polymer, titanium, latex, Sepharose, cellulose, nylon, silicon, or any combination thereof.

In some embodiments, the bead can be a polymeric bead, for example a deformable bead or a gel bead, functionalized with stochastic barcodes (such as gel beads from 10X Genomics (San Francisco, CA)). In some embodiments, a gel bead can comprise a polymer-based gel. Gel beads can be generated, for example, by encapsulating one or more polymeric precursors in droplets. Upon exposure of the polymeric precursors to an accelerator (e.g., tetramethylethylenediamine (TEMED)), a gel bead can be generated.

In some embodiments, the polymeric bead can dissolve, melt, or degrade, for example, under a desired condition. The desired condition can include an environmental condition. The desired condition can cause the polymeric bead to dissolve, melt, or degrade in a controlled manner. A gel bead can dissolve, melt, or degrade in response to a chemical stimulus, a physical stimulus, a biological stimulus, a thermal stimulus, a magnetic stimulus, an electric stimulus, a light stimulus, or any combination thereof.

Analytes and/or reagents, such as oligonucleotide barcodes, can be coupled or immobilized to the interior surface of a gel bead (e.g., the interior accessible via diffusion of an oligonucleotide barcode and/or of materials used to generate an oligonucleotide barcode) and/or to the outer surface of a gel bead or of any other microcapsule described herein. Coupling or immobilization can be via any form of chemical bonding (e.g., covalent bond, ionic bond) or physical phenomenon (e.g., Van der Waals forces, dipole-dipole interactions, etc.). In some embodiments, coupling or immobilization of a reagent to a gel bead or any other microcapsule described herein can be reversible, for example via a labile moiety (e.g., via a chemical cross-linker, including the chemical cross-linkers described herein). Upon application of a stimulus, the labile moiety can be cleaved and the immobilized reagent set free. In some cases, the labile moiety is a disulfide bond. For example, where an oligonucleotide barcode is immobilized to a gel bead via a disulfide bond, exposing the disulfide bond to a reducing agent can cleave the disulfide bond and free the oligonucleotide barcode from the bead. The labile moiety can be included as part of a gel bead or microcapsule, as part of a chemical linker that links a reagent or analyte to a gel bead or microcapsule, and/or as part of a reagent or analyte.

In some embodiments, a gel bead can comprise a wide range of different polymers, including but not limited to: polymers, heat-sensitive polymers, photosensitive polymers, magnetic polymers, pH-sensitive polymers, salt-sensitive polymers, chemically sensitive polymers, polyelectrolytes, polysaccharides, peptides, proteins, and/or plastics. Polymers can include, but are not limited to, materials such as poly(N-isopropylacrylamide) (PNIPAAm), poly(styrene sulfonate) (PSS), poly(allylamine) (PAAm), poly(acrylic acid) (PAA), poly(ethyleneimine) (PEI), poly(diallyldimethylammonium chloride) (PDADMAC), poly(pyrrole) (PPy), poly(vinylpyrrolidone) (PVPON), poly(vinylpyridine) (PVP), poly(methacrylic acid) (PMAA), poly(methyl methacrylate) (PMMA), polystyrene (PS), poly(tetrahydrofuran) (PTHF), poly(phthalaldehyde) (PTHF), poly(hexyl viologen) (PHV), poly(L-lysine) (PLL), poly(L-arginine) (PARG), and poly(lactic-co-glycolic acid) (PLGA).

Numerous chemical stimuli can be used to trigger the disruption or degradation of the beads. Examples of these chemical changes can include, but are not limited to, pH-mediated changes to the bead wall, disintegration of the bead wall via chemical cleavage of crosslink bonds, triggered depolymerization of the bead wall, and bead wall switching reactions. Bulk changes can also be used to trigger disruption of the beads.

Bulk or physical changes to the microcapsule through various stimuli also offer many advantages in designing capsules to release reagents. Bulk or physical changes occur on a macroscopic scale, in which bead rupture is the result of mechano-physical forces induced by a stimulus. These processes can include, but are not limited to, pressure-induced rupture, bead wall melting, or changes in the porosity of the bead wall.

Biological stimuli can also be used to trigger disruption or degradation of beads. Generally, biological triggers resemble chemical triggers, but many examples use biomolecules, or molecules commonly found in living systems, such as enzymes, peptides, saccharides, and nucleic acids. For example, beads can comprise polymers with peptide cross-links that are sensitive to cleavage by specific proteases. More specifically, one example can comprise a microcapsule comprising GFLGK peptide cross-links. Upon addition of a biological trigger such as the protease Cathepsin B, the peptide cross-links of the shell wall are cleaved and the contents of the beads are released. In other cases, the protease can be heat-activated. In another example, beads comprise a shell wall comprising cellulose. Addition of the hydrolytic enzyme chitosan serves as a biological trigger for cleavage of cellulosic bonds, depolymerization of the shell wall, and release of its inner contents.

The beads can also be induced to release their contents upon application of a thermal stimulus. A change in temperature can cause a variety of changes to the beads. A change in heat can cause a bead to melt such that the bead wall disintegrates. In other cases, heat can increase the internal pressure of the inner components of the bead such that the bead ruptures or bursts. In still other cases, heat can transform the bead into a shrunken, dehydrated state. Heat can also act upon heat-sensitive polymers within the wall of a bead to cause disruption of the bead.

Including magnetic nanoparticles in the bead wall of microcapsules can allow triggered rupture of the beads as well as guiding of the beads in an array. A device of this disclosure can comprise magnetic beads for either purpose. In one example, incorporation of Fe₃O₄ nanoparticles into polyelectrolyte-containing beads triggers rupture in the presence of an oscillating magnetic field stimulus.

A bead can also be disrupted or degraded as the result of electrical stimulation. Similar to the magnetic particles described in the previous section, electrically sensitive beads can allow both triggered rupture of the beads and other functions such as alignment in an electric field, electrical conductivity, or redox reactions. In one example, beads containing electrically sensitive material are aligned in an electric field such that release of inner reagents can be controlled. In other examples, electric fields can induce redox reactions within the bead wall itself that can increase porosity.

A light stimulus can also be used to disrupt the beads. Numerous light triggers are possible and can include systems that use various molecules, such as nanoparticles and chromophores, capable of absorbing photons of specific ranges of wavelengths. For example, metal oxide coatings can be used as capsule triggers. UV irradiation of polyelectrolyte capsules coated with SiO₂ can result in disintegration of the bead wall. In yet another example, photo-switchable materials such as azobenzene groups can be incorporated in the bead wall. Upon application of UV or visible light, chemicals such as these undergo a reversible cis-to-trans isomerization upon absorption of photons. In this aspect, incorporation of photon switches results in a bead wall that can disintegrate or become more porous upon application of a light trigger.

For example, in the non-limiting example of stochastic barcoding illustrated in FIG. 2, after cells, such as single cells, are introduced into a plurality of microwells of a microwell array at block 208, beads can be introduced into the plurality of microwells of the microwell array at block 212. Each microwell can comprise one bead. The beads can comprise a plurality of stochastic barcodes. A stochastic barcode can comprise a 5' amine region attached to a bead. The stochastic barcode can comprise a universal label, a molecular label, a target-binding region, or any combination thereof.

The stochastic barcodes disclosed herein can be associated with (e.g., attached to) a solid support (e.g., a bead). The stochastic barcodes associated with a solid support can each comprise a molecular label selected from a group comprising at least 100 or 1000 molecular labels with unique sequences. In some embodiments, different stochastic barcodes associated with a solid support can comprise molecular labels of different sequences. In some embodiments, a percentage of the stochastic barcodes associated with a solid support comprises the same cell label. For example, the percentage can be, or be about, 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, 100%, or a number or a range between any two of these values. As another example, the percentage can be at least, or at most, 60%, 70%, 80%, 85%, 90%, 95%, 97%, 99%, or 100%. In some embodiments, the stochastic barcodes associated with a solid support can comprise the same cell label. The stochastic barcodes associated with different solid supports can comprise different cell labels selected from a group comprising at least 100 or 1000 cell labels with unique sequences.
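The percentage of stochastic barcodes on one solid support that share the same cell label can be computed by counting labels. The following sketch assumes the cell label of every barcode on a bead is already known as a string; the example labels are invented, and `dominant_cell_label_fraction` is a hypothetical helper name.

```python
from collections import Counter


def dominant_cell_label_fraction(cell_labels):
    """Return the most common cell label on a bead and the fraction of
    that bead's barcodes that carry it."""
    counts = Counter(cell_labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(cell_labels)


# Hypothetical bead: 19 barcodes carry the intended cell label, 1 carries a variant.
bead = ["ACGTACGT"] * 19 + ["ACGTACGA"]

label, fraction = dominant_cell_label_fraction(bead)
print(label)             # ACGTACGT
print(f"{fraction:.0%}")  # 95%
```

In this toy case the bead would fall within the "at least 95%" example given above.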

The stochastic barcodes disclosed herein can be associated with (e.g., attached to) a solid support (e.g., a bead). In some embodiments, stochastically barcoding the plurality of targets in the sample can be performed with a solid support comprising a plurality of synthetic particles associated with the plurality of stochastic barcodes. In some embodiments, the solid support can comprise a plurality of synthetic particles associated with the plurality of stochastic barcodes. The spatial labels of the plurality of stochastic barcodes on different solid supports can differ by at least one nucleotide. The solid support can, for example, comprise the plurality of stochastic barcodes in two or three dimensions. The synthetic particles can be beads. The beads can be silica gel beads, controlled pore glass beads, magnetic beads, Dynabeads, Sephadex/Sepharose beads, cellulose beads, polystyrene beads, or any combination thereof. The solid support can comprise a polymer, a matrix, a hydrogel, a needle array device, an antibody, or any combination thereof. In some embodiments, the solid supports can be free floating. In some embodiments, the solid supports can be embedded in a semi-solid or solid array. The stochastic barcodes need not be associated with solid supports. The stochastic barcodes can be individual nucleotides. The stochastic barcodes can be associated with a substrate.

As used herein, the terms "tethered," "attached," and "immobilized" are used interchangeably and can refer to covalent or non-covalent means for attaching stochastic barcodes to a solid support. Any of a variety of different solid supports can be used as solid supports for attaching pre-synthesized stochastic barcodes or for in situ solid-phase synthesis of stochastic barcodes.

In some embodiments, the solid support is a bead. The bead can encompass one or more types of solid, porous, or hollow sphere, ball, bearing, cylinder, or other similar configuration on which a nucleic acid can be immobilized (e.g., covalently or non-covalently). The bead can be composed of, for example, plastic, ceramic, metal, or polymeric material, or any combination thereof. A bead can be, or can include, a discrete particle that is spherical (e.g., a microsphere) or that has a non-spherical or irregular shape, such as cubic, cuboid, pyramidal, cylindrical, conical, oblate, or disc-shaped. In some embodiments, a bead can be non-spherical in shape.

Beads can include a variety of materials including, but not limited to, paramagnetic materials (e.g., magnesium, molybdenum, lithium, and tantalum), superparamagnetic materials (e.g., ferrite (Fe₃O₄; magnetite) nanoparticles), ferromagnetic materials (e.g., iron, nickel, cobalt, some alloys thereof, and some rare earth metal compounds), ceramic, plastic, glass, polystyrene, silica, methylstyrene, acrylic polymers, titanium, latex, Sepharose, agarose, hydrogel, polymer, cellulose, nylon, and any combination thereof.

In some embodiments, the bead (e.g., the bead to which the stochastic barcodes are attached) is a hydrogel bead. In some embodiments, the bead includes a hydrogel.

Some embodiments disclosed herein include one or more particles (e.g., beads). Each of the particles can include a plurality of oligonucleotides (e.g., stochastic barcodes). Each of the plurality of oligonucleotides can include a molecular label sequence, a cell label sequence, and a target-binding region (e.g., an oligo(dT) sequence, a gene-specific sequence, a random multimer, or a combination thereof). The cell label sequence of each of the plurality of oligonucleotides can be the same. The cell label sequences of oligonucleotides on different particles can be different, so that the oligonucleotides on different particles can be identified. The number of different cell label sequences can differ between implementations.

In some embodiments, the number of cell label sequences can be 10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 10⁶, 10⁷, 10⁸, 10⁹, a number or a range between any two of these values, or more, or about such a value. In some embodiments, the number of cell label sequences can be at least, or at most, 10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 10⁶, 10⁷, 10⁸, or 10⁹.

In some embodiments, no more than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more of the plurality of particles include oligonucleotides with the same cell label sequence. In some embodiments, the particles that include oligonucleotides with the same cell label sequence can be at most 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.7%, 0.8%, 0.9%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, or more of the plurality of particles. In some embodiments, none of the plurality of particles has the same cell label sequence.
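The oligonucleotide layout described above (one cell label shared by every oligonucleotide on a particle, a molecular label that varies between oligonucleotides, and a target-binding region) can be sketched in Python as follows. This is only an illustration: the label lengths, the oligo(dT) length, and the names `BarcodeOligo`, `random_seq`, and `make_bead` are assumptions made for the example and are not taken from the disclosure.

```python
import random
from dataclasses import dataclass

BASES = "ACGT"


@dataclass(frozen=True)
class BarcodeOligo:
    """One oligonucleotide on a particle (hypothetical representation)."""
    cell_label: str        # identical for every oligo on the same particle
    molecular_label: str   # varies between oligos on the same particle
    target_binding: str    # e.g. an oligo(dT) stretch


def random_seq(length, rng):
    """Return a uniformly random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def make_bead(n_oligos, rng, cell_len=9, mol_len=8, dt_len=18):
    """Build the oligos of one particle; all lengths are illustrative assumptions."""
    cell_label = random_seq(cell_len, rng)  # drawn once per particle
    return [
        BarcodeOligo(cell_label, random_seq(mol_len, rng), "T" * dt_len)
        for _ in range(n_oligos)
    ]


rng = random.Random(0)
bead = make_bead(50, rng)
print(len({oligo.cell_label for oligo in bead}))       # one shared cell label
print(len({oligo.molecular_label for oligo in bead}))  # number of distinct molecular labels
```

Drawing the cell label once per particle and the molecular label once per oligonucleotide mirrors the distinction the text makes between the two label types.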

The plurality of oligonucleotides on each particle can include different molecular label sequences. In some embodiments, the number of molecular label sequences can be 10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 10⁶, 10⁷, 10⁸, 10⁹, a number or a range between any two of these values, or about such a value. The number of molecular label sequences can be at least, or at most, 10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 10⁶, 10⁷, 10⁸, or 10⁹. For example, at least 100 of the plurality of oligonucleotides include different molecular label sequences. As another example, in a single particle, at least 100, 500, 1000, 5000, 10000, 15000, 20000, 50000, a number or a range between any two of these values, or more of the plurality of oligonucleotides include different molecular label sequences.

Some embodiments provide a plurality of particles that include stochastic barcodes. In some embodiments, the ratio of the number of occurrences (or copies) of a target to the number of different molecular label sequences can be at least 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11, 1:12, 1:13, 1:14, 1:15, 1:16, 1:17, 1:18, 1:19, 1:20, 1:30, 1:40, 1:50, 1:60, 1:70, 1:80, 1:90, or more. In some embodiments, each of the plurality of oligonucleotides further includes a sample label, a universal label, or both. The particles can be, for example, nanoparticles or microparticles.
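One way to see why the ratio of target copies to available molecular label sequences matters is to compute how many distinct labels are expected when copies are labeled at random. The sketch below assumes that each copy draws a label independently and uniformly from the available labels; that model and the name `expected_distinct_labels` are assumptions for illustration and are not stated in the disclosure. Under that model the expected number of distinct labels for n copies and m labels is m(1 − (1 − 1/m)ⁿ).

```python
def expected_distinct_labels(n_targets, n_labels):
    """Expected number of distinct labels seen when n_targets copies each draw
    one of n_labels labels independently and uniformly (an assumed model)."""
    if n_targets < 0 or n_labels <= 0:
        raise ValueError("n_targets must be >= 0 and n_labels must be > 0")
    return n_labels * (1.0 - (1.0 - 1.0 / n_labels) ** n_targets)


# With as many labels as copies (1:1), collisions hide a large share of copies;
# with many more labels than copies (1:90), almost every copy is distinct.
for ratio in (1, 10, 90):
    copies = 100
    labels = copies * ratio
    print(ratio, round(expected_distinct_labels(copies, labels), 2))
```

With more labels per target copy, the expected number of distinct labels approaches the true copy number, which is the motivation for using a label pool much larger than the expected number of copies.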

The size of the beads can vary. For example, the diameter of a bead can range from 0.1 micrometers to 50 micrometers. In some embodiments, the diameter of a bead can be 0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, or 50 micrometers, a number or a range between any two of these values, or about such a value.

The diameter of a bead can be related to the diameter of the wells of the substrate. In some embodiments, the diameter of the bead can be longer or shorter than the diameter of the well by 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, a number or a range between any two of these values, or about such a value. The diameter of a bead can be related to the diameter of a cell (e.g., a single cell entrapped in a well of the substrate). In some embodiments, the diameter of the bead can be longer or shorter than the diameter of the cell by 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 250%, 300%, a number or a range between any two of these values, or about such a value.

A bead can be attached to and/or embedded in a substrate. A bead can be attached to and/or embedded in a gel, hydrogel, polymer, and/or matrix. The spatial position of a bead within a substrate (e.g., a gel, matrix, scaffold, or polymer) can be identified using the spatial label present on the stochastic barcode on the bead, which can serve as a location address.

Examples of beads can include, but are not limited to, streptavidin beads, agarose beads, magnetic beads, Dynabeads®, MACS® microbeads, antibody-conjugated beads (e.g., anti-immunoglobulin microbeads), protein A-conjugated beads, protein G-conjugated beads, protein A/G-conjugated beads, protein L-conjugated beads, oligo(dT)-conjugated beads, silica beads, silica-like beads, anti-biotin microbeads, anti-fluorochrome microbeads, and BcMag™ carboxyl-terminated magnetic beads.

A bead can be associated with (e.g., impregnated with) quantum dots or fluorescent dyes so that it fluoresces in one fluorescence optical channel or in multiple optical channels. A bead can be associated with iron oxide or chromium oxide to make it paramagnetic or ferromagnetic. Beads can be identifiable. For example, a bead can be imaged using a camera. A bead can have a detectable code associated with it. For example, a bead can include a stochastic barcode. A bead can change size, for example, due to swelling in an organic or inorganic solution. A bead can be hydrophobic. A bead can be hydrophilic. A bead can be biocompatible.

A solid support (e.g., a bead) can be visualized. The solid support can include a visualizing tag (e.g., a fluorescent dye). A solid support (e.g., a bead) can be etched with an identifier (e.g., a number). The identifier can be visualized by imaging the beads.

Substrates and Microwell Arrays
As used herein, a substrate can refer to a type of solid support. A substrate can refer to a solid support that can include the stochastic barcodes of the disclosure. A substrate can, for example, include a plurality of microwells. For example, a substrate can be a well array including two or more microwells. In some embodiments, a microwell can include a small reaction chamber of defined volume. In some embodiments, a microwell can entrap one or more cells. In some embodiments, a microwell can entrap only one cell. In some embodiments, a microwell can entrap one or more solid supports. In some embodiments, a microwell can entrap only one solid support. In some embodiments, a microwell entraps a single cell and a single solid support (e.g., a bead).

Methods of Stochastic Barcoding
The disclosure provides methods for estimating the number of distinct targets at distinct locations in a physical sample (e.g., a tissue, organ, tumor, or cell). The methods can include placing the stochastic barcodes in close proximity with the sample, lysing the sample, associating distinct targets with the stochastic barcodes, and amplifying the targets and/or digitally counting the targets. The methods can further include analyzing and/or visualizing the information obtained from the spatial labels on the stochastic barcodes. In some embodiments, a method includes visualizing the plurality of targets in the sample. Mapping the plurality of targets onto a map of the sample can include generating a two-dimensional map or a three-dimensional map of the sample. The two-dimensional map or the three-dimensional map can be generated before or after stochastically barcoding the plurality of targets in the sample. Visualizing the plurality of targets in the sample can include mapping the plurality of targets onto a map of the sample. Mapping the plurality of targets onto the map of the sample can include generating a two-dimensional map or a three-dimensional map of the sample.

The two-dimensional map and the three-dimensional map can be generated before or after stochastically barcoding the plurality of targets in the sample. In some embodiments, the two-dimensional map and the three-dimensional map can be generated before or after lysing the sample. Lysing the sample before or after generating the two-dimensional map or the three-dimensional map can include heating the sample, contacting the sample with a detergent, changing the pH of the sample, or any combination thereof.
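The mapping step can be pictured as a lookup from spatial label to position followed by a count of distinct molecular labels per position and target. The sketch below is a minimal illustration under stated assumptions: the record layout `(spatial_label, gene, molecular_label)`, the `label_to_xy` lookup table, and the function name `map_targets` are all hypothetical and do not come from the disclosure.

```python
from collections import defaultdict


def map_targets(records, label_to_xy):
    """Count distinct molecular labels per (position, gene).

    records: iterable of (spatial_label, gene, molecular_label) tuples.
    label_to_xy: dict from spatial label sequence to an (x, y) position
                 (a hypothetical lookup table).
    """
    distinct = defaultdict(set)
    for spatial_label, gene, molecular_label in records:
        xy = label_to_xy.get(spatial_label)
        if xy is None:
            continue  # unknown spatial label: no position can be assigned
        distinct[(xy, gene)].add(molecular_label)
    return {key: len(labels) for key, labels in distinct.items()}


label_to_xy = {"ACGT": (0, 0), "TTGA": (0, 1)}
records = [
    ("ACGT", "GeneA", "AAAC"),
    ("ACGT", "GeneA", "AAAC"),  # same molecule read twice, counted once
    ("ACGT", "GeneA", "GGTC"),
    ("TTGA", "GeneA", "AAAC"),
    ("CCCC", "GeneB", "TTTT"),  # spatial label not in the lookup table
]
print(map_targets(records, label_to_xy))
```

A three-dimensional map would follow the same pattern with `(x, y, z)` positions in the lookup table.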

In some embodiments, stochastically barcoding the plurality of targets includes hybridizing a plurality of stochastic barcodes with the plurality of targets to generate stochastically barcoded targets. Stochastically barcoding the plurality of targets can include generating an indexed library of the stochastically barcoded targets. Generating the indexed library of stochastically barcoded targets can be performed with a solid support that includes the plurality of stochastic barcodes.

Contacting a Sample and a Stochastic Barcode
The disclosure provides methods for contacting a sample (e.g., cells) with a substrate of the disclosure. For example, a sample including cells, organs, or tissue thin sections can be contacted with stochastic barcodes. The cells can be contacted, for example, by gravity flow, in which case the cells can settle and form a monolayer. The sample can be a tissue thin section. The thin section can be placed on the substrate. The sample can be one-dimensional (e.g., forming a planar surface). The sample (e.g., cells) can be spread across the substrate, for example, by growing or culturing the cells on the substrate.

When the stochastic barcodes are in close proximity to the targets, the targets can hybridize to the stochastic barcodes. The stochastic barcodes can be contacted at a non-depletable ratio, such that each distinct target can associate with a distinct stochastic barcode of the disclosure. To ensure efficient association between a target and a stochastic barcode, the target can be cross-linked to the stochastic barcode.

Cell Lysis
After the cells and stochastic barcodes have been distributed, the cells can be lysed to release the target molecules. Cell lysis can be accomplished by any of a variety of means, for example, by chemical or biochemical means, by osmotic shock, or by thermal lysis, mechanical lysis, or optical lysis. Cells can be lysed by adding a cell lysis buffer that includes a detergent (e.g., SDS, Li dodecyl sulfate, Triton X-100, Tween-20, or NP-40), an organic solvent (e.g., methanol or acetone), a digestive enzyme (e.g., proteinase K, pepsin, or trypsin), or any combination thereof. To increase the association of a target with a stochastic barcode, the rate of diffusion of the target molecules can be altered, for example, by reducing the temperature and/or increasing the viscosity of the lysate.

In some embodiments, the sample can be lysed using a filter paper. The filter paper can be soaked with a lysis buffer on top of the filter paper. The filter paper can be applied to the sample with pressure, which can facilitate lysis of the sample and hybridization of the targets of the sample to the substrate.

In some embodiments, lysis can be performed by mechanical lysis, heat lysis, optical lysis, and/or chemical lysis. Chemical lysis can include the use of digestive enzymes such as proteinase K, pepsin, and trypsin. Lysis can be performed by adding a lysis buffer to the substrate.

The lysis buffer can include Tris HCl. The lysis buffer can include at least about 0.01, 0.05, 0.1, 0.5, or 1 M or more Tris HCl. The lysis buffer can include at most about 0.01, 0.05, 0.1, 0.5, or 1 M or more Tris HCl. The lysis buffer can include about 0.1 M Tris HCl. The pH of the lysis buffer can be at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more. The pH of the lysis buffer can be at most about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more. In some embodiments, the pH of the lysis buffer is about 7.5.

The lysis buffer can include a salt (e.g., LiCl). The concentration of salt in the lysis buffer can be at least about 0.1, 0.5, or 1 M or more. The concentration of salt in the lysis buffer can be at most about 0.1, 0.5, or 1 M or more. In some embodiments, the concentration of salt in the lysis buffer is about 0.5 M.

The lysis buffer can include a detergent (e.g., SDS, Li dodecyl sulfate, Triton X, Tween, or NP-40). The concentration of the detergent in the lysis buffer can be at least about 0.0001%, 0.0005%, 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, or 7% or more. The concentration of the detergent in the lysis buffer can be at most about 0.0001%, 0.0005%, 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, or 7% or more. In some embodiments, the concentration of the detergent in the lysis buffer is about 1% Li dodecyl sulfate. The time used for lysis in the method can depend on the amount of detergent used. In some embodiments, the more detergent used, the less time is needed for lysis.

The lysis buffer can include a chelating agent (e.g., EDTA or EGTA). The concentration of the chelating agent in the lysis buffer can be at least about 1, 5, 10, 15, 20, 25, or 30 mM or more. The concentration of the chelating agent in the lysis buffer can be at most about 1, 5, 10, 15, 20, 25, or 30 mM or more. In some embodiments, the concentration of the chelating agent in the lysis buffer is about 10 mM.

The lysis buffer can include a reducing reagent (e.g., β-mercaptoethanol or DTT). The concentration of the reducing reagent in the lysis buffer can be at least about 1, 5, 10, 15, or 20 mM or more. The concentration of the reducing reagent in the lysis buffer can be at most about 1, 5, 10, 15, or 20 mM or more. In some embodiments, the concentration of the reducing reagent in the lysis buffer is about 5 mM. In some embodiments, the lysis buffer can include about 0.1 M Tris HCl at about pH 7.5, about 0.5 M LiCl, about 1% lithium dodecyl sulfate, about 10 mM EDTA, and about 5 mM DTT.
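As a worked example of the composition given in the last embodiment (about 0.1 M Tris HCl, 0.5 M LiCl, 1% lithium dodecyl sulfate, 10 mM EDTA, and 5 mM DTT), the sketch below computes how much of each stock solution would be needed for a given final volume using the dilution relation C₁V₁ = C₂V₂. The stock concentrations in `STOCK` are assumptions chosen for illustration and are not specified in the disclosure; the function name `lysis_buffer_volumes` is likewise hypothetical.

```python
# Final concentrations taken from the example composition in the text.
# Units are molar, except the detergent, which is in percent.
FINAL = {
    "Tris HCl, pH 7.5": 0.1,
    "LiCl": 0.5,
    "Li dodecyl sulfate (%)": 1.0,
    "EDTA": 0.010,
    "DTT": 0.005,
}

# Stock concentrations in the same units as FINAL.
# These are ASSUMED values for illustration only.
STOCK = {
    "Tris HCl, pH 7.5": 1.0,
    "LiCl": 8.0,
    "Li dodecyl sulfate (%)": 10.0,
    "EDTA": 0.5,
    "DTT": 1.0,
}


def lysis_buffer_volumes(total_ml):
    """Return the volume (mL) of each stock, plus water, for total_ml of buffer."""
    volumes = {name: total_ml * FINAL[name] / STOCK[name] for name in FINAL}
    water = total_ml - sum(volumes.values())
    if water < 0:
        raise ValueError("stock solutions are too dilute for this recipe")
    volumes["water"] = water
    return volumes


for name, ml in lysis_buffer_volumes(10.0).items():
    print(f"{name}: {ml:.3f} mL")
```

Changing the assumed stock concentrations changes only the computed volumes, not the final composition.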

Lysis can be performed at a temperature of about 4, 10, 15, 20, 25, or 30 °C. Lysis can be performed for about 1, 5, 10, 15, or 20 minutes or more. A lysed cell can include at least about 100000, 200000, 300000, 400000, 500000, 600000, or 700000 or more target nucleic acid molecules. A lysed cell can include at most about 100000, 200000, 300000, 400000, 500000, 600000, or 700000 or more target nucleic acid molecules.

Attachment of Stochastic Barcodes to Target Nucleic Acid Molecules
After the cells are lysed and nucleic acid molecules are released from them, the nucleic acid molecules can randomly associate with the stochastic barcodes of the co-localized solid support. Association can include hybridization of the target recognition region of a stochastic barcode to a complementary portion of the target nucleic acid molecule (e.g., the oligo(dT) of the stochastic barcode can interact with the poly(A) tail of a target). The assay conditions used for hybridization (e.g., buffer pH, ionic strength, temperature) can be chosen to promote the formation of specific, stable hybrids. In some embodiments, the nucleic acid molecules released from the lysed cells can associate with a plurality of probes on the substrate (e.g., hybridize with the probes on the substrate). When the probes include oligo(dT), mRNA molecules can hybridize to the probes and be reverse transcribed. The oligo(dT) portion of the oligonucleotide can act as a primer for first-strand synthesis of the cDNA molecule. For example, in the non-limiting example of stochastic barcoding shown in FIG. 2, block 216, mRNA molecules can hybridize to stochastic barcodes on beads. For example, single-stranded nucleotide fragments can hybridize to the target-binding regions of stochastic barcodes.
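The oligo(dT) and poly(A) interaction mentioned above can be reduced to a simple sequence check: an mRNA is a candidate for capture if its 3' end carries a run of A residues. The sketch below is a deliberately crude model. The minimum tail length of 12 is an illustrative assumption, real hybridization depends on the assay conditions the text lists (buffer pH, ionic strength, temperature), and the function names are hypothetical.

```python
def poly_a_tail_length(mrna):
    """Length of the run of A residues at the 3' end of an mRNA sequence."""
    seq = mrna.upper()
    return len(seq) - len(seq.rstrip("A"))


def can_hybridize_oligo_dt(mrna, min_tail=12):
    """Crude sequence-only check; min_tail is an assumed threshold."""
    return poly_a_tail_length(mrna) >= min_tail


polyadenylated = "AUGGCCUUGA" + "A" * 20
no_tail = "AUGGCCUUGC"
print(poly_a_tail_length(polyadenylated), can_hybridize_oligo_dt(polyadenylated))
print(poly_a_tail_length(no_tail), can_hybridize_oligo_dt(no_tail))
```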

Attachment can further include ligation of the target recognition region of a stochastic barcode to a portion of the target nucleic acid molecule. For example, the target-binding region can include a nucleic acid sequence capable of specific hybridization to a restriction site overhang (e.g., an EcoRI sticky-end overhang). The assay procedure can further include treating the target nucleic acids with a restriction enzyme (e.g., EcoRI) to create a restriction site overhang. The stochastic barcode can then be ligated to any nucleic acid molecule that includes a sequence complementary to the restriction site overhang. A ligase (e.g., T4 DNA ligase) can be used to join the two fragments.
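To make the restriction-overhang example concrete, the sketch below cuts the top strand of a DNA sequence at EcoRI sites. EcoRI recognizes GAATTC and cuts between G and A, leaving a four-base 5' overhang (AATT) on each fragment end. Only the top strand is modeled, and the function name `ecori_digest` is an assumption made for this illustration.

```python
ECORI_SITE = "GAATTC"   # EcoRI recognition sequence
CUT_OFFSET = 1          # EcoRI cuts between G and AATTC on each strand
OVERHANG = "AATT"       # resulting 5' single-stranded overhang


def ecori_digest(seq):
    """Return top-strand fragments (5'->3') after cutting at every EcoRI site."""
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(ECORI_SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(ECORI_SITE, pos + 1)
    fragments.append(seq[start:])
    return fragments


fragments = ecori_digest("AAGAATTCTT")
print(fragments)
# Every fragment downstream of a cut starts with the AATT overhang sequence.
print(all(f.startswith(OVERHANG) for f in fragments[1:]))
```

Because AATT is its own reverse complement, any two EcoRI-generated ends can anneal to each other, which is what allows a barcode carrying a matching overhang to be ligated to the digested target.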

For example, in the non-limiting example of stochastic barcoding illustrated in FIG. 2, block 220, the labeled targets (e.g., target-barcode molecules) from a plurality of cells (or a plurality of samples) can subsequently be pooled, for example, into a tube. The labeled targets can be pooled by, for example, retrieving the stochastic barcodes and/or the beads to which the target-barcode molecules are attached.

Retrieval of solid support-based collections of attached target-barcode molecules can be implemented by using magnetic beads and an externally applied magnetic field. Once the target-barcode molecules have been pooled, all further processing can proceed in a single reaction vessel. Further processing can include, for example, reverse transcription reactions, amplification reactions, cleavage reactions, dissociation reactions, and/or nucleic acid extension reactions. Further processing reactions can also be performed within the microwells, that is, without first pooling the labeled target nucleic acid molecules from a plurality of cells.

Reverse Transcription
The disclosure provides a method for creating a stochastic target-barcode conjugate using reverse transcription (e.g., at block 224 of FIG. 2). The stochastic target-barcode conjugate can include the stochastic barcode and a sequence complementary to all or a portion of the target nucleic acid (i.e., a stochastically barcoded cDNA molecule). Reverse transcription of the associated RNA molecule can occur by adding a reverse transcription primer together with a reverse transcriptase. The reverse transcription primer can be an oligo(dT) primer, a random hexanucleotide primer, or a target-specific oligonucleotide primer. Oligo(dT) primers can be, or can be about, 12 to 18 nucleotides in length and can bind to the endogenous poly(A) tail at the 3' end of mammalian mRNA. Random hexanucleotide primers can bind to mRNA at a variety of complementary sites. Target-specific oligonucleotide primers typically selectively prime the mRNA of interest.
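The structure of a stochastically barcoded cDNA molecule can be sketched as a string operation: the barcode labels followed by the DNA reverse complement of the mRNA. The sketch assumes the simplest case, in which the oligo(dT) region of the barcode anneals at the very 3' end of the poly(A) tail and the whole mRNA is copied, so the T residues of the oligo(dT) coincide with the start of the reverse complement. The function name and the example sequences are hypothetical.

```python
# Map each RNA base to its complementary DNA base.
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(mrna, cell_label, molecular_label):
    """Return the barcoded first-strand cDNA (5'->3') for a fully copied mRNA.

    Assumes the oligo(dT) anneals at the 3' end of the poly(A) tail, so its
    T residues are the first bases of the reverse complement.
    """
    reverse_complement = mrna.upper().translate(RNA_TO_CDNA)[::-1]
    return cell_label + molecular_label + reverse_complement


cdna = first_strand_cdna("AUGGCAAAA", cell_label="CC", molecular_label="GG")
print(cdna)
```

Because the labels sit 5' of the copied sequence, every copy made from this cDNA in later steps carries the same cell label and molecular label.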

In some embodiments, reverse transcription of the labeled RNA molecule can occur by the addition of a reverse transcription primer. In some embodiments, the reverse transcription primer is an oligo(dT) primer, a random hexanucleotide primer, or a target-specific oligonucleotide primer. Generally, oligo(dT) primers are 12 to 18 nucleotides in length and bind to the endogenous poly(A)+ tail at the 3' end of mammalian mRNA. Random hexanucleotide primers can bind to mRNA at a variety of complementary sites. Target-specific oligonucleotide primers typically selectively prime the mRNA of interest.

Reverse transcription can be performed repeatedly to produce multiple labeled cDNA molecules. The methods disclosed herein can include conducting at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 reverse transcription reactions. The methods can include conducting at least about 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 reverse transcription reactions.

Amplification
One or more nucleic acid amplification reactions (e.g., at block 228 of FIG. 2) can be performed to create multiple copies of the labeled target nucleic acid molecules. Amplification can be performed in a multiplexed manner, in which multiple target nucleic acid sequences are amplified simultaneously. The amplification reaction can be used to add sequencing adaptors to the nucleic acid molecules. The amplification reactions can include amplifying at least a portion of the sample label, if present. The amplification reactions can include amplifying at least a portion of the cell label and/or the molecular label. The amplification reactions can include amplifying at least a portion of a sample tag, a cell label, a spatial label, a molecular label, a target nucleic acid, or a combination thereof. The amplification reactions can include amplifying 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 100%, or a range or a number between any two of these values, of the plurality of nucleic acids. The method can further include conducting one or more cDNA synthesis reactions to produce one or more cDNA copies of target-barcode molecules that include a sample label, a cell label, a spatial label, and/or a molecular label.

In some embodiments, amplification can be performed using the polymerase chain reaction (PCR). As used herein, PCR can refer to a reaction for the in vitro amplification of specific DNA sequences by simultaneous primer extension of complementary strands of DNA. As used herein, PCR can encompass derivative forms of the reaction, including but not limited to RT-PCR, real-time PCR, nested PCR, quantitative PCR, multiplexed PCR, digital PCR, and assembly PCR.

Amplification of the labeled nucleic acids can include non-PCR-based methods. Examples of non-PCR-based methods include, but are not limited to, multiple displacement amplification (MDA), transcription-mediated amplification (TMA), nucleic acid sequence-based amplification (NASBA), strand displacement amplification (SDA), real-time SDA, rolling circle amplification, and circle-to-circle amplification. Other non-PCR-based amplification methods include multiple cycles of DNA-dependent RNA polymerase-driven RNA transcription amplification or RNA-directed DNA synthesis and transcription to amplify DNA or RNA targets, the ligase chain reaction (LCR), the Qβ replicase (Qβ) method, the use of palindromic probes, strand displacement amplification, oligonucleotide-driven amplification using a restriction endonuclease, an amplification method in which a primer is hybridized to a nucleic acid sequence and the resulting duplex is cleaved prior to the extension reaction and amplification, strand displacement amplification using a nucleic acid polymerase lacking 5' exonuclease activity, rolling circle amplification, and ramification-extension amplification (RAM). In some embodiments, the amplification can produce circularized transcripts.

In some embodiments, the methods disclosed herein further include performing a polymerase chain reaction on the labeled nucleic acid (e.g., labeled RNA, labeled DNA, labeled cDNA) to generate probabilistically labeled amplicons. The labeled amplicon may be a double-stranded molecule. The double-stranded molecule can include a double-stranded RNA molecule, a double-stranded DNA molecule, or an RNA molecule hybridized to a DNA molecule. One or both strands of the double-stranded molecule may include a sample label, a spatial label, a cell label, and/or a molecular label. The probabilistically labeled amplicon can be a single-stranded molecule. The single-stranded molecule can include DNA, RNA, or a combination thereof. The nucleic acids of the present disclosure may include synthetic or modified nucleic acids.

Amplification may include the use of one or more non-natural nucleotides. Non-natural nucleotides can include photolabile or triggerable nucleotides. Examples of non-natural nucleotides include, but are not limited to, peptide nucleic acid (PNA), morpholino and locked nucleic acid (LNA), as well as glycol nucleic acid (GNA) and threose nucleic acid (TNA). Non-natural nucleotides can be added to one or more cycles of the amplification reaction. The addition of non-natural nucleotides can be used to identify products at specific cycles or time points in the amplification reaction.

Performing the one or more amplification reactions may include the use of one or more primers. The one or more primers may comprise, for example, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 or more nucleotides. The one or more primers may comprise fewer than 12-15 nucleotides. The one or more primers can anneal to at least a portion of the plurality of probabilistically labeled targets. The one or more primers can anneal to the 3' end or the 5' end of the plurality of probabilistically labeled targets. The one or more primers can anneal to an internal region of the plurality of probabilistically labeled targets. The internal region can be at least about 50, 100, 150, 200, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 650, 700, 750, 800, 850, 900, or 1000 nucleotides from the 3' ends of the plurality of probabilistically labeled targets. The one or more primers may include a fixed panel of primers. The one or more primers may include at least one or more custom primers. The one or more primers may include at least one or more control primers. The one or more primers may include at least one or more gene-specific primers.

The one or more primers may include a universal primer. The universal primer can anneal to a universal primer binding site. The one or more custom primers can anneal to a first sample label, a second sample label, a spatial label, a cell label, a molecular label, a target, or any combination thereof. The one or more primers may include a universal primer and a custom primer. The custom primer can be designed to amplify one or more targets. The targets can include a subset of the total nucleic acids in one or more samples. The targets may include a subset of the total probabilistically labeled targets in one or more samples. The one or more primers may include at least 96 or more custom primers. The one or more primers may include at least 960 or more custom primers. The one or more primers may include at least 9600 or more custom primers. The one or more custom primers can anneal to two or more different labeled nucleic acids. The two or more different labeled nucleic acids can correspond to one or more genes.

Any amplification scheme can be used in the methods of the present disclosure. For example, in one scheme, the first round of PCR can amplify molecules attached to the bead using a gene-specific primer and a primer against the universal Illumina sequencing primer 1 sequence. The second round of PCR can amplify the first PCR products using a nested gene-specific primer flanked by the Illumina sequencing primer 2 sequence and a primer against the universal Illumina sequencing primer 1 sequence. The third round of PCR adds P5 and P7 and a sample index to turn the PCR products into an Illumina sequencing library. Sequencing using 150 bp × 2 sequencing can reveal the cell label and molecular label on read 1, the gene on read 2, and the sample index on the index 1 read.
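As an illustration of the read structure described above, the following minimal Python sketch splits read 1 into a cell label and a molecular label and pairs them with the gene fragment from read 2 and the sample index from the index 1 read. The label lengths, their placement at the start of read 1, and all names (CELL_LABEL_LEN, MOLECULAR_LABEL_LEN, parse_read_triplet) are assumptions made for illustration only; the disclosure does not specify them.

```python
from typing import NamedTuple

# Hypothetical layout: the disclosure does not fix label lengths or offsets.
CELL_LABEL_LEN = 9        # assumed length of the cell label, in nucleotides
MOLECULAR_LABEL_LEN = 8   # assumed length of the molecular label, in nucleotides


class ParsedRead(NamedTuple):
    cell_label: str
    molecular_label: str
    gene_fragment: str
    sample_index: str


def parse_read_triplet(read1: str, read2: str, index1: str) -> ParsedRead:
    """Split read 1 into labels and attach the read 2 and index 1 sequences."""
    needed = CELL_LABEL_LEN + MOLECULAR_LABEL_LEN
    if len(read1) < needed:
        raise ValueError(f"read 1 must be at least {needed} nt long")
    cell_label = read1[:CELL_LABEL_LEN]
    molecular_label = read1[CELL_LABEL_LEN:needed]
    return ParsedRead(cell_label, molecular_label, read2, index1)


# Example with made-up sequences.
example = parse_read_triplet("AAACCCGGG" + "TTTTAAAA" + "TTTT", "ACGTACGT", "GATTACA")
```

Grouping parsed reads by (sample_index, cell_label, gene) and counting distinct molecular labels would then give a raw molecular label count per gene and cell, before any error correction.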

In some embodiments, nucleic acids can be removed from the substrate using chemical cleavage. For example, a chemical group or a modified base present in a nucleic acid can be used to facilitate its removal from a solid support. For example, an enzyme can be used to remove a nucleic acid from a substrate. For example, a nucleic acid can be removed from a substrate through restriction endonuclease digestion. For example, treatment of a nucleic acid containing dUTP or ddUTP with uracil-d-glycosylase (UDG) can be used to remove the nucleic acid from a substrate. For example, a nucleic acid can be removed from a substrate using an enzyme that performs nucleotide excision, such as a base excision repair enzyme, for example an apurinic/apyrimidinic (AP) endonuclease. In some embodiments, a nucleic acid can be removed from a substrate using a photocleavable group and light. In some embodiments, a cleavable linker can be used to remove a nucleic acid from the substrate. For example, the cleavable linker may include at least one of biotin/avidin, biotin/streptavidin, biotin/neutravidin, Ig-protein A, a photolabile linker, an acid- or base-labile linker group, or an aptamer.

When the probes are gene-specific, the molecules can hybridize to the probes and be reverse transcribed and/or amplified. In some embodiments, after the nucleic acid has been synthesized (e.g., reverse transcribed), it can be amplified. Amplification can be performed in a multiplexed manner, under conditions in which multiple target nucleic acid sequences are amplified simultaneously. Amplification can add sequencing adapters to the nucleic acid.

In some embodiments, amplification can be performed on the substrate, for example using bridge amplification. A homopolymer tail can be added to the cDNA to generate ends compatible with bridge amplification using oligo(dT) probes on the substrate. In bridge amplification, the primer that is complementary to the 3' end of the template nucleic acid can be the first primer of each pair that is covalently attached to the solid particle. When a sample containing the template nucleic acid is contacted with the particle and a single thermal cycle is performed, the template molecule anneals to the first primer, and the first primer is extended in the forward direction by the addition of nucleotides to form a duplex molecule consisting of the template molecule and a newly formed DNA strand complementary to the template. In the heating step of the next cycle, the duplex molecule is denatured, releasing the template molecule from the particle and leaving the complementary DNA strand attached to the particle through the first primer. In the annealing stage of the annealing and extension step that follows, the complementary strand can hybridize to the second primer, which is complementary to a segment of the complementary strand at a location removed from the first primer. This hybridization can cause the complementary strand to form a bridge between the first and second primers, secured to the first primer by a covalent bond and to the second primer by hybridization. In the extension stage, the second primer can be extended in the reverse direction by the addition of nucleotides in the same reaction mixture, thereby converting the bridge to a double-stranded bridge. The next cycle then begins, and the double-stranded bridge can be denatured to yield two single-stranded nucleic acid molecules, each having one end attached to the particle surface via the first and second primers, respectively, and the other end unattached. In the annealing and extension step of this second cycle, each strand can hybridize to a further complementary primer, previously unused, on the same particle to form a new single-stranded bridge. The two previously unused primers that are now hybridized can be extended to convert the two new bridges to double-stranded bridges.

The amplification reaction may include amplifying at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or 100% of the plurality of nucleic acids.

Amplification of the labeled nucleic acids can include PCR-based methods or non-PCR-based methods. Amplification of the labeled nucleic acids may include exponential amplification of the labeled nucleic acids. Amplification of the labeled nucleic acids may include linear amplification of the labeled nucleic acids. Amplification can be performed by the polymerase chain reaction (PCR). PCR can refer to a reaction for the in vitro amplification of specific DNA sequences by simultaneous primer extension of complementary strands of DNA. PCR can encompass derivative forms of the reaction, including but not limited to RT-PCR, real-time PCR, nested PCR, quantitative PCR, multiplexed PCR, digital PCR, suppression PCR, semi-suppressive PCR, and assembly PCR.
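The difference between exponential and linear amplification can be illustrated with an idealized calculation. In the sketch below, which is not part of the disclosed method, exponential amplification multiplies the copy number by (1 + efficiency) in each cycle, while linear amplification adds a fixed number of new copies per original template in each cycle. The function names and the efficiency and copies_per_cycle parameters are hypothetical, and real reactions plateau and rarely reach these ideal yields.

```python
def exponential_copies(initial: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized exponential amplification: every copy is a template in each cycle."""
    if cycles < 0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("cycles must be >= 0 and efficiency must be in [0, 1]")
    return initial * (1.0 + efficiency) ** cycles


def linear_copies(initial: float, cycles: int, copies_per_cycle: float = 1.0) -> float:
    """Idealized linear amplification: only the original templates are copied."""
    if cycles < 0 or copies_per_cycle < 0:
        raise ValueError("cycles and copies_per_cycle must be >= 0")
    return initial * (1.0 + copies_per_cycle * cycles)


# One starting molecule after 10 ideal cycles.
exp_yield = exponential_copies(1, 10)   # 1024.0
lin_yield = linear_copies(1, 10)        # 11.0
```

Because copies made in early cycles are themselves copied in exponential amplification, an error introduced in an early cycle can be carried into a large share of the final amplicons, which is one reason molecular label counts may need adjustment.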

In some embodiments, amplification of the labeled nucleic acids includes non-PCR-based methods. Examples of non-PCR-based methods include, but are not limited to, multiple displacement amplification (MDA), transcription-mediated amplification (TMA), nucleic acid sequence-based amplification (NASBA), strand displacement amplification (SDA), real-time SDA, rolling circle amplification, and circle-to-circle amplification. Other non-PCR-based amplification methods include multiple cycles of DNA-dependent RNA polymerase-driven RNA transcription amplification or RNA-directed DNA synthesis and transcription to amplify DNA or RNA targets, the ligase chain reaction (LCR), Qβ replicase (Qβ), the use of palindromic probes, strand displacement amplification, oligonucleotide-driven amplification using a restriction endonuclease, an amplification method in which a primer is hybridized to a nucleic acid sequence and the resulting duplex is cleaved prior to the extension reaction and amplification, strand displacement amplification using a nucleic acid polymerase lacking 5' exonuclease activity, rolling circle amplification, and/or ramification-extension amplification (RAM).

In some embodiments, the methods disclosed herein further include performing a nested polymerase chain reaction on the amplified amplicon (e.g., target). The amplicon can be a double-stranded molecule. The double-stranded molecule can include a double-stranded RNA molecule, a double-stranded DNA molecule, or an RNA molecule hybridized to a DNA molecule. One or both strands of the double-stranded molecule may include a sample tag or a molecular identifier label. Alternatively, the amplicon can be a single-stranded molecule. The single-stranded molecule can include DNA, RNA, or a combination thereof. The nucleic acids of the present invention may include synthetic or modified nucleic acids.

In some embodiments, the method includes repeatedly amplifying the labeled nucleic acid to produce multiple amplicons. The methods disclosed herein may include performing at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amplification reactions. Alternatively, the method includes performing at least about 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 amplification reactions.

The amplification step may further include adding one or more control nucleic acids to one or more samples comprising the plurality of nucleic acids. The amplification step may further include adding one or more control nucleic acids to the plurality of nucleic acids. The control nucleic acids may include a control label.

Amplification may include the use of one or more non-natural nucleotides. Non-natural nucleotides can include photolabile and/or triggerable nucleotides. Examples of non-natural nucleotides include, but are not limited to, peptide nucleic acid (PNA), morpholino and locked nucleic acid (LNA), as well as glycol nucleic acid (GNA) and threose nucleic acid (TNA). Non-natural nucleotides can be added to one or more cycles of the amplification reaction. The addition of non-natural nucleotides can be used to identify products at specific cycles or time points in the amplification reaction.

Performing the one or more amplification reactions may include the use of one or more primers. The one or more primers may comprise one or more oligonucleotides. The one or more oligonucleotides may comprise at least about 7-9 nucleotides. The one or more oligonucleotides may comprise fewer than 12-15 nucleotides. The one or more primers can anneal to at least a portion of the plurality of labeled nucleic acids. The one or more primers can anneal to the 3' end and/or the 5' end of the plurality of labeled nucleic acids. The one or more primers can anneal to an internal region of the plurality of labeled nucleic acids. The internal region can be at least about 50, 100, 150, 200, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 650, 700, 750, 800, 850, 900, or 1000 nucleotides from the 3' ends of the plurality of labeled nucleic acids. The one or more primers may include a fixed panel of primers. The one or more primers may include at least one or more custom primers. The one or more primers may include at least one or more control primers. The one or more primers may include at least one or more housekeeping gene primers. The one or more primers may include a universal primer. The universal primer can anneal to a universal primer binding site. The one or more custom primers can anneal to the first sample tag, the second sample tag, the molecular identifier label, the nucleic acid, or a product thereof. The one or more primers may include a universal primer and a custom primer. The custom primer can be designed to amplify one or more target nucleic acids. The target nucleic acids can include a subset of the total nucleic acids in one or more samples. In some embodiments, the primers are the probes attached to the array of the present disclosure.

In some embodiments, probabilistically barcoding the plurality of targets in the sample further includes generating an indexed library of the probabilistically barcoded fragments. The molecular labels of different probabilistic barcodes may differ from one another. Generating an indexed library of the probabilistically barcoded targets includes generating a plurality of indexed polynucleotides from the plurality of targets in the sample. For example, for an indexed library of probabilistically barcoded targets that includes a first indexed target and a second indexed target, the label region of the first indexed polynucleotide may differ from the label region of the second indexed polynucleotide by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, or 50 nucleotides, by about, at least, or at most such a value, or by a number or range between any two of these values. In some embodiments, generating an indexed library of the probabilistically barcoded targets includes contacting a plurality of targets, for example mRNA molecules, with a plurality of oligonucleotides each including a poly(T) region and a label region; and conducting first-strand synthesis using a reverse transcriptase to produce single-stranded labeled cDNA molecules each comprising a cDNA region and a label region, wherein the plurality of targets includes at least two mRNA molecules of different sequences and the plurality of oligonucleotides includes at least two oligonucleotides of different sequences. Generating an indexed library of the probabilistically barcoded targets further includes amplifying the single-stranded labeled cDNA molecules to produce double-stranded labeled cDNA molecules; and conducting nested PCR on the double-stranded labeled cDNA molecules to produce labeled amplicons. In some embodiments, the method may include generating adapter-labeled amplicons.
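The number of nucleotides by which two label regions differ can be computed as a Hamming distance when the regions have equal length. The following is a minimal sketch under that assumption; the function name and the example sequences are hypothetical and are not taken from the disclosure.

```python
def hamming_distance(label_a: str, label_b: str) -> int:
    """Count positions at which two equal-length label sequences differ."""
    if len(label_a) != len(label_b):
        raise ValueError("label regions must have the same length")
    return sum(base_a != base_b for base_a, base_b in zip(label_a, label_b))


# Two made-up 8-nt label regions that differ at two positions.
first_label_region = "ACGTACGT"
second_label_region = "ACGTTCGA"
difference = hamming_distance(first_label_region, second_label_region)  # 2
```

Label regions of unequal length, or labels affected by insertions and deletions, would need a different measure such as an edit distance.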

Probabilistic barcoding can use nucleic acid barcodes or tags to label individual nucleic acid (e.g., DNA or RNA) molecules. In some embodiments, this involves adding DNA barcodes or tags to cDNA molecules as they are generated from mRNA. Nested PCR can be performed to minimize PCR amplification bias. Adapters can be added for sequencing using, for example, next-generation sequencing (NGS). The sequencing results can be used to determine the sequences of the cell labels, the molecular labels, and the nucleotide fragments of one or more copies of the targets, for example at block 232 of FIG. 2.

FIG. 3 is a schematic illustration showing a non-limiting exemplary process for generating an indexed library of probabilistically barcoded targets, for example mRNAs. As shown in step 1, the reverse transcription process can encode each mRNA molecule with a unique molecular label, a cell label, and a universal PCR site. In particular, RNA molecules 302 can be reverse transcribed to produce labeled cDNA molecules 304, each including a cDNA region 306, by the stochastic hybridization of a set of molecular identifier labels 310 to the poly(A) tail region 308 of the RNA molecules 302. Each of the molecular identifier labels 310 may include a target-binding region, for example a poly(dT) region 312, a label region 314, and a universal PCR region 316.

In some embodiments, the cell label may comprise 3-20 nucleotides. In some embodiments, the molecular label may comprise 3-20 nucleotides. In some embodiments, each of the plurality of probabilistic barcodes further includes one or more of a universal label and a cell label, where the universal label is the same for the plurality of probabilistic barcodes on the solid support and the cell label is the same for the plurality of probabilistic barcodes on the solid support. In some embodiments, the universal label may comprise 3-20 nucleotides. In some embodiments, the cell label comprises 3-20 nucleotides.

In some embodiments, the label region 314 may include a molecular label 318 and a cell label 320. In some embodiments, the label region 314 may include one or more of a universal label, a dimension label, and a cell label. The molecular label 318 may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 nucleotides in length, may be about, at least, or at most such a length, or may have a length that is a number or range between any of these values. The cell label 320 may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 nucleotides in length, may be about, at least, or at most such a length, or may have a length that is a number or range between any of these values. The universal label may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 nucleotides in length, may be about, at least, or at most such a length, or may have a length that is a number or range between any of these values. The universal label may be the same for the plurality of probabilistic barcodes on the solid support, and the cell label may be the same for the plurality of probabilistic barcodes on the solid support. The dimension label may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 nucleotides in length, may be about, at least, or at most such a length, or may have a length that is a number or range between any of these values.

In some embodiments, the label region 314 may include 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 different labels, about, at least, or at most such a number of different labels, or a number or range of different labels between any of these values. Each label may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 nucleotides in length, may be about, at least, or at most such a length, or may have a length that is a number or range between any of these values. The set of molecular identifier labels 310 may include 10, 20, 40, 50, 70, 80, 90, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, 10¹⁰, 10¹¹, 10¹², 10¹³, 10¹⁴, 10¹⁵, or 10²⁰ molecular identifier labels 310, about, at least, or at most such a number of molecular identifier labels 310, or a number or range of molecular identifier labels 310 between any of these values. The molecular identifier labels 310 in the set may, for example, each include a unique label region 314. The labeled cDNA molecules 304 can be purified to remove excess molecular identifier labels 310. Purification may include Ampure bead purification.

As shown in step 2, products from the reverse transcription process in step 1 can be pooled into one tube and PCR amplified with a first PCR primer pool and a first universal PCR primer. Pooling is possible because of the unique label region 314. In particular, the labeled cDNA molecules 304 can be amplified to produce nested PCR labeled amplicons 322. Amplification can comprise multiplex PCR amplification. Amplification can comprise multiplex PCR amplification with 96 multiplex primers in a single reaction volume. In some embodiments, multiplex PCR amplification can use, use about, use at least, or use at most 10, 20, 40, 50, 70, 80, 90, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, 10¹⁰, 10¹¹, 10¹², 10¹³, 10¹⁴, 10¹⁵, or 10²⁰ multiplex primers in a single reaction volume, or a number or range between any of these values. Amplification can comprise a first PCR primer pool 324 of custom primers 326A-C targeting specific genes and a universal primer 328. The custom primers 326 can hybridize to a region within the cDNA portion 306' of the labeled cDNA molecule 304. The universal primer 328 can hybridize to the universal PCR region 316 of the labeled cDNA molecule 304.

As shown in step 3 of FIG. 3, products from the PCR amplification in step 2 can be amplified with a nested PCR primer pool and a second universal PCR primer. Nested PCR can minimize PCR amplification bias. In particular, the nested PCR labeled amplicons 322 can be further amplified by nested PCR. The nested PCR can comprise multiplex PCR with a nested PCR primer pool 330 of nested PCR primers 332a-c and a second universal PCR primer 328' in a single reaction volume. The nested PCR primer pool 330 can contain, contain about, contain at least, or contain at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 different nested PCR primers 332, or a number or range between any of these values. The nested PCR primers 332 can contain an adaptor 334 and hybridize to a region within the cDNA portion 306' of the labeled amplicon 322. The universal primer 328' can contain an adaptor 336 and hybridize to the universal PCR region 316 of the labeled amplicon 322. Step 3 thus produces adaptor-labeled amplicons 338. In some embodiments, the nested PCR primers 332 and the second universal PCR primer 328' do not contain the adaptors 334 and 336. Instead, the adaptors 334 and 336 can be ligated to the products of nested PCR to produce the adaptor-labeled amplicons 338.

As shown in step 4, PCR products from step 3 can be PCR amplified for sequencing using library amplification primers. In particular, the adaptors 334 and 336 can be used to conduct one or more additional assays on the adaptor-labeled amplicons 338. The adaptors 334 and 336 can hybridize to primers 340 and 342. One or more of the primers 340 and 342 can be PCR amplification primers. One or more of the primers 340 and 342 can be sequencing primers. One or more of the adaptors 334 and 336 can be used for further amplification of the adaptor-labeled amplicons 338. One or more of the adaptors 334 and 336 can be used for sequencing the adaptor-labeled amplicons 338. The primer 342 can contain a plate index 344, so that amplicons generated using the same set of molecular identifier labels 318 can be sequenced in one sequencing reaction using next-generation sequencing (NGS).

Correction of PCR and Sequencing Errors
Disclosed herein are methods for determining the number of targets. In some embodiments, the method comprises: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining a quality status of the target in the sequencing data obtained in (b); (iii) determining one or more sequencing data errors in the sequencing data obtained in (b), wherein determining the one or more sequencing data errors comprises determining one or more of: the number of molecular labels with distinct sequences associated with the target in the sequencing data, the quality status of the target in the sequencing data, and the number of molecular labels with distinct sequences in the plurality of stochastic barcodes; and (iv) estimating the number of the target, wherein the estimated number correlates with the number of molecular labels with distinct sequences associated with the target counted in (i), adjusted according to the one or more sequencing data errors determined in (iii). Steps (i), (ii), (iii), and (iv) can be performed for each of the plurality of targets. The method can be multiplexed.

In some embodiments, the method further comprises collapsing the sequencing data before determining the one or more sequencing data errors. Collapsing the sequencing data comprises attributing copies of the target that have similar molecular labels and an occurrence below a predetermined collapsing occurrence threshold as having the same molecular label for the plurality of targets, wherein two copies of the target have similar molecular labels if their molecular labels differ by at least one base in sequence.

The percentage of molecular labels in the sequencing data that is retained after adjusting the sequencing data according to the one or more sequencing data errors can vary. In some embodiments, this percentage can be, or be about, 50%, 60%, 70%, 80%, 90%, 95%, 99%, or 99.9%, or a number or range between any two of these values. In some embodiments, this percentage can be at least, or at most, 50%, 60%, 70%, 80%, 90%, 95%, 99%, or 99.9%.

Determining Molecular Label Counts
FIG. 5 is a flowchart showing a non-limiting exemplary embodiment 500 of correcting PCR and sequencing errors using molecular labels. Embodiment 500 begins at start block 504, after a plurality of targets has been stochastically barcoded using a plurality of stochastic barcodes (each comprising a molecular label) to create a plurality of stochastically barcoded targets, and after sequencing data of the stochastically barcoded targets has been obtained.

For a target, for example a gene from a cell in a microwell of a microwell array, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be counted at block 508. In the sequencing data, two copies of the target can have similar molecular labels; for example, the molecular labels of the two copies can differ by one base in sequence. Both copies of the target can be true; one copy can be true while the other results from a sequencing error or PCR error; or both copies can result from sequencing errors or PCR errors.

Collapsing Sequencing Data
At block 512, the sequencing data can be collapsed. Collapsing the sequencing data can comprise attributing copies of the target that have similar molecular labels and an occurrence below a predetermined collapsing occurrence threshold as having the same molecular label for the plurality of targets. The predetermined collapsing occurrence threshold can vary from 1 to 100. In some embodiments, when the stochastic barcodes comprise about 6561 molecular labels with distinct sequences, the predetermined collapsing occurrence threshold can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 17, 20, 30, 40, 50, 60, 70, 80, 90, or 100, or a number or range between any two of these values. In some embodiments, when the stochastic barcodes comprise about 6561 molecular labels with distinct sequences, the predetermined collapsing occurrence threshold can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 17, 20, 30, 40, 50, 60, 70, 80, 90, or 100. For example, a molecular label can be eight nucleotides long and each nucleotide position can have three possibilities, such as adenine (A), cytosine (C), and guanine (G); C, G, and thymine (T); A, G, and T; or A, C, and T, which gives 3⁸ = 6561 unique molecular labels.

In some embodiments, when the stochastic barcodes comprise about 65536 molecular labels with distinct sequences, the predetermined collapsing occurrence threshold can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100, or a number or range between any two of these values. In some embodiments, when the stochastic barcodes comprise about 65536 molecular labels with distinct sequences, the predetermined collapsing occurrence threshold can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100. For example, a molecular label can be eight nucleotides long and each nucleotide position can have four possibilities (A, C, G, and T), which gives 4⁸ = 65536 unique molecular labels.
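The two label-space sizes quoted above follow directly from the label length and the number of bases allowed per position. The following is a minimal sketch of that arithmetic; the function names `label_space_size` and `enumerate_labels` are illustrative and do not come from the disclosure.

```python
from itertools import product


def label_space_size(length: int, choices_per_position: int) -> int:
    """Number of distinct labels of a given length.

    Each position independently takes one of `choices_per_position` bases.
    """
    return choices_per_position ** length


def enumerate_labels(length: int, alphabet: str) -> list:
    """List every label of `length` over `alphabet` (practical for short labels only)."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


# Eight positions with three allowed bases per position.
three_base_labels = label_space_size(8, 3)
# Eight positions with all four bases allowed.
four_base_labels = label_space_size(8, 4)
print(three_base_labels, four_base_labels)
```

Restricting each position to three bases shrinks the label space by roughly a factor of ten compared with four bases, which is why the thresholds in the text are stated separately for the two cases.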

For example, there can be five copies of a target. The five copies of the target can have the molecular labels [sequences not reproduced], with 261, 2, 2, 1, and 1 reads per molecular label, respectively. The molecular labels [sequences not reproduced] are similar to the molecular label TGTGCGTG because each differs from TGTGCGTG by one nucleotide (underlined). If there are 6561 molecular labels with distinct sequences and the predetermined collapsing occurrence threshold is 7, the occurrences of the molecular labels [sequences not reproduced] can be attributed to the molecular label TGTGCGTG.

As another example, there can be seven copies of a target. The seven copies of the target can have the molecular labels [sequences not reproduced], with 10, 7, 5, 4, 1, 1, and 1 reads per molecular label, respectively. The molecular labels [sequences not reproduced] are similar to the molecular label CGCGTTCA because they differ from it by one nucleotide (underlined). If there are 6561 molecular labels with distinct sequences and the predetermined collapsing occurrence threshold is 7, the occurrences of the molecular labels [sequences not reproduced] can be attributed to the molecular label CGCGTTCA.
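The collapsing step in the examples above can be sketched as follows. This is a hedged illustration and not the disclosed implementation: it assumes that only labels whose read count is below the collapsing threshold are reassigned, and that each is reassigned to the most abundant label at or above the threshold that differs from it by exactly one base. Apart from TGTGCGTG, the label sequences are hypothetical stand-ins, since the actual sequences are not reproduced in this text.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length labels differ."""
    if len(a) != len(b):
        raise ValueError("labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_labels(read_counts: dict, threshold: int) -> dict:
    """Fold low-count labels into an abundant label one base away.

    Assumption of this sketch: labels with at least `threshold` reads act
    as anchors and are never reassigned.
    """
    anchors = [label for label, n in read_counts.items() if n >= threshold]
    collapsed = {label: read_counts[label] for label in anchors}
    for label, n in read_counts.items():
        if n >= threshold:
            continue
        near = [a for a in anchors if hamming(a, label) == 1]
        if near:
            # Attribute the reads to the most abundant neighbouring anchor.
            best = max(near, key=lambda a: (read_counts[a], a))
            collapsed[best] += n
        else:
            # No abundant neighbour: keep the label as its own entry.
            collapsed[label] = n
    return collapsed


# Hypothetical one-base variants of TGTGCGTG with the read counts from the text.
example = {
    "TGTGCGTG": 261,
    "AGTGCGTG": 2,
    "TGTGCGTA": 2,
    "TGTGAGTG": 1,
    "TGTTCGTG": 1,
    "ACACACAC": 3,  # unrelated low-count label, left untouched
}
print(collapse_labels(example, threshold=7))
```

Choosing anchors from the uncollapsed counts keeps the result independent of iteration order, at the cost of never chaining merges through intermediate low-count labels.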

Sequencing Data Errors
The methods disclosed herein can be used to identify and/or correct sequencing data errors, for example errors arising in methods of counting one or more target nucleic acids. In some embodiments, a sequencing data error can be, or comprise, a PCR-introduced error, a sequencing-introduced error, an error caused by barcode contamination, a library preparation error, or any combination thereof. A PCR-introduced error can be, or comprise, a result of a PCR amplification error, PCR amplification bias, insufficient PCR amplification, or any combination thereof. A sequencing-introduced error can be, or comprise, a result of inaccurate base calling, insufficient sequencing, or any combination thereof. An error can be, or comprise, a deletion of one or more nucleotides, a substitution of one or more nucleotides, an addition of one or more nucleotides, or any combination thereof.

Determining Sequencing Status
As described above, a plurality of targets can be stochastically barcoded using a plurality of stochastic barcodes, each comprising a molecular label, to create a plurality of stochastically barcoded targets, and sequencing data of the stochastically barcoded targets can be obtained. For a target, for example a gene from one cell in a microwell of a microwell array, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be counted. The counted sequencing data can be collapsed, for example by attributing copies of the target that have similar molecular labels and an occurrence below a predetermined collapsing occurrence threshold as having the same molecular label for the plurality of targets. After the sequencing data is collapsed, the quality status of the target can be determined.

Referring to FIG. 5, in some embodiments, at block 516 the quality status of the target in the sequencing data can be determined to be complete sequencing, incomplete sequencing, or saturated sequencing. The quality status of the target can depend on whether all of the true (real) molecular labels were observed at the depth of the sequencing run. A true or real molecular label is a molecular label that is not an error or false molecular label. An error or false molecular label is a molecular label whose sequence results from a PCR error, an artifact, or a sequencing error. The quality status of the target in the sequencing data can be determined from the number of molecular labels with distinct sequences in the plurality of stochastic barcodes and the number of molecular labels with distinct sequences associated with the target in the counted sequencing data.

In some embodiments, the complete sequencing quality status can be determined by an index of dispersion, relative to a Poisson distribution, that is greater than or equal to a predetermined complete sequencing dispersion threshold. The index of dispersion can be defined as the variance divided by the mean for the target. FIG. 6 is a schematic illustration of sequencing data obtained with complete sequencing and with incomplete sequencing. FIG. 6 shows three copies of gene A and six copies of gene B in the library (left circle). If the three copies of gene A have six, five, and one sequencing reads in the sequencing data (top right circle), the variance is 7, the mean is 4, and the index of dispersion is 1.75. If the six copies of gene B have nine, two, two, two, one, and one sequencing reads in the sequencing data (top right circle), the variance is 9.36, the mean is 2.83, and the index of dispersion is 3.31. With these sequencing data, if the predetermined complete sequencing dispersion threshold is, for example, 0.9, gene A and gene B can be considered to have the complete sequencing status.

If one copy of gene A is not observed and the other two copies of gene A have two and three sequencing reads in the sequencing data (bottom right circle), the variance is 0.5, the mean is 2.5, and the index of dispersion is 0.2. If two copies of gene B are not observed and the other four copies of gene B have four, two, one, and one sequencing reads in the sequencing data (bottom right circle), the variance is 2, the mean is 2, and the index of dispersion is 1. With these sequencing data, if the predetermined complete sequencing dispersion threshold is, for example, 1.1, gene A and gene B can be considered to have the incomplete sequencing status.
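The index of dispersion for the FIG. 6 read counts can be recomputed with a few lines of code. This is a minimal sketch that assumes the variance is the sample variance (divisor n - 1), which is the convention that reproduces the variance of 7 quoted for gene A; the function name `index_of_dispersion` is illustrative.

```python
from statistics import mean, variance


def index_of_dispersion(reads) -> float:
    """Sample variance divided by mean of per-label read counts.

    Needs at least two read counts, because the sample variance is
    undefined for a single value.
    """
    values = [float(r) for r in reads]
    return variance(values) / mean(values)


# Read counts quoted for FIG. 6.
complete_a = index_of_dispersion([6, 5, 1])
complete_b = index_of_dispersion([9, 2, 2, 2, 1, 1])
incomplete_a = index_of_dispersion([2, 3])
incomplete_b = index_of_dispersion([4, 2, 1, 1])

for name, value in [
    ("gene A, complete", complete_a),
    ("gene B, complete", complete_b),
    ("gene A, incomplete", incomplete_a),
    ("gene B, incomplete", incomplete_b),
]:
    print(f"{name}: {value:.2f}")
```

A Poisson-distributed count has an index of dispersion of 1, so a dispersion threshold near 1 separates read-count profiles that are more variable than Poisson from those that are not.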

The predetermined complete sequencing dispersion threshold can vary from 0.5 to 5. In some embodiments, the predetermined complete sequencing dispersion threshold can be, or be about, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, or 6, or a number or range between any two of these values. In some embodiments, the predetermined complete sequencing dispersion threshold can be at least, or at most, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, or 6.

In some embodiments, the complete sequencing quality status can be further determined by a molecular label having an occurrence in the sequencing data greater than or equal to a predetermined complete sequencing occurrence threshold. The predetermined complete sequencing occurrence threshold can vary from 8 to 20. In some embodiments, the complete sequencing occurrence threshold can be, or be about, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20, or a number or range between any two of these values. In some embodiments, the complete sequencing occurrence threshold can be at least, or at most, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20.

In some embodiments, the saturated sequencing quality status can be determined by the target having a number of molecular labels with distinct sequences greater than a predetermined saturation threshold. The saturated sequencing quality status can be further determined by another target of the plurality of targets having a number of molecular labels with distinct sequences greater than the predetermined saturation threshold.

The predetermined saturation threshold can vary. In some embodiments, when the stochastic barcodes comprise about 6561 molecular labels with distinct sequences, the predetermined saturation threshold can be, or be about, 6000, 6100, 6200, 6300, 6400, 6500, 6557, 6558, 6559, 6560, or 6561, or a number or range between any two of these values. In some embodiments, when the stochastic barcodes comprise about 6561 molecular labels with distinct sequences, the predetermined saturation threshold can be at least, or at most, 6000, 6100, 6200, 6300, 6400, 6500, 6557, 6558, 6559, 6560, or 6561. In some embodiments, when the stochastic barcodes comprise about 65536 molecular labels with distinct sequences, the predetermined saturation threshold can be, or be about, 64000, 64100, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 65100, 65200, 65300, 65400, 65500, 65510, 65520, 65530, 65532, 65533, 65534, or 65535, or a number or range between any two of these values. In some embodiments, when the stochastic barcodes comprise about 65536 molecular labels with distinct sequences, the predetermined saturation threshold can be at least, or at most, 64000, 64100, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 65100, 65200, 65300, 65400, 65500, 65510, 65520, 65530, 65532, 65533, 65534, or 65535.

In some embodiments, the quality status of the target in the sequencing data can be classified as incomplete sequencing if it is neither complete sequencing nor saturated sequencing.
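The three quality statuses described above can be combined into a single classification routine. The sketch below is one possible reading and not the disclosed implementation: it assumes that saturation is checked first, that the complete status requires both the dispersion criterion and the occurrence criterion, and that the default thresholds (1.0, 8, and 6557) are merely picked from the ranges listed in the text. The function name and parameter names are hypothetical.

```python
from statistics import mean, variance


def quality_status(
    read_counts: dict,
    dispersion_threshold: float = 1.0,
    occurrence_threshold: int = 8,
    saturation_threshold: int = 6557,
) -> str:
    """Classify a target as 'saturated', 'complete', or 'incomplete'.

    `read_counts` maps each distinct molecular label seen for the target
    to its number of reads. The order of the checks is an assumption.
    """
    # Saturated: more distinct labels observed than the saturation threshold.
    if len(read_counts) > saturation_threshold:
        return "saturated"

    counts = [float(c) for c in read_counts.values()]
    # The sample variance needs at least two labels; treat fewer as no dispersion.
    dispersion = variance(counts) / mean(counts) if len(counts) >= 2 else 0.0
    has_deep_label = any(c >= occurrence_threshold for c in counts)

    if dispersion >= dispersion_threshold and has_deep_label:
        return "complete"
    # Neither complete nor saturated.
    return "incomplete"


print(quality_status({"L1": 261, "L2": 2, "L3": 1}))
print(quality_status({"L1": 2, "L2": 3}))
```

The final `return` mirrors the rule that incomplete sequencing is simply the status left over when the other two do not apply.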

Complete Sequencing Quality Status
The methods disclosed herein can provide an estimate of the number of a target in a sequencing library when the target has the complete sequencing quality status. When a target in the sequencing library has the complete sequencing quality status, thresholds can be established through separate Poisson models for the sequencing reads of true stochastic barcodes and of error stochastic barcodes. The quality status of the target can depend on whether all of the true (real) molecular labels were observed at the depth of the sequencing run. A true or real molecular label is a molecular label that is not an error or false molecular label. An error or false molecular label is a molecular label whose sequence results from a PCR error, an artifact, or a sequencing error.

Referring to FIG. 5, at decision state 520, if the target molecule has the complete sequencing status, embodiment 500 proceeds to block 524. At block 524, one-base sequencing errors can be removed by the following steps. Step (1): select the molecular label associated with the most abundant sequencing reads as the first parent molecular label, provided its sequencing reads exceed 25. For example, after counting the number of molecular labels with distinct sequences associated with the target in the sequencing data, select the molecular label associated with the target that has the highest number of sequencing reads.

Step (2): identify child molecular labels, that is, molecular labels with sequencing reads ≤ 3 that are one base away from the first parent molecular label; if no one-base child molecular label is found, go to step (5). Step (3): perform multiple binomial tests on all child molecular labels and the parent molecular label, remove the child molecular labels for which the null hypothesis is accepted, and attribute their sequencing reads to their parent. If none of the null hypotheses is accepted, none of the child molecular labels is a one-base sequencing error of the parent molecular label, and no read correction is needed. Step (4): update the molecular label sequences and the sequencing reads. For example, if the null hypothesis of the multiple binomial tests is accepted, the occurrences of the child molecular label can be attributed to the parent molecular label. Step (5): select the molecular label with the next highest sequencing reads as the parent molecular label, and repeat the preceding steps until no eligible parent molecular label or eligible child molecular label remains.

In some embodiments, if the target has the complete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the counted sequencing data can be adjusted by: determining all child molecular labels of one or more parent molecular labels; performing a statistical analysis, such as multiple binomial tests, on at least one child molecular label and the parent molecular label; and attributing the occurrences of the child molecular label to the parent molecular label if the null hypothesis of the statistical analysis is accepted.

In some embodiments, the child molecular labels can include molecular labels that differ from the parent molecular label by one base and have a number of occurrences less than or equal to a predetermined complete sequencing child threshold. The predetermined complete sequencing child threshold can vary. In some embodiments, the predetermined complete sequencing child threshold can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or a number or a range between any two of these values. In some embodiments, the predetermined complete sequencing child threshold can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10.

In some embodiments, the one or more parent molecular labels include molecular labels with a number of occurrences greater than or equal to a predetermined complete sequencing parent threshold, where the predetermined complete sequencing parent threshold equals a predetermined complete sequencing occurrence threshold, for example, 8. The null hypothesis of the first statistical analysis can be accepted if the probability that the null hypothesis is true is below the false discovery rate. The false discovery rate can vary. In some embodiments, the false discovery rate can be, or be about, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, or a number or a range between any two of these values. In some embodiments, the false discovery rate can be at least, or at most, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, or 20%. The first statistical analysis can be multiple binomial tests.

At block 528, a Poisson model can be used to threshold the molecular labels of the target in order to determine the true molecular labels and the false molecular labels associated with the target in the sequencing data. For example, a Poisson model can be applied to the sequencing reads to distinguish "likely true" molecular labels from artifacts.

In some embodiments, if the target has a complete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the counted sequencing data can be adjusted by thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels associated with the target in the sequencing data. Thresholding the molecular labels of the target can include performing a statistical analysis on the molecular labels of the target.

In some embodiments, performing the statistical analysis includes: fitting the distribution of the molecular labels of the target and their numbers of occurrences to two Poisson distributions; determining the number n of true molecular labels using the two Poisson distributions; and removing the false molecular labels from the sequencing data, where the false molecular labels include molecular labels with fewer occurrences than the n-th most abundant molecular label, and the true molecular labels include molecular labels with at least as many occurrences as the n-th most abundant molecular label. The two Poisson distributions can include a first Poisson distribution corresponding to the true molecular labels and a second Poisson distribution corresponding to the false molecular labels.
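One way to picture the two-Poisson thresholding is a two-component Poisson mixture fitted by expectation-maximization (EM). The sketch below is illustrative only: the use of EM, the initial values, and the rule "n is the number of labels more likely to belong to the high-mean component" are assumptions, and the function names are hypothetical.

```python
from math import exp, lgamma, log

def poisson_logpmf(k, lam):
    """log P(X = k) for X ~ Poisson(lam)."""
    return k * log(lam) - lam - lgamma(k + 1)

def fit_two_poisson(counts, iters=200):
    """Fit a two-component Poisson mixture by EM.

    Component 0 models false (error) labels, component 1 models true labels.
    Returns (means, weights, responsibilities of component 1).
    """
    lam = [max(min(counts), 0.5), max(max(counts), 1.0)]  # assumed starting means
    w = [0.5, 0.5]
    resp = [0.5] * len(counts)
    for _ in range(iters):
        # E-step: posterior probability that each label is a true label.
        resp = []
        for k in counts:
            logs = [log(w[j]) + poisson_logpmf(k, lam[j]) for j in (0, 1)]
            top = max(logs)
            p = [exp(v - top) for v in logs]
            resp.append(p[1] / (p[0] + p[1]))
        # M-step: update means and mixing weights.
        r1 = sum(resp)
        r0 = len(counts) - r1
        if r0 < 1e-9 or r1 < 1e-9:
            break  # one component is empty; stop rather than divide by zero
        lam[0] = max(sum((1 - r) * k for r, k in zip(resp, counts)) / r0, 1e-6)
        lam[1] = max(sum(r * k for r, k in zip(resp, counts)) / r1, 1e-6)
        w = [r0 / len(counts), r1 / len(counts)]
    return lam, w, resp

def threshold_labels(reads):
    """Keep labels with at least as many reads as the n-th most abundant label."""
    counts = list(reads.values())
    _lam, _w, resp = fit_two_poisson(counts)
    n = sum(1 for r in resp if r > 0.5)  # assumed rule for the number of true labels
    if n == 0:
        return {}
    cutoff = sorted(counts, reverse=True)[n - 1]
    return {label: c for label, c in reads.items() if c >= cutoff}
```

If every label has the same read count the mixture is degenerate and this sketch returns an empty result, so a real pipeline would need a fallback for that case.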

At block 532, after the sequencing data has been corrected or adjusted using multiple binomial tests or two Poisson distributions, the number of targets can be estimated to generate an output. Embodiment 500 ends at end block 536.

Saturated Sequencing Quality Status
The methods disclosed herein may be unable to provide an estimate of the number of targets in a sequencing library when the target has a saturated sequencing quality status, because of the large uncertainty in estimating the molecular label count. Referring to FIG. 5, in some embodiments, if the sequencing status at decision state 520 is not a complete sequencing status, embodiment 500 proceeds to decision state 540. If, at decision state 540, the target has a saturated sequencing status, embodiment 500 proceeds to end block 536. For a saturated sequencing status, the number of targets may not be determined because of the large uncertainty in estimating the molecular label count.

Incomplete Sequencing Quality Status
The methods disclosed herein can provide an estimate of the number of targets in a sequencing library when the target has an incomplete sequencing quality status. When targets in the sequencing library have an incomplete sequencing quality status, noisy targets, for example noisy genes, can be removed. A target can be noisy if its amplification rate (average reads per molecular label) is similar to the amplification rate of errors derived from completely sequenced genes in the same library that contains the target. A zero-truncated Poisson model can be applied to the sequencing reads of a target with an incomplete sequencing quality status to extrapolate an estimate of the number of stochastic barcodes, with distinct molecular labels, that include the target and are present in the library.

Embodiment 500 can provide an estimate of the number of targets in a sequencing library when some of the true stochastic barcodes used to label the starting targets were not observed because of inadequate sequencing depth. If, at decision state 540, the target does not have a saturated sequencing status, the target has an incomplete sequencing status, and embodiment 500 proceeds to block 544 to remove noisy targets, for example noisy genes.

When the dispersion index of a target is > 4 and the maximum number of sequencing reads for the target is > 18, using Poisson modeling to obtain a threshold for distinguishing true barcodes from error barcodes can still provide a suitable estimate. When the sequencing data shows mild over-dispersion, for example 1.5 < dispersion index ≤ 4, and the maximum number of sequencing reads for the target is ≤ 18, obtaining the threshold with a Poisson model may underestimate the true molecular label count. The reason for the underestimate may be that the molecular labels with low read counts are probably a mixture of true and false molecular labels. As a result, the true molecular labels with low sequencing reads are forced into the Poisson model of the errors, and the Poisson model of the true molecular labels may contain fewer molecular labels than it should. In that case an ad hoc method can be used, for example one that takes the molecular label count after molecular labels with low counts, such as 1, have been removed. When the dispersion index is close to 1, for example 0.9 to 1.5, the observed molecular label count can yield a suitable estimate. When the dispersion index is 0.1 to 0.9, a zero-truncated Poisson model characterized by an under-dispersed Poisson model can produce a suitable estimate; when errors are present in the sequencing data, however, this model may tend to overestimate.
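The ranges above amount to a small decision table, sketched below. The sketch assumes that the dispersion index is the variance-to-mean ratio of the per-label read counts (the usual definition, with population variance chosen here as an assumption). The returned strategy names are hypothetical labels, and combinations that the text does not cover are reported as "unspecified" rather than guessed.

```python
from statistics import mean, pvariance

def dispersion_index(read_counts):
    """Variance-to-mean ratio of reads per molecular label (assumed definition)."""
    return pvariance(read_counts) / mean(read_counts)

def choose_estimator(read_counts):
    """Map dispersion index and maximum read count to an estimation strategy."""
    d = dispersion_index(read_counts)
    top = max(read_counts)
    if d > 4 and top > 18:
        return "poisson_threshold"        # Poisson modeling still gives a suitable estimate
    if 1.5 < d <= 4 and top <= 18:
        return "drop_low_counts"          # ad hoc: drop labels with very low counts, e.g. 1
    if 0.9 <= d <= 1.5:
        return "observed_count"           # observed label count is a suitable estimate
    if 0.1 <= d < 0.9:
        return "zero_truncated_poisson"   # under-dispersed case
    return "unspecified"                  # combination not covered by the description
```

For example, `choose_estimator([1, 1, 1, 1, 1, 1, 1, 10])` has a dispersion index above 4 but a maximum read count of only 10, a combination the description does not address, so the sketch returns "unspecified".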

In some embodiments, if the quality status of the target in the sequencing data is an incomplete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be adjusted by: determining whether the target is noisy in the sequencing data; and removing noisy targets from the sequencing data. A target can be noisy if the number of occurrences of its molecular labels is less than or equal to an incomplete sequencing noisy target threshold. The incomplete sequencing noisy gene threshold can vary. In some embodiments, the incomplete sequencing noisy target threshold can equal the median or mean number of occurrences of the molecular labels of a plurality of targets with a complete sequencing quality status. In some embodiments, the incomplete sequencing noisy gene threshold can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or a number or a range between any two of these values. In some embodiments, the incomplete sequencing noisy gene threshold can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10.

At block 548, a zero-truncated Poisson model is applied to the sequencing reads of a target with an incomplete sequencing quality status to extrapolate an estimate of the number of stochastic barcodes, with distinct molecular labels, that include the target and are present in the library. In some embodiments, if the quality status of the target in the obtained sequencing data is an incomplete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data is adjusted by: determining whether the target is noisy in the sequencing data; and removing noisy targets.
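The extrapolation at block 548 can be illustrated with the textbook zero-truncated Poisson estimator. This is a sketch, not necessarily the exact procedure used in the embodiment: it assumes that reads per label follow a Poisson distribution with mean λ, that labels with zero reads are simply unobserved, and that λ is fitted by matching the observed mean to λ / (1 − e^(−λ)). The function names are hypothetical.

```python
from math import exp, expm1

def fit_zero_truncated_poisson(read_counts, tol=1e-10):
    """Solve mean = lam / (1 - exp(-lam)) for lam by bisection."""
    if any(k < 1 for k in read_counts):
        raise ValueError("observed labels must have at least one read")
    m = sum(read_counts) / len(read_counts)
    if m <= 1.0:
        raise ValueError("mean reads per label must exceed 1 to fit lam > 0")
    lo, hi = 1e-9, m  # lam / (1 - exp(-lam)) is increasing and exceeds lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid / -expm1(-mid) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def estimate_total_labels(read_counts):
    """Observed label count divided by the probability of seeing a label at all."""
    lam = fit_zero_truncated_poisson(read_counts)
    p_seen = -expm1(-lam)  # 1 - P(zero reads)
    return len(read_counts) / p_seen
```

For ten observed labels averaging 1.5 reads each, the fitted λ is about 0.87 and the extrapolated total is about 17 labels, meaning roughly seven labels would be expected to have gone unobserved under these assumptions.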

In some embodiments, if the quality status of the target in the sequencing data is an incomplete sequencing quality status, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be adjusted by thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels in the sequencing data. Thresholding the molecular labels of the target can include performing a statistical analysis on the molecular labels. Performing the statistical analysis on the molecular labels can include: determining the number n of true molecular labels using a zero-truncated Poisson model; and removing the false molecular labels from the sequencing data.

In some embodiments, the false molecular labels can include molecular labels with fewer occurrences than the n-th most abundant molecular label. The true molecular labels can include molecular labels with at least as many occurrences as the n-th most abundant molecular label.

Sequencing Data Errors
The methods disclosed herein can be used to identify and/or correct sequencing data errors, for example errors that arise in methods of counting one or more target nucleic acids. In some embodiments, the errors can be, or include, a deletion of one or more nucleotides, a substitution of one or more nucleotides, an addition of one or more nucleotides, or any combination thereof. The errors can be present in the molecular label (ML), the sample label (SL), or other labels on a stochastic barcode. In some embodiments, the sequencing data errors can be, or include, PCR-introduced errors, sequencing-introduced errors, reverse transcription (RT) primer contamination errors, or any combination thereof. PCR-introduced errors can be, or include, the result of PCR amplification errors, PCR amplification bias, insufficient PCR amplification, or any combination thereof. Sequencing-introduced errors can be, or include, the result of inaccurate base calling, insufficient sequencing, or any combination thereof. RT primer contamination errors can be errors caused by reverse transcription primers that carry over into PCR.

As used herein, the term "coverage" or "sequencing depth" can refer to the number of reads, in the sequencing data, of a barcoded target with a particular ML and a particular SL. For example, a barcoded target can be sequenced multiple times, so a barcoded target with a particular ML and SL can be observed multiple times. As another example, a cell can contain multiple copies of a target (e.g., multiple copies of an mRNA molecule of a gene). These multiple copies of the target can be barcoded. After PCR amplification (e.g., block 28 in the figure), multiple copies of a barcoded target with a particular ML and SL can be present. During sequencing, some or all of the multiple copies of the barcoded target with the particular ML and SL can be sequenced. The number of reads of barcoded targets with the same ML and SL observed in the sequencing data may be referred to as "coverage" or "sequencing depth".
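As a small illustration of this definition, coverage can be tabulated by counting reads that share the same sample label, molecular label, and target. The record layout (tuples of sample label, molecular label, target name) and the function names are assumptions made for this sketch.

```python
from collections import Counter

def coverage_per_label(reads):
    """Count reads for each (sample label, molecular label, target) combination.

    reads is an iterable of (sample_label, molecular_label, target) tuples,
    one tuple per sequencing read (an assumed input format).
    """
    return Counter((sl, ml, target) for sl, ml, target in reads)

def distinct_label_count(cov, sample_label, target):
    """Number of distinct molecular labels seen for one target in one sample."""
    return sum(1 for (sl, _ml, t) in cov if sl == sample_label and t == target)

# Example: three reads of one barcoded molecule and a single read of another.
example_reads = [
    ("S1", "ACGTACGT", "geneA"),
    ("S1", "ACGTACGT", "geneA"),
    ("S1", "ACGTACGT", "geneA"),
    ("S1", "ACGTACGA", "geneA"),
]
example_cov = coverage_per_label(example_reads)
```

Here the first label has a coverage of 3 and the second a coverage of 1, which is the kind of imbalance that the error-correction steps use to tell likely true labels from likely errors.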

In some embodiments, sequencing data errors can be identified and/or corrected. For example, copies of a target from a cell can be barcoded with different MLs and the same SL. A barcoded target with one ML can have multiple reads in the sequencing data. A barcoded target with a different ML may have only a few reads (e.g., one read). The former barcoded target may be more likely than the latter to carry a true ML (also called a real or signal ML). The latter barcoded target may carry an error ML (also called a false or noise ML). This may be because the two MLs would be expected to have similar coverage or sequencing depth. The latter barcoded target, with only a few reads, may be an artifact or error that arose during sequencing or PCR.

As another example, stochastic barcodes that carry over into PCR can cause RT primer contamination errors. In some embodiments, after mRNA molecules have been reverse transcribed into cDNA molecules (e.g., 24 in the figure), stochastic barcodes that were not incorporated into cDNA molecules can be removed, for example by Ampure bead purification. The removal method, for example Ampure bead purification, may not completely remove the stochastic barcodes that were not extended by reverse transcription and so were not incorporated into stochastically barcoded cDNA molecules. For example, 15%, 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.1%, or a range between any two of these values, of such unextended stochastic barcodes may not be removed by Ampure bead purification. These unremoved stochastic barcodes can cause sequencing data errors during amplification of the cDNA molecules (e.g., block 28 in the figure). Stochastic barcodes can be very similar between samples. For example, the sample labels of stochastic barcodes can be identical for the same sample. Consequently, these unremoved stochastic barcodes can hybridize during PCR to other nucleic acid molecules from the same sample (e.g., the SL region of stochastically barcoded mRNA molecules), so that PCR crossover occurs and results in sequencing data errors referred to as SL errors.

True MLs, error MLs, and SL errors can have distinguishable distributions. FIG. 4 is a schematic diagram showing non-limiting exemplary distributions of molecular label errors, sample label errors, and true molecular label signals. As shown in FIG. 4, error MLs are expected to have lower ML coverage because they can result from PCR or sequencing errors. For example, error MLs can result from most sequencing errors and some PCR errors. SL errors are also expected to have lower ML coverage because they can result largely from stochastic barcodes that carry over into PCR.

Correction of PCR and Sequencing Errors Based on Directional Adjacency
Disclosed herein are methods for correcting PCR or sequencing errors. In some embodiments, the method comprises (a) receiving sequencing data of stochastically barcoded targets. The stochastically barcoded targets can be obtained by stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, where each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, the method comprises (b) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data received in (b) using the clusters of molecular labels of the target identified in (ii); and (iv) estimating the number of the target, where the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii). The plurality of targets comprises targets of the whole transcriptome of a cell. In some embodiments, the method further comprises: (c) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets; and (d) sequencing the stochastically barcoded targets to generate the received sequencing data of the stochastically barcoded targets.

FIG. 7 is a flowchart showing a non-limiting exemplary embodiment 700 of correcting PCR or sequencing errors using molecular labels based on directional adjacency. Correcting PCR or sequencing errors using molecular labels based on directional adjacency is sometimes referred to as recursive substitution error correction (RSEC). Method 700 begins at block 704 after sequencing data of a plurality of stochastically barcoded targets has been received. In some embodiments, method 700 further comprises stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, where each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, method 700 further comprises sequencing the plurality of stochastically barcoded targets to obtain the sequencing data.

At block 708, for one or more of the plurality of targets, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be counted. At block 712, clusters of molecular labels of the target can be identified using directional adjacency. The molecular labels of the target within a cluster can lie within a predetermined directional adjacency threshold of one another. The directional adjacency threshold can vary. In some embodiments, the predetermined directional adjacency threshold can be, be about, be at least, or be at most, a Hamming distance of 1 or 2.

In some embodiments, the molecular labels of the target within a cluster can include one or more parent molecular labels and one or more child molecular labels of the one or more parent molecular labels. The number of occurrences of a parent molecular label can be greater than or equal to a predetermined directional adjacency occurrence threshold. In some embodiments, the predetermined directional adjacency occurrence threshold can be, be about, be at least, or be at most, 2 × (the number of occurrences of the child molecular label) − 1. In some embodiments, the predetermined directional adjacency occurrence threshold can be, or be about, 1.5, 2, 3, 4, 5, 6, 7, 8, 9, or 10 times the number of occurrences of the child molecular label, or a number or a range between any two of these values. In some embodiments, the predetermined directional adjacency occurrence threshold can be at least, or at most, 1.5, 2, 3, 4, 5, 6, 7, 8, 9, or 10 times the number of occurrences of the child molecular label.

At block 720, the sequencing data is collapsed using the clusters of molecular labels of the target. Collapsing the sequencing data can include attributing the number of occurrences of a child molecular label to its parent molecular label. At block 732, after the sequencing data has been collapsed, the number of targets can be estimated to generate an output. Method 700 ends at block 736.
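Blocks 712 and 720 can be sketched as a directional clustering pass over the label counts. The sketch uses the Hamming distance threshold of 1 and the occurrence rule parent ≥ 2 × child − 1 stated above. Processing labels from most to least abundant, and following children of children recursively, are assumptions made for this illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length labels."""
    return sum(x != y for x, y in zip(a, b))

def collapse_directional(counts, max_distance=1):
    """Collapse molecular label counts into clusters by directional adjacency.

    counts maps molecular label sequence -> number of occurrences (reads).
    Returns a dict mapping each cluster's top parent label -> summed occurrences.
    Labels are assumed to have equal length.
    """
    remaining = dict(counts)
    collapsed = {}
    # Visit labels from most to least abundant so parents are seen before children.
    for label in sorted(counts, key=lambda m: (-counts[m], m)):
        if label not in remaining:
            continue  # already absorbed into an earlier cluster
        total = remaining.pop(label)
        frontier = [label]
        while frontier:
            node = frontier.pop()
            for other in list(remaining):
                close = hamming(node, other) <= max_distance
                dominant = counts[node] >= 2 * counts[other] - 1
                if close and dominant:
                    # Attribute the child's occurrences to the cluster's parent.
                    total += remaining.pop(other)
                    frontier.append(other)
        collapsed[label] = total
    return collapsed
```

The number of keys in the returned dict is the collapsed molecular label count for the target. Two labels with similar abundance, such as 6 and 5 occurrences, stay separate even when they are one base apart, because neither satisfies the occurrence rule relative to the other.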

In some embodiments, the method further comprises determining the sequencing depth of the target. Estimating the number of the target includes adjusting the sequencing data counted in (i) if the sequencing depth of the target exceeds a predetermined sequencing depth threshold. The predetermined sequencing depth threshold can be between 15 and 20. Adjusting the sequencing data counted in (i) includes thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels associated with the target in the sequencing data obtained in (b). Thresholding the molecular labels of the target includes performing a statistical analysis on the molecular labels of the target. Performing the statistical analysis includes: fitting the distribution of the molecular labels of the target and their numbers of occurrences to two distributions, such as two negative binomial distributions; determining the number n of true molecular labels using the two negative binomial distributions; and removing the false molecular labels from the sequencing data obtained in (b), where the false molecular labels include molecular labels with fewer occurrences than the n-th most abundant molecular label, and the true molecular labels include molecular labels with at least as many occurrences as the n-th most abundant molecular label.

Correction of PCR and Sequencing Errors Based on Directional Adjacency and Second Derivatives
Disclosed herein are methods for determining the number of targets. In some embodiments, a method comprises: (a) stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create a plurality of stochastically barcoded targets, where each of the plurality of stochastic barcodes comprises a molecular label; (b) obtaining sequencing data of the stochastically barcoded targets; and (c) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) identifying clusters of molecular labels of the target using directional adjacency; (iii) collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii); and (iv) estimating the number of the target, where the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii). The plurality of targets can include targets of the whole transcriptome of a cell.

In some embodiments, the method comprises determining the sequencing status of the target in the sequencing data. The sequencing status of the target in the sequencing data can be, or include, saturated sequencing. In some embodiments, if the sequencing status of the target in the sequencing data is a saturated sequencing status, the number of the target estimated in (iv) correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i).

In some embodiments, the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i) after SL errors are corrected. Correcting SL errors comprises: generating a cumulative sum plot of the molecular labels with distinct sequences associated with the target in the sequencing data; determining a second derivative of the cumulative sum plot; and determining an ML read depth cutoff based on a minimum of the second derivative of the cumulative sum plot. In some embodiments, correcting SL errors can include removing the molecular labels with distinct sequences associated with the target in the sequencing data that have a read depth lower than the determined ML read depth cutoff.

FIG. 8 is a flowchart showing a non-limiting exemplary embodiment 800 of correcting PCR and sequencing errors using molecular labels based on directional adjacency and second derivatives. The method 800 starts at block 804 after sequencing data of a plurality of stochastically barcoded targets is received. In some embodiments, the method 800 further includes stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate the plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, the method 800 further includes sequencing the plurality of stochastically barcoded targets to obtain the sequencing data.

At block 808, for one or more of the plurality of targets, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be counted. At decision block 812, whether the sequencing data has a saturated sequencing status can be determined. The saturated sequencing status can be determined by the target having a number of molecular labels with distinct sequences greater than a predetermined saturation threshold. The predetermined saturation threshold can differ between implementations. For example, if the stochastic barcodes have about 6561 molecular labels with distinct sequences, the predetermined saturation threshold can be about 6557. As another example, if the stochastic barcodes have about 65536 molecular labels with distinct sequences, the predetermined saturation threshold can be about 65532.
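The saturation check at decision block 812 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `is_saturated` and the `margin` parameter are hypothetical, and the default margin of 4 is only inferred from the two example pairs above (6561 and 6557; 65536 and 65532).

```python
def is_saturated(distinct_label_count: int, label_pool_size: int, margin: int = 4) -> bool:
    """Return True if a target's distinct-label count exceeds the saturation threshold.

    Assumption: the threshold is the label pool size minus a small margin,
    which reproduces the example pairs 6561 -> 6557 and 65536 -> 65532.
    """
    saturation_threshold = label_pool_size - margin
    return distinct_label_count > saturation_threshold


# A target that uses 6560 of 6561 possible labels is treated as saturated.
print(is_saturated(6560, 6561))
# A target that uses 1200 labels is not.
print(is_saturated(1200, 6561))
```

Other implementations could instead express the threshold as a fraction of the label pool, as the percentage-based examples in the following paragraph suggest.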

If the sequencing data does not have the saturated sequencing status at decision block 812, the method 800 can proceed to block 816, where the molecular label counts can be adjusted based on directional adjacency. A target can be considered to have the saturated sequencing status if, for example, it has a number of molecular labels with distinct sequences greater than 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, or a number or a range between any two of these values. As another example, a target can be considered to have the saturated sequencing status if it has a number of molecular labels with distinct sequences greater than 50%, 60%, 70%, 80%, 90%, 95%, 99%, or 99.9% of the molecular barcodes of the stochastic barcodes with distinct sequences, or a number or a range between any two of these values. In some embodiments, adjusting the molecular counts based on directional adjacency can be as described with reference to FIG. 7. For example, adjusting the molecular counts based on directional adjacency can include: identifying clusters of molecular labels of the target using directional adjacency; collapsing the sequencing data using the identified clusters of molecular labels of the target; and estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted after the sequencing data is collapsed.
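The cluster-and-collapse steps can be sketched as follows. FIG. 7 is not reproduced in this section, so the adjacency rule used here is an assumption: two labels are treated as adjacent when they differ at one position (Hamming distance 1) and the parent has at least twice the child's occurrences minus one. The function names and the example counts are hypothetical, and labels are assumed to have equal length.

```python
from collections import deque


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length labels."""
    return sum(x != y for x, y in zip(a, b))


def collapse_directional(counts: dict, max_distance: int = 1) -> dict:
    """Cluster labels by directional adjacency; return {parent_label: summed_occurrences}.

    Assumed rule: a child joins a parent's cluster when the two labels are
    within max_distance and count(parent) >= 2 * count(child) - 1.
    """
    # Visit labels from most to least abundant so likely parents come first.
    order = sorted(counts, key=lambda label: (-counts[label], label))
    assigned = set()
    clusters = {}
    for root in order:
        if root in assigned:
            continue
        assigned.add(root)
        total = 0
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            total += counts[parent]
            for child in order:
                if child in assigned:
                    continue
                if (hamming(parent, child) <= max_distance
                        and counts[parent] >= 2 * counts[child] - 1):
                    assigned.add(child)
                    queue.append(child)
        clusters[root] = total
    return clusters


counts = {"AAAA": 100, "AAAT": 3, "AATT": 1, "CCCC": 50, "CCCG": 40}
collapsed = collapse_directional(counts)
# Five observed labels collapse to three clusters: AAAT and AATT fold into AAAA,
# while CCCG stays separate because it is too abundant to be an error of CCCC.
print(collapsed)
print(len(collapsed))
```

The number of clusters remaining after the collapse is the adjusted count of molecular labels with distinct sequences for the target.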

At block 820, a second derivative of a cumulative sum plot can be determined. Determining the second derivative of the cumulative sum plot can include generating a cumulative sum plot of the molecular labels with distinct sequences associated with the target in the sequencing data.

At block 824, the molecular labels can be adjusted based on an ML read depth cutoff. The ML read depth cutoff can be based on a minimum (e.g., a local minimum or a global minimum) of the second derivative of the cumulative sum plot. In some embodiments, correcting SL errors can include removing the molecular labels with distinct sequences associated with the target in the sequencing data that have a read depth lower than the determined ML read depth cutoff.
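One way to read blocks 820 and 824 is sketched below. The text does not specify the axes of the cumulative sum plot, so this sketch assumes that labels are sorted by read depth in descending order, that the plot is the running total of read depth, and that the global minimum of the discrete second difference marks the knee. The function names and example depths are hypothetical.

```python
def ml_read_depth_cutoff(read_depths):
    """Estimate a read depth cutoff from the second difference of a cumulative sum.

    Assumptions: depths are sorted in descending order before summing, and the
    global minimum of the discrete second derivative marks the last label that
    is kept.
    """
    depths = sorted(read_depths, reverse=True)
    if len(depths) < 3:
        # Too few points for a second difference; keep everything.
        return min(depths, default=0)
    cumulative = []
    running = 0
    for depth in depths:
        running += depth
        cumulative.append(running)
    # Discrete second derivative of the cumulative sum curve.
    second = [cumulative[i + 2] - 2 * cumulative[i + 1] + cumulative[i]
              for i in range(len(cumulative) - 2)]
    knee = min(range(len(second)), key=second.__getitem__)
    return depths[knee + 1]


def remove_low_depth_labels(label_depths: dict) -> dict:
    """Drop labels whose read depth is lower than the estimated cutoff."""
    cutoff = ml_read_depth_cutoff(label_depths.values())
    return {label: d for label, d in label_depths.items() if d >= cutoff}


depths = {"AAAA": 50, "CCCC": 48, "GGGG": 47, "TTTT": 45,
          "AAAC": 3, "CCCA": 2, "GGGA": 1, "TTTA": 1}
print(ml_read_depth_cutoff(depths.values()))
print(sorted(remove_low_depth_labels(depths)))
```

In this example the sharpest drop in read depth lies between 45 and 3, so the four low-depth labels are removed and four labels remain.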

At block 828, the number of the target can be estimated to generate an output after the sequencing data is collapsed and SL errors are corrected. If the sequencing data has the saturated sequencing status at decision block 812, the method 800 can proceed to block 828 to generate the output without collapsing the sequencing data or correcting SL errors. The method 800 ends at block 832.

Correction of PCR and Sequencing Errors Based on Directional Adjacency and Distribution-Based Error Correction
Disclosed herein are methods for correcting PCR or sequencing errors. The methods can be used to determine the number of a target. In some embodiments, the method comprises: (a) receiving sequencing data of stochastically barcoded targets. The stochastically barcoded targets can be obtained by stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate a plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, the method comprises: (b) for one or more of the plurality of targets: (i) counting the number of molecular labels with distinct sequences associated with the target in the sequencing data; (ii) determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data; and (iii) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted in (i), adjusted according to the number of noise molecular labels determined in (ii). In some embodiments, the method comprises determining a sequencing status of the target in the sequencing data. In some embodiments, the method further comprises: (c) stochastically barcoding the plurality of targets using the plurality of stochastic barcodes to generate the plurality of stochastically barcoded targets; and (d) sequencing the stochastically barcoded targets to generate the received sequencing data of the stochastically barcoded targets.

FIG. 9 is a flowchart showing a non-limiting exemplary embodiment 900 of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction. The method 900 starts at block 904 after sequencing data of a plurality of stochastically barcoded targets is received. In some embodiments, the method 900 further includes stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate the plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, the method 900 further includes sequencing the plurality of stochastically barcoded targets to obtain the sequencing data.

At block 908, for one or more of the plurality of targets, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be counted. At decision block 912, whether the sequencing data has a saturated sequencing status can be determined. For example, a target can be considered to have the saturated sequencing status if it has a number of molecular labels with distinct sequences greater than 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, or a number or a range between any two of these values. As another example, a target can be considered to have the saturated sequencing status if it has a number of molecular labels with distinct sequences greater than 50%, 60%, 70%, 80%, 90%, 95%, 99%, or 99.9% of the molecular barcodes of the stochastic barcodes with distinct sequences, or a number or a range between any two of these values.

In some embodiments, the saturated sequencing status can be determined by the target having a number of molecular labels with distinct sequences greater than a predetermined saturation threshold. The predetermined saturation threshold can differ between implementations. For example, the predetermined saturation threshold can be, or be about, 1000, 2000, 3000, 4000, 5000, 6000, 6557, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 65532, 70000, 80000, 90000, 100000, or a number or a range between any two of these values. As another example, the predetermined saturation threshold can be at least, or at most, 1000, 2000, 3000, 4000, 5000, 6000, 6557, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 65532, 70000, 80000, 90000, or 100000.

In some embodiments, the saturated sequencing status can depend on the number of molecular labels of the stochastic barcodes with distinct sequences. For example, if the stochastic barcodes have about 6561 molecular labels with distinct sequences, the predetermined saturation threshold can be about 6557. As another example, if the stochastic barcodes have about 65536 molecular labels with distinct sequences, the predetermined saturation threshold can be about 65532. In some embodiments, the saturated sequencing status can be independent of the number of molecular labels of the stochastic barcodes with distinct sequences.

If the sequencing data does not have the saturated sequencing status at decision block 912, the method 900 proceeds to block 916, where the molecular label counts can be adjusted based on directional adjacency. In some embodiments, adjusting the molecular counts based on directional adjacency can be as described with reference to FIG. 7. For example, adjusting the molecular counts based on directional adjacency includes: identifying clusters of molecular labels of the target using directional adjacency; collapsing the sequencing data using the identified clusters of molecular labels of the target; and estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels with distinct sequences associated with the target in the sequencing data counted after the sequencing data is collapsed.

At block 920, a sequencing status of the target in the sequencing data can be determined. The sequencing status of the target in the sequencing data can include, or can be, under-sequencing. At decision block 924, whether the sequencing status of the target in the sequencing data is the under-sequencing status can be determined. For example, a target can be considered to have the under-sequencing status if its depth (e.g., mean, minimum, or maximum depth) is less than, or less than about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. As another example, a target can be considered to have the under-sequencing status if its depth is less than at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.

In some embodiments, the under-sequencing status can be determined by the target having a depth (e.g., mean, minimum, or maximum depth) less than a predetermined under-sequencing threshold. The under-sequencing threshold can differ between implementations. For example, the under-sequencing threshold can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. As another example, the under-sequencing threshold can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.

In some embodiments, the under-sequencing status can depend on the number of molecular labels of the stochastic barcodes with distinct sequences. For example, if the stochastic barcodes have, or have about, 1000, 2000, 3000, 4000, 5000, 6000, 6561, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 65532, 70000, 80000, 90000, 100000, or a number or a range between any two of these values, molecular labels with distinct sequences, the under-sequencing threshold can be 10 (or another threshold number). As another example, if the stochastic barcodes include at least, or at most, 1000, 2000, 3000, 4000, 5000, 6000, 6561, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 65532, 70000, 80000, 90000, or 100000 molecular labels, the under-sequencing threshold can be 10 (or another threshold number). In some embodiments, the saturated sequencing status can be independent of the number of molecular labels of the stochastic barcodes with distinct sequences.

If the sequencing status of the target in the sequencing data is not the under-sequencing status at decision block 924, the method 900 can proceed to block 928 to filter the molecular label counts. Filtering the molecular label counts includes determining, at decision block 932, whether the number of molecular labels with distinct sequences associated with the target in the sequencing data is less than a pseudopoint threshold. The pseudopoint threshold can differ between implementations. For example, if the stochastic barcodes have about 6561 molecular labels with distinct sequences, the pseudopoint threshold can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. As another example, if the stochastic barcodes have about 6561 molecular labels with distinct sequences, the pseudopoint threshold can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.

If the number of molecular labels with distinct sequences associated with the target in the sequencing data is less than the pseudopoint threshold at decision block 932, the method 900 can optionally proceed to block 936, where pseudopoints can be added to the number of molecular labels with distinct sequences associated with the target in the sequencing data before the number of noise molecular labels with distinct sequences associated with the target in the sequencing data is determined. The pseudopoints can have different molecular label counts in different implementations. For example, the molecular label count of a pseudopoint can be, or be about, 0.0001, 0.001, 0.01, 0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. As another example, the molecular label count of a pseudopoint can be at least, or at most, 0.0001, 0.001, 0.01, 0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100. In some embodiments, if the number of molecular labels with distinct sequences associated with the target in the sequencing data is less than the pseudopoint threshold, the method 900 can proceed to block 944.

If the number of molecular labels with distinct sequences associated with the target in the sequencing data is greater than or equal to the pseudopoint threshold at decision block 932, non-unique molecular labels can be removed at block 940. The non-unique molecular labels can be removed so that the number of noise molecular labels with distinct sequences associated with the target in the sequencing data can be determined at block 944. The non-unique molecular labels can include molecular labels with distinct sequences associated with the target in the sequencing data whose occurrences are greater than a predetermined reused molecular label threshold. The reused molecular label threshold can differ between implementations. For example, if the stochastic barcodes include about 6561 molecular labels with distinct sequences, the reused molecular label threshold can be, or be about, 100, 200, 300, 400, 500, 600, 650, 700, 900, 1000, 2000, or a number or a range between any two of these values. As another example, if the stochastic barcodes include about 6561 molecular labels with distinct sequences, the reused molecular label threshold can be at least, or at most, 100, 200, 300, 400, 500, 600, 650, 700, 900, 1000, or 2000.

In some embodiments, removing the non-unique molecular labels includes determining a theoretical number of non-unique molecular labels for the number of molecular labels with distinct sequences associated with the target in the sequencing data. Removing the non-unique molecular labels can include removing the molecular labels whose occurrences are greater than those of the n-th most abundant molecular label among the molecular labels with distinct sequences associated with the target in the sequencing data. The number n can be the theoretical number of non-unique molecular labels.
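The removal rule at block 940 can be sketched as follows, taking the theoretical number n as a given input because this passage does not state how it is computed. The function name `remove_non_unique` and the example counts are hypothetical.

```python
def remove_non_unique(counts: dict, n: int) -> dict:
    """Drop labels that occur more often than the n-th most abundant label.

    n is assumed to be supplied by the caller (the theoretical number of
    non-unique molecular labels); its derivation is not shown here.
    """
    if n <= 0 or n > len(counts):
        # No usable rank; leave the counts unchanged.
        return dict(counts)
    nth_occurrence = sorted(counts.values(), reverse=True)[n - 1]
    return {label: c for label, c in counts.items() if c <= nth_occurrence}


counts = {"AAAA": 900, "AAAT": 700, "CCCC": 30, "CCCG": 28, "GGGG": 25}
# With n = 3, the third most abundant label has 30 occurrences, so the two
# labels with more occurrences than that are removed.
print(remove_non_unique(counts, 3))
```

Because the comparison is strict, the n-th most abundant label itself is retained; only labels with strictly more occurrences are dropped before the distribution fit.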

At block 944, the molecular label counts can be adjusted using a distribution-based error correction method. The distribution-based error correction method can include determining the number of noise molecular labels with distinct sequences associated with the target in the sequencing data. Determining the number of noise molecular labels can include fitting two negative binomial distributions to the number of molecular labels with distinct sequences associated with the target in the sequencing data. For example, determining the number of noise molecular labels can include fitting a signal negative binomial distribution (one of the two negative binomial distributions) to the counted number of molecular labels with distinct sequences associated with the target in the sequencing data, wherein the signal negative binomial distribution corresponds to the number of counted molecular labels with distinct sequences associated with the target that are signal molecular labels. Determining the number of noise molecular labels can include fitting a noise negative binomial distribution (the other of the two negative binomial distributions) to the counted number of molecular labels with distinct sequences associated with the target in the sequencing data, wherein the noise negative binomial distribution corresponds to the number of counted molecular labels with distinct sequences associated with the target that are noise molecular labels. Determining the number of noise molecular labels can include determining the number of noise molecular labels using the fitted signal negative binomial distribution and the fitted noise negative binomial distribution.

In some embodiments, determining the number of noise molecular labels using the fitted signal negative binomial distribution and the fitted noise negative binomial distribution comprises, for each of the distinct sequences associated with the target in the sequencing data: determining a signal probability of the distinct sequence being in the signal negative binomial distribution; determining a noise probability of the distinct sequence being in the noise negative binomial distribution; and determining the distinct sequence to be a noise molecular label if the signal probability is smaller than the noise probability. In some embodiments, adjusting the molecular label counts at block 944 can include removing singletons (e.g., single-base substitutions) if fewer than two peaks are found, because two peaks may be required to determine the signal negative binomial distribution and the noise negative binomial distribution.
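The per-sequence comparison can be sketched as follows. The sketch assumes that the two negative binomial distributions have already been fitted and that each is described by a pair (r, p) in the standard parameterization, with the read depth of a label as the observed value; the fitting step, the parameter values, and all names are hypothetical, and mixture weights are deliberately omitted.

```python
import math


def nb_log_pmf(k: int, r: float, p: float) -> float:
    """Log-probability of k under a negative binomial with size r and success probability p."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))


def find_noise_labels(read_depths: dict, signal_params: tuple, noise_params: tuple) -> list:
    """Return labels whose signal probability is smaller than their noise probability."""
    return [label for label, depth in read_depths.items()
            if nb_log_pmf(depth, *signal_params) < nb_log_pmf(depth, *noise_params)]


# Hypothetical fitted parameters: signal centered near 30 reads, noise near 1 read.
signal_params = (20.0, 0.4)
noise_params = (1.0, 0.5)
read_depths = {"AAAA": 35, "AAAT": 1, "CCCC": 28, "CCCG": 2}

noise = find_noise_labels(read_depths, signal_params, noise_params)
print(noise)
# Adjusted count: counted distinct labels minus the labels classified as noise.
print(len(read_depths) - len(noise))
```

Working in log space avoids underflow for deep targets; a fuller treatment might also weight each distribution by its estimated mixing proportion.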

At block 948, the number of the target can be estimated to generate an output after the adjacency-based error correction and the distribution-based error correction. If the sequencing status of the target in the sequencing data is the saturated sequencing status at decision block 912, the method 900 can proceed to block 948 to generate the output without adjusting the molecular labels based on directional adjacency or distribution-based error correction. For example, the determined number of noise molecular labels can be zero.

If the sequencing status of the target in the sequencing data is the under-sequencing status at decision block 924, the method 900 can proceed to block 948 to generate the output without adjusting the molecular labels based on distribution-based error correction. For example, the determined number of noise molecular labels can be zero. The method 900 ends at block 952.
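The branching of method 900 can be summarized in a short sketch that returns the sequence of blocks visited for one target. It only mirrors the decisions described above; the function name, the parameter names, and the use of a single distinct-label count for both the saturation and pseudopoint checks are simplifying assumptions.

```python
def plan_method_900(distinct_labels: int, depth: float,
                    saturation_threshold: int,
                    undersequencing_threshold: float,
                    pseudopoint_threshold: int) -> list:
    """List the blocks of method 900 that would run for one target."""
    steps = ["908 count molecular labels"]
    # Decision block 912: saturated targets skip all corrections.
    if distinct_labels > saturation_threshold:
        return steps + ["948 estimate target number"]
    steps.append("916 adjust by directional adjacency")
    # Decision block 924: under-sequenced targets skip distribution-based correction.
    if depth < undersequencing_threshold:
        return steps + ["948 estimate target number"]
    # Decision block 932: few labels get pseudopoints, otherwise remove non-unique labels.
    if distinct_labels < pseudopoint_threshold:
        steps.append("936 add pseudopoints")
    else:
        steps.append("940 remove non-unique labels")
    steps.append("944 distribution-based correction")
    steps.append("948 estimate target number")
    return steps


for step in plan_method_900(distinct_labels=500, depth=40,
                            saturation_threshold=6557,
                            undersequencing_threshold=10,
                            pseudopoint_threshold=10):
    print(step)
```

In a fuller implementation the distinct-label count tested at block 932 would be the count after the directional adjacency adjustment rather than the raw count.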

FIG. 10 is a flowchart showing a non-limiting exemplary embodiment 1000 of error correction using two negative binomial distributions. The blocks of the method 1000 (e.g., blocks 904-952) are described with reference to FIG. 9. Briefly, the method 1000 starts at block 904 after sequencing data of a plurality of stochastically barcoded targets is received. In some embodiments, the method 1000 further includes stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to generate the plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, the method 1000 further includes sequencing the plurality of stochastically barcoded targets to obtain the sequencing data.

At block 908, for one or more of the plurality of targets, the number of molecular labels with identifiable sequences associated with the target in the sequencing data can be counted. At block 916, the molecular label count can be adjusted based on directional proximity. In some embodiments, adjusting the molecular label count based on directional proximity can be as described with reference to FIG. 7.

At block 920, the sequencing status of the target in the sequencing data can be determined. The sequencing status of the target in the sequencing data can be, or can include, under-sequencing. At decision block 924, whether the sequencing status of the target in the sequencing data is the under-sequencing status can be determined.

If, at decision block 924, the sequencing status of the target in the sequencing data is not the under-sequencing status, method 1000 can optionally proceed to decision block 1004. At decision block 1004, the sequencing depth of the target can be compared with a predetermined sequencing depth threshold. The sequencing depth threshold can differ between implementations. For example, the sequencing depth of the target can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. As another example, the sequencing depth of the target can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.

If the sequencing depth of the target is greater than the sequencing depth threshold, method 1000 proceeds to block 928. If the sequencing depth of the target is less than or equal to the sequencing depth threshold, method 1000 proceeds to block 1008. At block 1008, singletons (e.g., single-base substitutions) can be removed before the output is generated at block 948.
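The depth check at decision block 1004 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `route_by_depth`, the definition of depth as the mean number of reads per molecular label, and the treatment of a singleton as a molecular label supported by a single read are all hypothetical choices made for the example.

```python
def route_by_depth(label_reads, depth_threshold):
    """Return (next_block, labels) for one target.

    label_reads maps each molecular label sequence to its read count.
    Assumption: depth is the mean number of reads per molecular label.
    """
    n_labels = len(label_reads)
    depth = sum(label_reads.values()) / n_labels if n_labels else 0.0
    if depth > depth_threshold:
        # Deep enough: continue to filtering at block 928 with all labels.
        return 928, dict(label_reads)
    # Block 1008 (assumed singleton definition): drop labels seen in only
    # one read, then go to output generation at block 948.
    kept = {ml: n for ml, n in label_reads.items() if n > 1}
    return 948, kept
```

With a threshold below the target's depth, every label is passed on to block 928; otherwise only labels supported by more than one read reach block 948.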

At block 928, the molecular label count can be filtered. Filtering the molecular label count can include determining, at decision block 912, whether the sequencing data has the saturated sequencing status. If, at decision block 912, the sequencing data does not have the saturated sequencing status, method 1000 can proceed to decision block 932. At decision block 932, whether the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than a pseudopoint threshold can be determined.

If, at decision block 932, the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold, method 1000 can optionally proceed to block 936, where pseudopoints can be added to the number of molecular labels with identifiable sequences associated with the target in the sequencing data before the number of noise molecular labels with identifiable sequences associated with the target in the sequencing data is determined. In some embodiments, if the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold, method 1000 can proceed to block 944.

If, at decision block 932, the number of molecular labels with identifiable sequences associated with the target in the sequencing data is greater than or equal to the pseudopoint threshold, non-unique molecular labels can be removed at block 940. The non-unique molecular labels can be removed so that the number of noise molecular labels with identifiable sequences associated with the target in the sequencing data can be determined at block 944. The non-unique molecular labels can include molecular labels with identifiable sequences associated with the target in the sequencing data that exceed a predetermined reused molecular label threshold.

At block 944, the molecular label count can be adjusted using a distribution-based error correction method. The distribution-based error correction method can include determining the number of noise molecular labels with identifiable sequences associated with the target in the sequencing data. Determining the number of noise molecular labels can include fitting two negative binomial distributions, a signal negative binomial distribution and a noise negative binomial distribution, to the number of molecular labels with identifiable sequences associated with the target in the sequencing data. The signal negative binomial distribution corresponds to the counted molecular labels with identifiable sequences associated with the target in the sequencing data that are signal molecular labels. The noise negative binomial distribution corresponds to the counted molecular labels with identifiable sequences associated with the target in the sequencing data that are noise molecular labels. Determining the number of noise molecular labels can include using the fitted signal negative binomial distribution and the fitted noise negative binomial distribution to determine the number of noise molecular labels.
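Once the two distributions have been fitted, the last step of block 944 (using them to determine the number of noise molecular labels) might look like the sketch below. It does not perform the fit itself. The mean/size parameterization, the mixture weight `noise_weight`, the rule of assigning each molecular label to the more probable component, and the function names are assumptions made for illustration; the text above does not specify them.

```python
import math

def nb_log_pmf(k, mean, size):
    """Log-probability of read count k under a negative binomial
    parameterized by its mean and size (dispersion) parameter."""
    p = size / (size + mean)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log1p(-p))

def count_noise_labels(read_counts, noise, signal, noise_weight=0.5):
    """Count molecular labels whose read count is better explained by the
    fitted noise distribution than by the fitted signal distribution.

    noise and signal are (mean, size) tuples from a prior fit (assumed given).
    """
    n_noise = 0
    for k in read_counts:
        log_noise = math.log(noise_weight) + nb_log_pmf(k, *noise)
        log_signal = math.log(1.0 - noise_weight) + nb_log_pmf(k, *signal)
        if log_noise > log_signal:
            n_noise += 1
    return n_noise
```

The adjusted molecular label count for the target would then be the number of counted molecular labels minus the value returned by `count_noise_labels`.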

At block 948, the number of targets can be estimated, and output can be generated after the proximity-based error correction and the distribution-based error correction. If, at decision block 912, the sequencing status of the target in the sequencing data is the saturated sequencing status, method 1000 can proceed to block 948 and generate the output without adjusting the molecular label count based on directional proximity or distribution-based error correction.

If, at decision block 924, the sequencing status of the target in the sequencing data is the under-sequencing status, method 1000 can proceed to block 948 and generate the output without adjusting the molecular label count based on distribution-based error correction. For example, the number of noise molecular labels determined can be zero. Method 1000 ends at block 952.

Correction of PCR and Sequencing Errors Based on Directional Proximity, Distribution-Based Error Correction, and Subsampling
FIG. 11 is a flowchart showing a non-limiting exemplary embodiment 1100 of error correction using two negative binomial distributions. The blocks of method 1100 (e.g., blocks 904-952) are described with reference to FIG. 9. Briefly, method 1100 begins at block 904 after receiving sequencing data of a plurality of probabilistically barcoded targets. In some embodiments, method 1100 further includes probabilistically barcoding a plurality of targets using a plurality of probabilistic barcodes to generate the plurality of probabilistically barcoded targets, wherein each of the plurality of probabilistic barcodes includes a molecular label. In some embodiments, method 1100 further includes sequencing the plurality of probabilistically barcoded targets to obtain the sequencing data.

At block 908, for one or more of the plurality of targets, the number of molecular labels with identifiable sequences associated with the target in the sequencing data can be counted. At decision block 912, whether the sequencing data has the saturated sequencing status can be determined. If, at decision block 912, the sequencing data does not have the saturated sequencing status, method 1100 proceeds to block 916, where the molecular label count can be adjusted based on directional proximity. In some embodiments, adjusting the molecular label count based on directional proximity can be as described with reference to FIG. 7.

At block 920, the sequencing status of the target in the sequencing data can be determined. The sequencing status of the target in the sequencing data can be, or can include, under-sequencing. At decision block 924, whether the sequencing status of the target in the sequencing data is the under-sequencing status can be determined.

If, at decision block 924, the sequencing status of the target in the sequencing data is not the under-sequencing status, method 1100 can optionally proceed to decision block 1104. At decision block 1104, whether the sequencing status of the target in the sequencing data is the over-sequencing status can be determined. For example, a target can be considered to have the over-sequencing status, or to be a highly expressed target, if its depth (e.g., mean, minimum, or maximum depth) is greater than, or greater than about, 50, 100, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, or a number or a range between any two of these values. As another example, a target can be considered to have the over-sequencing status if its depth is greater than at least, or at most, 50, 100, 200, 250, 300, 400, 500, 600, 700, 800, 900, or 1000.

In some embodiments, the over-sequencing status, or a highly expressed target, can be identified by the target having a depth (e.g., mean, minimum, or maximum depth) greater than a predetermined over-sequencing threshold. The over-sequencing threshold can differ between implementations. For example, the over-sequencing threshold can be, or be about, 50, 100, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, or a number or a range between any two of these values. As another example, the over-sequencing threshold can be at least, or at most, 50, 100, 200, 250, 300, 400, 500, 600, 700, 800, 900, or 1000.

In some embodiments, the over-sequencing status can depend on the number of molecular labels with identifiable sequences in the probabilistic barcodes. For example, if the probabilistic barcodes include about 6561 molecular labels with identifiable sequences, the over-sequencing threshold can be, or be about, 50, 100, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, or a number or a range between any two of these values. As another example, if the probabilistic barcodes include about 6561 molecular labels with identifiable sequences, the over-sequencing threshold can be at least, or at most, 50, 100, 200, 250, 300, 400, 500, 600, 700, 800, 900, or 1000. In some embodiments, the under-sequencing status may be independent of the number of molecular labels with identifiable sequences in the probabilistic barcodes.

If, at decision block 1104, the target has the over-sequencing status, method 1100 proceeds to block 1108. At block 1108, the molecular label (ML) coverage of the target can be reduced, for example by subsampling the ML coverage of all targets. For example, the number of molecular labels with identifiable sequences associated with the target in the sequencing data can be subsampled to approximately a predetermined over-sequencing threshold for all targets (e.g., 10). Method 1100 proceeds from block 1108 to block 928.
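One way to picture the subsampling at block 1108 is shown below. This is a sketch under assumptions, not the patented procedure: it treats ML coverage as the mean number of reads per molecular label and draws reads uniformly at random without replacement until that mean is about `target_depth`. The function name, the `seed` parameter, and the uniform sampling scheme are hypothetical.

```python
import random
from collections import Counter

def subsample_reads(label_reads, target_depth, seed=0):
    """Subsample reads so the mean reads per molecular label is about
    target_depth (assumed meaning of reducing ML coverage).

    label_reads maps each molecular label sequence to its read count.
    """
    # Expand the per-label counts into one entry per read.
    reads = [ml for ml, n in label_reads.items() for _ in range(n)]
    n_keep = min(len(reads), round(target_depth * len(label_reads)))
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    # Sample without replacement and re-count reads per label.
    return Counter(rng.sample(reads, n_keep))
```

Labels with very few reads can disappear entirely after subsampling, so the depth per surviving label may end up slightly above `target_depth`.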

If, at decision block 1104, the target does not have the over-sequencing status, method 1100 proceeds to block 928 to filter the molecular label count. Filtering the molecular label count can include determining, at decision block 932, whether the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold.

At block 944, the molecular label count can be adjusted using a distribution-based error correction method. The distribution-based error correction method can include determining the number of noise molecular labels with identifiable sequences associated with the target in the sequencing data. Determining the number of noise molecular labels can include fitting two negative binomial distributions, a signal negative binomial distribution and a noise negative binomial distribution, to the number of molecular labels with identifiable sequences associated with the target in the sequencing data. The signal negative binomial distribution corresponds to the counted molecular labels with identifiable sequences associated with the target in the sequencing data that are signal molecular labels. The noise negative binomial distribution corresponds to the counted molecular labels with identifiable sequences associated with the target in the sequencing data that are noise molecular labels. Determining the number of noise molecular labels includes using the fitted signal negative binomial distribution and the fitted noise negative binomial distribution to determine the number of noise molecular labels.

After the molecular label count is adjusted using distribution-based error correction at block 944, method 1100 optionally proceeds to block 1112. At block 1112, the adjusted molecular label count from block 944 can be combined with the molecular label count determined at block 916 and adjusted based on directional proximity. For example, because the non-unique molecular labels are removed at block 940, they are not used for the distribution fitting at block 944. However, these molecular labels are still present in the molecular label count determined at block 916 and adjusted based on directional proximity. Accordingly, the adjusted molecular label count from block 944 and the molecular label count adjusted at block 916 can be combined to generate the output at block 948.
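The combination at block 1112 can be illustrated with a small sketch. It assumes that "combining" means adding the non-unique molecular labels, which were set aside at block 940, back to the noise-corrected count from block 944, and that a molecular label is non-unique when its read count exceeds the reuse threshold. Both points, and the name `combine_counts`, are assumptions made for the example.

```python
def combine_counts(labels_after_916, reuse_threshold, n_noise):
    """Return the final molecular label count for one target.

    labels_after_916: label sequence -> read count after the directional
        proximity adjustment at block 916.
    reuse_threshold: read count above which a label is treated as
        non-unique (assumed criterion) and excluded from the fit.
    n_noise: number of noise labels determined at block 944.
    """
    non_unique = {ml for ml, n in labels_after_916.items() if n > reuse_threshold}
    # Labels that took part in the distribution fit at block 944.
    n_fitted = len(labels_after_916) - len(non_unique)
    # Noise-corrected count plus the labels that were set aside at block 940.
    return max(n_fitted - n_noise, 0) + len(non_unique)
```

For example, with five labels of which one exceeds the reuse threshold and two are classified as noise, the combined count is (4 - 2) + 1 = 3.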

If, at decision block 912, the sequencing status of the target in the sequencing data is the saturated sequencing status, method 1100 can proceed to block 948 and generate the output without adjusting the molecular label count based on directional proximity or distribution-based error correction. If, at decision block 924, the sequencing status of the target in the sequencing data is the under-sequencing status, method 1100 can proceed to block 948 and generate the output without adjusting the molecular label count based on distribution-based error correction. For example, the number of noise molecular labels determined can be zero. Method 1100 can end, for example, at block 952.
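The branching of method 1100 described above can be summarized as a sequence of block numbers. The sketch below only traces the control flow for one target; the four boolean inputs stand in for the decisions at blocks 912, 924, 1104, and 932, and the function name is hypothetical. Optional steps (blocks 1104, 936, and 1112) are shown as always taken, which is a simplifying assumption.

```python
def method_1100_path(saturated, under_sequenced, over_sequenced, below_pseudopoint):
    """Return the list of flowchart blocks visited for one target."""
    path = [904, 908, 912]
    if saturated:
        # Saturated: output without any adjustment.
        return path + [948, 952]
    path += [916, 920, 924]
    if under_sequenced:
        # Under-sequenced: output without distribution-based correction.
        return path + [948, 952]
    path.append(1104)
    if over_sequenced:
        path.append(1108)  # subsample before filtering
    path += [928, 932]
    # Add pseudopoints (936) or remove non-unique labels (940).
    path.append(936 if below_pseudopoint else 940)
    return path + [944, 1112, 948, 952]
```

For an over-sequenced target above the pseudopoint threshold, the trace passes through block 1108 before filtering and through block 940 before the fit at block 944.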

FIG. 12 is a flowchart showing a non-limiting exemplary embodiment 1200 of error correction using two negative binomial distributions. The blocks of method 1200 (e.g., blocks 904-952 and block 1104) are described with reference to FIGS. 9 and 11. Briefly, method 1200 begins at block 904 after receiving sequencing data of a plurality of probabilistically barcoded targets. In some embodiments, method 1200 further includes probabilistically barcoding a plurality of targets using a plurality of probabilistic barcodes to generate the plurality of probabilistically barcoded targets, wherein each of the plurality of probabilistic barcodes includes a molecular label. In some embodiments, method 1200 further includes sequencing the plurality of probabilistically barcoded targets to obtain the sequencing data.

At block 908, for one or more of the plurality of targets, the number of molecular labels with identifiable sequences associated with the target in the sequencing data can be counted. At decision block 912, whether the sequencing data has the saturated sequencing status can be determined. If, at decision block 912, the sequencing data does not have the saturated sequencing status, method 1200 proceeds to block 916, where the molecular label count can be adjusted based on directional proximity. In some embodiments, adjusting the molecular label count based on directional proximity can be as described with reference to FIG. 7.

If, at decision block 924, the sequencing status of the target in the sequencing data is not the under-sequencing status, method 1200 can optionally proceed to decision block 1104. At decision block 1104, whether the sequencing status of the target in the sequencing data is the over-sequencing status can be determined.

If, at decision block 1104, the target has the over-sequencing status or is a highly expressed target, method 1200 optionally proceeds to block 1208. At block 1208, the molecular label (ML) coverage of the target can be reduced, for example by subsampling the ML coverage on a per-target basis. For example, the number of molecular labels with identifiable sequences associated with the target in the sequencing data can be subsampled, for each target, to approximately a predetermined over-sequencing threshold. Method 1200 proceeds from block 1208 to block 928.

If, at decision block 1104, the target does not have the over-sequencing status, method 1200 proceeds to block 928 to filter the molecular label count. Filtering the molecular label count can include determining, at decision block 932, whether the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold.

If, at decision block 932, the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold, method 1200 can optionally proceed to block 936, where pseudopoints can optionally be added to the number of molecular labels with identifiable sequences associated with the target in the sequencing data before the number of noise molecular labels with identifiable sequences associated with the target in the sequencing data is determined. In some embodiments, if the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold, method 1200 can proceed to block 944.

At block 944, the molecular label count can be adjusted using a distribution-based error correction method. The distribution-based error correction method can include determining the number of noise molecular labels with identifiable sequences associated with the target in the sequencing data. Determining the number of noise molecular labels can include fitting two negative binomial distributions, a signal negative binomial distribution and a noise negative binomial distribution, to the number of molecular labels with identifiable sequences associated with the target in the sequencing data. The signal negative binomial distribution corresponds to the counted molecular labels with identifiable sequences associated with the target in the sequencing data that are signal molecular labels. The noise negative binomial distribution corresponds to the counted molecular labels with identifiable sequences associated with the target in the sequencing data that are noise molecular labels. Determining the number of noise molecular labels includes using the fitted signal negative binomial distribution and the fitted noise negative binomial distribution to determine the number of noise molecular labels.

After the molecular label count is adjusted using distribution-based error correction at block 944, method 1200 optionally proceeds to block 1112. At block 1112, the adjusted molecular label count from block 944 can be combined with the molecular label count determined at block 916 and adjusted based on directional proximity. For example, because the non-unique molecular labels are removed at block 940, they are not used for the distribution fitting at block 944. However, these molecular labels are still present in the molecular label count determined at block 916 and adjusted based on directional proximity. Accordingly, the adjusted molecular label count from block 944 and the molecular label count adjusted at block 916 can be combined to generate the output at block 948.

If, at decision block 912, the sequencing status of the target in the sequencing data is the saturated sequencing status, method 1200 can proceed to block 948 and generate the output without adjusting the molecular label count based on directional proximity or distribution-based error correction. If, at decision block 924, the sequencing status of the target in the sequencing data is the under-sequencing status, method 1200 can proceed to block 948 and generate the output without adjusting the molecular label count based on distribution-based error correction. For example, the number of noise molecular labels determined can be zero. Method 1200 ends at block 952.

Correction of PCR and Sequencing Errors Based on Directional Proximity and Distribution-Based Error Correction, Using Initial Parameter Estimates for Distribution Fitting
FIG. 13 is a flowchart showing a non-limiting exemplary embodiment 1300 of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction. The blocks of method 1300 (e.g., blocks 904-952) are described with reference to FIG. 9. Briefly, method 1300 begins at block 904 after receiving sequencing data of a plurality of probabilistically barcoded targets. In some embodiments, method 1300 further includes probabilistically barcoding a plurality of targets using a plurality of probabilistic barcodes to generate the plurality of probabilistically barcoded targets, wherein each of the plurality of probabilistic barcodes includes a molecular label. In some embodiments, method 1300 further includes sequencing the plurality of probabilistically barcoded targets to obtain the sequencing data.

At block 908, for one or more of the plurality of targets, the number of molecular labels with identifiable sequences associated with the target in the sequencing data can be counted. At decision block 912, whether the sequencing data has the saturated sequencing status can be determined. If, at decision block 912, the sequencing data does not have the saturated sequencing status, method 1300 proceeds to block 916, where the molecular label count can be adjusted based on directional proximity. In some embodiments, adjusting the molecular label count based on directional proximity can be as described with reference to FIG. 7.

If, at decision block 924, the sequencing status of the target in the sequencing data is not the under-sequencing status, method 1300 can proceed to block 928 to filter the molecular label count. Filtering the molecular label count can include determining, at decision block 932, whether the number of molecular labels with identifiable sequences associated with the target in the sequencing data is less than the pseudopoint threshold.

At decision block 932, if the number of molecular labels with distinct sequences associated with the target in the sequencing data is greater than or equal to the pseudopoint threshold, non-unique molecular labels can be removed at block 940.

Before the molecular label counts are adjusted at block 944, initial parameters of two negative binomial distributions can optionally be estimated at block 1304. The initial parameters of the two negative binomial distributions can differ between implementations. In some embodiments, the mean and the dispersion of each of the two negative binomial distributions can be 1. In some embodiments, the mean and the dispersion of the two negative binomial distributions can be estimated as the mean and the dispersion of a non-empty subset of the filtered molecular label counts from block 928. For example, the subset can be the 25% to 75% quantile range of the filtered molecular label counts from block 928. The upper or lower bound of this quantile range can differ between implementations. In some embodiments, the upper or lower bound can be, or be about, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 70%, 80%, 90%, 99%, or a number or a range between any two of these values. In some embodiments, the upper or lower bound can be at least, or at most, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 70%, 80%, 90%, 99%, or 100%.
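The optional estimate at block 1304 can be illustrated with a minimal Python sketch. The function name `initial_nb_parameters`, its arguments, and the method-of-moments definition of dispersion (variance = mean + dispersion × mean²) are assumptions made for illustration only; the text does not specify how the dispersion is computed.

```python
from statistics import mean, pvariance


def initial_nb_parameters(counts, lower=0.25, upper=0.75):
    """Estimate a starting (mean, dispersion) from a quantile subset of counts.

    `counts` stands for the filtered molecular label counts from block 928.
    """
    if not counts:
        # Fallback named in the text: mean and dispersion both equal to 1.
        return 1.0, 1.0
    ordered = sorted(counts)
    n = len(ordered)
    lo = min(int(lower * n), n - 1)
    hi = max(lo + 1, int(upper * n))  # keep the subset non-empty
    subset = ordered[lo:hi]
    mu = mean(subset)
    var = pvariance(subset)
    # Assumed parameterization: var = mu + dispersion * mu**2.
    dispersion = max((var - mu) / (mu * mu), 0.0) if mu > 0 else 1.0
    return mu, dispersion


mu0, dispersion0 = initial_nb_parameters([1, 1, 2, 2, 3, 50, 52, 55, 60, 61, 64, 70])
```

The 25% and 75% defaults correspond to the example bounds in the text; other bounds can be passed through `lower` and `upper`.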

At block 948, the number of targets can be estimated to generate an output after the adjacency-based error correction and the distribution-based error correction. At decision block 912, if the sequencing status of the target in the sequencing data is a saturated sequencing status, method 1300 proceeds to block 948 to generate the output without adjusting the molecular label counts based on directional adjacency or distribution-based error correction.

At decision block 924, if the sequencing status of the target in the sequencing data is an under-sequencing status, method 1300 proceeds to block 948 to generate the output without adjusting the molecular label counts based on distribution-based error correction. For example, the number of noise molecular labels determined can be zero. Method 1300 ends, for example, at block 952.
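The branching among the blocks of method 1300 can be summarized in a short Python sketch. The function `plan_correction` and its step names are hypothetical labels, not part of the disclosure, and the sketch treats blocks 936 and 940 as alternatives, which is an assumption because the text does not state whether block 940 also follows block 936.

```python
def plan_correction(saturated, under_sequenced, n_labels, pseudopoint_threshold):
    """Return the ordered steps applied to one target, following FIG. 13."""
    steps = ["count_molecular_labels"]                 # block 908
    if saturated:                                      # decision block 912
        return steps + ["estimate_targets"]            # block 948, no adjustment
    steps.append("adjust_by_directional_adjacency")    # block 916
    if under_sequenced:                                # decision block 924
        return steps + ["estimate_targets"]            # block 948
    steps.append("filter_counts")                      # block 928
    if n_labels < pseudopoint_threshold:               # decision block 932
        steps.append("add_pseudopoints")               # block 936 (optional)
    else:
        steps.append("remove_non_unique_labels")       # block 940
    steps.append("estimate_initial_parameters")        # block 1304 (optional)
    steps.append("adjust_by_distribution_fit")         # block 944
    steps.append("estimate_targets")                   # block 948
    return steps


plan = plan_correction(saturated=False, under_sequenced=False,
                       n_labels=40, pseudopoint_threshold=10)
```

Returning a list of step names rather than calling real correction routines keeps the sketch independent of any particular implementation of the individual blocks.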

FIG. 14 is a flowchart showing a non-limiting exemplary embodiment of correcting PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction, using the molecular label with the second-highest count for the initial parameter estimates. The blocks of method 1400 (e.g., blocks 904-952) are described with reference to FIG. 9. Briefly, method 1400 starts at block 904 after sequencing data for a plurality of stochastically barcoded targets is received. In some embodiments, method 1400 further includes stochastically barcoding a plurality of targets using a plurality of stochastic barcodes to create the plurality of stochastically barcoded targets, wherein each of the plurality of stochastic barcodes comprises a molecular label. In some embodiments, method 1400 further includes sequencing the plurality of stochastically barcoded targets to obtain the sequencing data.

At block 908, for one or more of the plurality of targets, the number of molecular labels with distinct sequences associated with the target in the sequencing data can be counted. At decision block 912, whether the sequencing data has a saturated sequencing status can be determined. If the sequencing data does not have a saturated sequencing status at decision block 912, method 1400 proceeds to block 916, where the molecular label counts can be adjusted based on directional adjacency. In some embodiments, adjusting the molecular label counts based on directional adjacency is as described with reference to FIG. 7.

At decision block 924, if the sequencing status of the target in the sequencing data is not an under-sequencing status, method 1400 can proceed to block 928 to filter the molecular label counts. Filtering the molecular label counts can include determining, at decision block 932, whether the number of molecular labels with distinct sequences associated with the target in the sequencing data is less than a pseudopoint threshold.

At decision block 932, if the number of molecular labels with distinct sequences associated with the target in the sequencing data is less than the pseudopoint threshold, method 1400 optionally proceeds to block 936, where pseudopoints can be added to the number of molecular labels with distinct sequences associated with the target in the sequencing data before the number of noise molecular labels with distinct sequences associated with the target in the sequencing data is determined.

At block 944, the molecular label counts can be adjusted using a distribution-based error correction method. The initial parameters for the distribution-based error correction method can be based on counts of molecular labels. For example, the initial parameters (e.g., the mean and the dispersion) of one of the negative binomial distributions (e.g., the signal negative binomial distribution or the noise negative binomial distribution) can be based on the count of one molecular label, or on the mean of the counts of a number of molecular labels. The molecular label can be the molecular label with the second-highest count, or the molecular label at any other rank (e.g., the tenth-highest count). The rank of the molecular label can differ between implementations. In some embodiments, the rank can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. In some embodiments, the rank can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100. The number of molecular labels can differ between implementations. In some embodiments, the number of molecular labels can be, or be about, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or a number or a range between any two of these values. In some embodiments, the number of molecular labels can be at least, or at most, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.
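The rank-based choice of an initial parameter at block 944 can be sketched in Python as follows. The function `seed_mean`, its arguments, and the example label sequences are hypothetical; the sketch only shows how a count at a given rank, or the mean of several counts starting at that rank, might be picked as a starting mean.

```python
def seed_mean(label_counts, rank=2, n_labels=1):
    """Return a starting mean taken from the counts at and after a given rank.

    `label_counts` maps a molecular label sequence to its count.
    `rank` is 1-based (2 means the second-highest count); `n_labels` is the
    number of consecutive ranked labels whose counts are averaged.
    """
    ordered = sorted(label_counts.values(), reverse=True)
    if not ordered:
        raise ValueError("no molecular labels to rank")
    start = min(rank, len(ordered)) - 1  # clamp to the lowest-ranked label
    window = ordered[start:start + n_labels]
    return sum(window) / len(window)


counts = {"AAAA": 120, "AAAT": 95, "CCGT": 90, "GGTA": 2}
second_highest = seed_mean(counts)                    # count at rank 2
mean_of_two = seed_mean(counts, rank=2, n_labels=2)   # mean of ranks 2 and 3
```

Skipping the single highest count is one way to avoid seeding the fit with an outlier, which is a plausible motivation for the second-highest default, although the text does not state its rationale.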

At block 948, the number of targets can be estimated to generate an output after the adjacency-based error correction and the distribution-based error correction. At decision block 912, if the sequencing status of the target in the sequencing data is a saturated sequencing status, method 1400 proceeds to block 948 to generate the output without adjusting the molecular label counts based on directional adjacency or distribution-based error correction.

At decision block 924, if the sequencing status of the target in the sequencing data is an under-sequencing status, method 1400 proceeds to block 948 to generate the output without adjusting the molecular label counts based on distribution-based error correction. For example, the number of noise molecular labels determined can be zero. Method 1400 ends at block 952.

Sequencing
In some embodiments, estimating the number of stochastically barcoded targets can include determining the sequences of the labeled targets, the spatial labels, the molecular labels, the sample labels, the cell labels, or any products thereof (e.g., labeled amplicons or labeled cDNA molecules). An amplified target can be subjected to sequencing. Determining the sequence of a stochastically barcoded target or any product thereof can include conducting a sequencing reaction to determine the sequence of at least a portion of a sample label, a spatial label, a cell label, a molecular label, at least a portion of the stochastically barcoded target, a complement thereof, a reverse complement thereof, or any combination thereof.

The sequence of a stochastically barcoded target (e.g., an amplified nucleic acid, a labeled nucleic acid, or a cDNA copy of a labeled nucleic acid) can be determined using a variety of sequencing methods including, but not limited to, sequencing by hybridization (SBH), sequencing by ligation (SBL), quantitative incremental fluorescent nucleotide addition sequencing (QIFNAS), stepwise ligation and cleavage, fluorescence resonance energy transfer (FRET), molecular beacons, TaqMan reporter probe digestion, pyrosequencing, fluorescent in situ sequencing (FISSEQ), FISSEQ beads, wobble sequencing, multiplex sequencing, polymerized colony (POLONY) sequencing, nanogrid rolling circle sequencing (ROLONY), and allele-specific oligo ligation assays (e.g., oligo ligation assay (OLA), single template molecule OLA using a ligated linear probe and a rolling circle amplification (RCA) readout, ligated padlock probes, or single template molecule OLA using a ligated circular padlock probe and a rolling circle amplification (RCA) readout).

In some embodiments, determining the sequence of a stochastically barcoded target or any product thereof includes paired-end sequencing, nanopore sequencing, high-throughput sequencing, shotgun sequencing, dye-terminator sequencing, multiple-primer DNA sequencing, primer walking, Sanger dideoxy sequencing, Maxam-Gilbert sequencing, pyrosequencing, true single molecule sequencing, or any combination thereof. Alternatively, the sequence of the stochastically barcoded target or any product thereof can be determined by electron microscopy or by a chemical-sensitive field effect transistor (chemFET) array.

High-throughput sequencing methods, such as cyclic array sequencing using platforms such as Roche 454, Illumina Solexa, ABI-SOLiD, ION Torrent, Complete Genomics, Pacific Biosciences, Helicos, or the Polonator platform, can also be used. In some embodiments, sequencing can include MiSeq sequencing. In some embodiments, sequencing can include HiSeq sequencing.

The stochastically barcoded targets can comprise nucleic acids representing from about 0.01% of the genes of an organism's genome to about 100% of the genes of an organism's genome. For example, about 0.01% of the genes of an organism's genome to about 100% of the genes of an organism's genome can be sequenced by using a target complementary region comprising a plurality of multimers to capture the genes containing a complementary sequence from the sample. In some embodiments, the stochastically barcoded targets comprise nucleic acids representing from about 0.01% of the transcripts of an organism's transcriptome to about 100% of the transcripts of an organism's transcriptome. For example, about 0.501% of the transcripts of an organism's transcriptome to about 100% of the transcripts of an organism's transcriptome can be sequenced by using a target complementary region comprising a poly(T) tail to capture the mRNAs from the sample.

Determining the sequences of the spatial labels and the molecular labels of the plurality of stochastic barcodes can include sequencing 0.00001%, 0.0001%, 0.001%, 0.01%, 0.1%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 99%, 100%, or a number or a range between any two of these values, of the plurality of stochastic barcodes. Determining the sequences of the labels of the plurality of stochastic barcodes, for example the sample labels, the spatial labels, and the molecular labels, can include sequencing 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, 10¹⁰, 10¹¹, 10¹², 10¹³, 10¹⁴, 10¹⁵, 10¹⁶, 10¹⁷, 10¹⁸, 10¹⁹, 10²⁰, or a number or a range between any two of these values, of the plurality of stochastic barcodes. Sequencing some or all of the plurality of stochastic barcodes can include generating sequences with read lengths of about, at least, or at most 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, or a number or a range between any two of these values, of nucleotides or bases.

Sequencing can include sequencing at least, or at least about, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or more nucleotides or base pairs of the stochastically barcoded targets. For example, sequencing can include generating sequencing data with read lengths of 50, 75, 100, or more nucleotides by performing polymerase chain reaction (PCR) amplification on the plurality of stochastically barcoded targets. Sequencing can include sequencing at least, or at least about, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, or more nucleotides or base pairs of the stochastically barcoded targets. Sequencing can include sequencing at least, or at least about, 1,500, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, or more nucleotides or base pairs of the stochastically barcoded targets.

Sequencing can include at least about 200, 300, 400, 500, 600, 700, 800, 900, 1,000, or more sequencing reads per run. In some embodiments, sequencing can include at least, or at least about, 1,500, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, or more sequencing reads per run. Sequencing can include about 1,600,000,000 or fewer sequencing reads per run. Sequencing can include about 200,000,000 or fewer reads per run.

Samples
In some embodiments, the plurality of targets can be contained in one or more samples. A sample can comprise one or more cells, or nucleic acids from one or more cells. A sample can be a single cell or nucleic acids from a single cell. The one or more cells can be of one or more cell types. At least one of the one or more cell types can be a brain cell, heart cell, cancer cell, circulating tumor cell, organ cell, epithelial cell, metastatic cell, benign cell, primary cell, circulatory cell, or any combination thereof.

A sample for use in the methods of the present disclosure can comprise one or more cells. A sample can refer to one or more cells. In some embodiments, the plurality of cells can include one or more cell types. At least one of the one or more cell types can be a brain cell, heart cell, cancer cell, circulating tumor cell, organ cell, epithelial cell, metastatic cell, benign cell, primary cell, circulatory cell, or any combination thereof. In some embodiments, the cells are cancer cells excised from a cancerous tissue, for example, breast cancer, lung cancer, colon cancer, prostate cancer, ovarian cancer, pancreatic cancer, brain cancer, melanoma, and non-melanoma skin cancers. In some instances, the cells are derived from a cancer but collected from a bodily fluid (e.g., circulating tumor cells). Non-limiting examples of cancers include adenoma, adenocarcinoma, squamous cell carcinoma, basal cell carcinoma, small cell carcinoma, large cell undifferentiated carcinoma, chondrosarcoma, and fibrosarcoma. A sample can include a tissue, a cell monolayer, fixed cells, a tissue section, or any combination thereof. A sample can include a biological sample, a clinical sample, an environmental sample, a biological fluid, a tissue, or cells from a subject. A sample can be obtained from a human, a mammal, a dog, a rat, a mouse, a fish, a fly, a worm, a plant, a fungus, a bacterium, a virus, a vertebrate, or an invertebrate.

In some embodiments, the cells are cells that have been infected with a virus and contain viral oligonucleotides. In some embodiments, the viral infection can be caused by a virus such as a single-stranded (+ strand or "sense") DNA virus (e.g., a parvovirus) or a double-stranded RNA virus (e.g., a reovirus). In some embodiments, the cells are bacteria. These can include either gram-positive or gram-negative bacteria. In some embodiments, the cells are fungi. In some embodiments, the cells are protozoans or other parasites.

As used herein, the term "cell" can refer to one or more cells. In some embodiments, the cells are normal cells, for example, human cells at different stages of development, or human cells from different organs or tissue types. In some embodiments, the cells are non-human cells, for example, other types of mammalian cells (e.g., mouse, rat, pig, dog, cow, or horse). In some embodiments, the cells are other types of animal or plant cells. In other embodiments, the cells can be any prokaryotic or eukaryotic cells.

Cells can be sorted before being associated with beads. For example, the cells can be sorted by fluorescence-activated cell sorting or magnetic-activated cell sorting, or more generally by flow cytometry. The cells can be filtered by size. In some embodiments, the retentate contains the cells to be associated with the beads. In some embodiments, the flow-through contains the cells to be associated with the beads.

A sample can refer to a plurality of cells. A sample can refer to a monolayer of cells. A sample can refer to a thin section (e.g., a tissue thin section). A sample can refer to a solid or semi-solid collection of cells that can be placed in a one-dimensional array.

Data Analysis and Display Software
Data Analysis and Visualization of the Spatial Resolution of Targets
The present disclosure provides methods for estimating the number and position of targets with stochastic barcoding and digital counting using spatial labels. The data obtained from the methods of the present disclosure can be visualized on a map. A map of the number and location of targets from a sample can be constructed using information generated using the methods described herein. The map can be used to determine the physical location of a target. The map can be used to identify the location of multiple targets. The multiple targets can be the same species of target, or the multiple targets can be multiple different targets. For example, a map of a brain can be constructed to show the digital counts and locations of multiple targets.

A map can be generated from the data of a single sample. A map can be constructed using data from multiple samples, thereby generating a combined map. A map can be constructed with data from tens, hundreds, and/or thousands of samples. A map constructed from multiple samples can show the distribution of digital counts of targets associated with regions common to the multiple samples. For example, replicate assays can be displayed on the same map. At least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more replicates can be displayed (e.g., overlaid) on the same map. At most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 replicates can be displayed (e.g., overlaid) on the same map. The spatial distribution and number of targets can be represented by a variety of statistics.
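Overlaying replicates on one map can be sketched in Python as follows. The function `overlay_replicates`, the `(x, y)` position keys, and the choice of the mean as the summary statistic are assumptions for illustration; the text leaves the statistic open.

```python
from collections import defaultdict
from statistics import mean


def overlay_replicates(replicates):
    """Combine per-replicate maps into one map of mean digital counts.

    Each replicate is a dict mapping an (x, y) position to the digital count
    of a target at that position. Positions missing from a replicate are
    simply not included in that position's mean.
    """
    by_position = defaultdict(list)
    for replicate in replicates:
        for position, count in replicate.items():
            by_position[position].append(count)
    return {position: mean(values) for position, values in by_position.items()}


replicate_1 = {(0, 0): 4, (0, 1): 2}
replicate_2 = {(0, 0): 6}
combined = overlay_replicates([replicate_1, replicate_2])
```

Replacing `mean` with another function from the `statistics` module, such as `median`, would give a different summary of the same overlaid data.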

Combining data from multiple samples can increase the locational resolution of the combined map. The orientation of multiple samples can be registered by common landmarks, where the individual locational measurements across samples are at least in part non-contiguous. A particular example is sectioning a sample using a microtome on one axis and then sectioning a second sample along a different axis. The combined dataset will give three-dimensional spatial locations associated with the digital counts of targets. Multiplexing the above approach will allow for high-resolution three-dimensional maps of digital counting statistics.

In some embodiments of the instrument system, the system will comprise computer-readable media that include code for providing data analysis of the sequence datasets generated by performing single cell stochastic barcoding assays. Examples of data analysis functionality that can be provided by the data analysis software include, but are not limited to, (i) algorithms for decoding/demultiplexing the sample label, cell label, spatial label, molecular label, and target sequence data provided by sequencing the stochastic barcode library created in running the assay, (ii) algorithms for determining the number of reads per gene per cell and the number of unique transcript molecules per gene per cell, (iii) statistical analysis of the sequencing data, e.g., for clustering cells by gene expression data or for predicting confidence intervals for determinations of the number of transcript molecules per gene per cell, (iv) algorithms for identifying subpopulations of rare cells, for example, using principal component analysis, hierarchical clustering, k-means clustering, self-organizing maps, or neural networks, (v) sequence alignment capabilities for aligning gene sequence data with known reference sequences and for detecting mutations, polymorphic markers, and splice variants, and (vi) automated clustering of molecular labels to compensate for amplification or sequencing errors. In some embodiments, commercially available software can be used to perform all or a portion of the data analysis. For example, the Seven Bridges (https://www.sbgenomics.com/) software can be used to compile tables of the number of copies of one or more genes occurring in each cell for the entire collection of cells. In some embodiments, the data analysis software can include options for outputting the sequencing results in useful graphical formats, e.g., heatmaps that indicate the number of copies of one or more genes occurring in each cell of a collection of cells. In some embodiments, the data analysis software can further comprise algorithms for extracting biological meaning from the sequencing results, for example, by correlating the number of copies of one or more genes occurring in each cell of a collection of cells with a type of cell, a type of rare cell, or a cell derived from a subject having a specific disease or condition. In some embodiments, the data analysis software can further comprise algorithms for comparing populations of cells across different biological samples.
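Item (ii) above, the number of reads and the number of unique transcript molecules per gene per cell, can be illustrated with a minimal Python sketch. The function `count_table` and the record layout `(cell_label, gene, molecular_label)` are assumptions for illustration, and the sketch counts distinct molecular labels as observed, without any of the error correction described elsewhere in this disclosure.

```python
from collections import defaultdict


def count_table(records):
    """Return {(cell_label, gene): (read_count, unique_molecular_labels)}.

    `records` is an iterable of (cell_label, gene, molecular_label) tuples,
    one tuple per decoded sequencing read.
    """
    reads = defaultdict(int)
    labels = defaultdict(set)
    for cell_label, gene, molecular_label in records:
        reads[(cell_label, gene)] += 1
        labels[(cell_label, gene)].add(molecular_label)
    return {key: (reads[key], len(labels[key])) for key in reads}


records = [
    ("cell1", "GAPDH", "AAAC"),
    ("cell1", "GAPDH", "AAAC"),  # second read of the same molecule
    ("cell1", "GAPDH", "TTGA"),
    ("cell2", "GAPDH", "CCGT"),
]
table = count_table(records)
```

A table of this shape is also the kind of input from which a per-cell, per-gene heatmap could be drawn.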

In some embodiments, all of the data analysis functionality can be packaged within a single software package. In some embodiments, the complete set of data analysis capabilities can comprise a suite of software packages. In some embodiments, the data analysis software can be a standalone package that is made available to users independently of the assay instrument system. In some embodiments, the software can be web-based and can allow users to share data.

System Processors and Networks
In general, a computer or processor suitable for use in the methods of the presently disclosed instrument systems may be further understood as a logical apparatus that can read instructions from media 1511 or a network port 1505, which can optionally be connected to a server 1509 having fixed media 1512, as shown in FIG. 15. The system 1500, as shown in FIG. 15, may include a CPU 1501, a disk drive 1503, optional input devices such as a keyboard 1515 or mouse 1516, and an optional monitor 1507. Data communication can be achieved through the indicated communication medium to a server at a local or remote location. The communication medium may include any means of transmitting and receiving data. For example, the communication medium can be a network connection, a wireless connection, or an internet connection. Such a connection can provide for communication over the World Wide Web. Data relating to the present disclosure can be transmitted over such networks or connections for reception or review by a party 1522, as shown in FIG. 15.

FIG. 16 shows an exemplary embodiment of a first example architecture of a computer system 1600 that can be used in connection with example embodiments of the present disclosure. As shown in FIG. 16, the example computer system may include a processor 1602 for processing instructions. Non-limiting examples of processors include the Intel Xeon™ processor, AMD Opteron™ processor, Samsung 32-bit RISC ARM 1176JZ(F)-S v1.0™ processor, ARM Cortex-A8 Samsung S5PC100™ processor, ARM Cortex-A8 Apple A4™ processor, Marvell PXA 930™ processor, or a functionally equivalent processor. Multiple threads of execution can be used for parallel processing. In some embodiments, multiple processors or processors with multiple cores can also be used, whether in a single computer system, in a cluster, or distributed across systems over a network comprising a plurality of computers, cell phones, or personal digital assistant devices.
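The use of multiple threads of execution for parallel processing can be sketched as follows. This is only an illustration under stated assumptions: the per-gene task (counting unique molecular labels), the function name `unique_label_count`, the example data, and the choice of Python's standard `concurrent.futures` thread pool are all assumptions of this sketch and are not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def unique_label_count(labels):
    """Count the distinct molecular labels observed for one gene."""
    return len(set(labels))


# Made-up molecular labels grouped by gene (illustrative only).
labels_by_gene = {
    "GAPDH": ["ACGGTCAT", "ACGGTCAT", "TTGACCGA"],
    "ACTB": ["CCATGGTA", "CCATGGTA"],
}

# Each gene is handled by a worker thread; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(unique_label_count, labels_by_gene.values())
    counts = dict(zip(labels_by_gene, results))
```

In standard CPython builds, threads do not speed up CPU-bound pure-Python work because of the global interpreter lock, so a process pool or a multi-machine arrangement such as the distributed systems described here may be more appropriate for heavy computation.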

As shown in FIG. 16, a high-speed cache 1604 can be connected to, or incorporated in, the processor 1602 to provide high-speed memory for instructions or data that have been recently, or are frequently, used by the processor 1602. The processor 1602 can be connected to a north bridge 1606 by a processor bus 1608. The north bridge 1606 is connected to random access memory (RAM) 1610 by a memory bus 1612 and manages access to the RAM 1610 by the processor 1602. The north bridge 1606 can also be connected to a south bridge 1614 by a chipset bus 1616. The south bridge 1614 is, in turn, connected to a peripheral bus 1618. The peripheral bus can be, for example, PCI, PCI-X, PCI Express, or another peripheral bus. The north bridge and south bridge are often referred to as a processor chipset and manage data transfer between the processor, RAM, and peripheral components on the peripheral bus 1618. In some alternative architectures, the functionality of the north bridge can be incorporated into the processor instead of using a separate north bridge chip.

In some embodiments, the system 1600 may include an accelerator card 1622 attached to the peripheral bus 1618. The accelerator may include field programmable gate arrays (FPGAs) or other hardware for accelerating certain processing. For example, an accelerator can be used for adaptive data restructuring or to evaluate algebraic expressions used in extended set processing.

Software and data are stored in external storage 1624 and can be loaded into RAM 1610 or cache 1604 for use by the processor. The system 1600 includes an operating system for managing system resources. Non-limiting examples of operating systems include Linux®, Windows®, MACOS™, BlackBerry OS™, iOS™, and other functionally equivalent operating systems, as well as application software running on top of the operating system for managing data storage and optimization in accordance with example embodiments of the present invention.

In this example, the system 1600 also includes network interface cards (NICs) 1620 and 1621 connected to the peripheral bus, which provide network interfaces to external storage, such as network attached storage (NAS), and to other computer systems that can be used for distributed parallel processing.

FIG. 17 shows an exemplary diagram of a network 1700 suitable for use in the methods of the present disclosure, including a plurality of computer systems 1702a and 1702b, a plurality of cell phones and personal digital assistants 1702c, and network attached storage (NAS) 1704a and 1704b. In example embodiments, systems 1712a, 1712b, and 1712c can manage data storage and optimize data access for data stored in network attached storage (NAS). A mathematical model can be used for the data and can be evaluated using distributed parallel processing across computer systems 1712a and 1712b and cell phone and personal digital assistant systems 1712c. Computer systems 1712a and 1712b and cell phone and personal digital assistant systems 1712c can also provide parallel processing for adaptive data restructuring of the data stored in network attached storage (NAS) 1714a and 1714b. FIG. 17 shows an example only, and a wide variety of other computer architectures and systems can be used in connection with the various embodiments of the present invention. For example, a blade server can be used to provide parallel processing. Processor blades can be connected through a backplane to provide parallel processing. Storage can also be connected to the backplane or can exist as network attached storage (NAS) through a separate network interface.

In some example embodiments, processors can maintain separate memory spaces and transmit data through network interfaces, a backplane, or other connectors for parallel processing by other processors. In other embodiments, some or all of the processors can use a shared virtual address memory space.

FIG. 18 shows an exemplary block diagram of a multiprocessor computer system 1800 that uses a shared virtual address memory space in accordance with an example embodiment. The system includes a plurality of processors 1802a-f that can access a shared memory subsystem 1804. The system incorporates a plurality of programmable hardware memory algorithm processors (MAPs) 1806a-f in the memory subsystem 1804. Each MAP 1806a-f may include a memory 1808a-f and one or more field programmable gate arrays (FPGAs) 1810a-f. The MAP provides a configurable functional unit, and particular algorithms or portions of algorithms can be provided to the FPGAs 1810a-f for processing in close coordination with a respective processor. For example, the MAPs can be used to evaluate algebraic expressions regarding the data model and to perform adaptive data restructuring in example embodiments. In this example, each MAP is globally accessible by all of the processors for these purposes. In one configuration, each MAP can use direct memory access (DMA) to access its associated memory 1808a-f, allowing it to execute tasks independently of, and asynchronously from, the respective microprocessor 1802a-f. In this configuration, a MAP can feed results directly to another MAP for pipelining and parallel execution of algorithms.

The above computer architectures and systems are examples only, and a wide variety of other computer, cell phone, and personal digital assistant architectures and systems can be used in connection with example embodiments, including systems using any combination of general processors, co-processors, FPGAs and other programmable logic devices, systems on chips (SOCs), application specific integrated circuits (ASICs), and other processing and logic elements. In some embodiments, all or part of the computer system can be implemented in software or hardware. Any variety of data storage media can be used in connection with example embodiments, including random access memory, hard drives, flash memory, tape drives, disk arrays, network attached storage (NAS), and other local or distributed data storage devices and systems.

In example embodiments, the computer subsystem of the present disclosure can be implemented using software modules executing on any of the above or other computer architectures and systems. In other embodiments, the functions of the system can be implemented partially or completely in firmware, in programmable logic devices such as field programmable gate arrays (FPGAs), in systems on chips (SOCs), in application specific integrated circuits (ASICs), or in other processing and logic elements. For example, the set processor and optimizer can be implemented with hardware acceleration using a hardware accelerator card.

Some aspects of the embodiments discussed above are disclosed in further detail in the following examples, which are not in any way intended to limit the scope of the present disclosure.

Example 1
Correction of Single-Base Substitution Errors
This example demonstrates the correction of PCR or sequencing errors involving single-base substitutions. Such errors were removed by attributing copies of a target that had similar molecular labels and ≤7 occurrences (i.e., sequencing reads) when 3⁸ unique stochastic barcodes were present (17 when 4⁸ unique stochastic barcodes were present) as having the same molecular label of the plurality of targets.
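The correction step described in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the example read counts, and the rule of merging a low-count molecular label into its most abundant neighbor that differs by exactly one base are hypothetical choices made for the sketch. Only the notion of similar labels (one base apart) and the threshold of ≤7 reads come from the text.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))


def collapse_similar_labels(read_counts, max_reads=7):
    """Merge low-count molecular labels into a similar, more abundant label.

    A label with at most `max_reads` reads that is one base away from a
    label with more reads is attributed to that label. Merging into the
    most abundant neighbor is an assumption of this sketch.
    """
    corrected = Counter(read_counts)
    # Visit labels from fewest to most reads so that chains collapse upward.
    for label in sorted(read_counts, key=lambda l: (read_counts[l], l)):
        if corrected[label] > max_reads:
            continue
        parents = [
            other for other in corrected
            if len(other) == len(label)
            and hamming(label, other) == 1
            and corrected[other] > corrected[label]
        ]
        if parents:
            parent = max(parents, key=lambda l: (corrected[l], l))
            corrected[parent] += corrected.pop(label)
    return corrected


# Made-up read counts per molecular label for one target (illustrative only).
observed = {"ACGTACGT": 120, "ACGTACGA": 3, "TTGCAAGC": 45, "TTGCAAGG": 9}
corrected = collapse_similar_labels(observed)
```

In this made-up example, the label with 3 reads is merged into its one-base neighbor, whereas the label with 9 reads exceeds the threshold and is kept as a separate molecular label.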

The stochastic barcoding step can include labeling mRNAs having poly(A) tails in a sample, prior to the RT step, using a non-depleting pool of 3⁸ (6561) unique stochastic barcodes having oligo(dT) as the binding region. The labeling step can be random, and each target molecule can hybridize to one stochastic barcode. For any target, if the number of target molecules is much smaller than the number of stochastic barcodes, each target molecule will likely hybridize to a different stochastic barcode. Thus, if only a few target molecules are present, it is unlikely that those few target molecules would hybridize to stochastic barcodes with similar molecular labels (MLs) during hybridization.

The probability of sampling at least one pair of stochastic barcodes with similar molecular labels from the 3⁸ non-depleting unique stochastic barcodes was calculated. Two molecular labels can have similar sequences if they differ by one base. This sampling event can be considered sampling with replacement because the stochastic barcodes can be effectively non-depleting. This probability can help to estimate the fewest stochastic barcodes with similar molecular labels that would exist for a given sample containing a plurality of targets. The problem can be formulated as the number of stochastic barcodes needed for at least two stochastic barcodes with similar molecular labels to be selected with a given probability. Equivalently, it can be formulated as the minimum sample size needed for the probability of two stochastic barcodes having similar sequences to exceed 0.5, given 3⁸ distinguishable molecular labels. The problem can therefore be regarded as a generalization of the classical birthday problem, which determines the minimum sample size needed for the probability that two people share a birthday to exceed 0.5, given 365 different birthdays.

To obtain this sample size r, r stochastic barcodes sampled from the 3⁸ unique stochastic barcodes were assumed, and the probability of having at least one pair of similar molecular labels was calculated from the probability of the complementary event. If only one stochastic barcode is randomly selected from the 3⁸ unique stochastic barcodes, the probability that its molecular label is not similar to the molecular label of any other stochastic barcode is p₁ = 1, because there is only one stochastic barcode. If a second stochastic barcode is also randomly selected from the 3⁸ unique stochastic barcodes, the probability that its molecular label is not similar to the molecular label of the first stochastic barcode is p₂ = (3⁸ - 16 - 1)/3⁸. This is because, assuming three possible bases at each position of the stochastic barcode, each of the eight base positions of a given molecular label has two possible alternative nucleotides, giving a total of 2 × 8 = 16 single-base variants, and the given molecular label itself is also excluded. If a third stochastic barcode is subsequently drawn at random from the 3⁸ unique stochastic barcodes having unique molecular labels, the probability that its molecular label is not similar to the previous two molecular labels is p₃ = (3⁸ - 1 - 16 - 1 - 16)/3⁸ = (3⁸ - 2 × 17)/3⁸. Stochastic barcodes can be drawn successively from the 3⁸ unique stochastic barcodes up to the r-th stochastic barcode. The probability that this last stochastic barcode is not similar to any of the previous stochastic barcodes is p_r = (3⁸ - (r - 1) × 17)/3⁸. Because all r stochastic barcodes were drawn independently, the probability of drawing stochastic barcodes none of which have similar sequences is P(all molecular labels without similar sequences) = p₁ × p₂ × p₃ × ... × p_r. Therefore, the probability of having at least one pair of similar stochastic barcodes among r stochastic barcodes drawn from the 3⁸ stochastic barcodes having unique molecular labels was P(at least one pair of molecular labels with similar sequences) = 1 - P(all molecular labels without similar sequences). The sample size r was then calculated from this equation by setting P(at least one pair of molecular labels with similar sequences) to a desired value, such as 0.01, 0.05, or 0.1.
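The calculation above can be sketched in Python as follows. The function names and default arguments are assumptions of this sketch. The code evaluates the product p₁ × p₂ × ... × p_r exactly as written in the text, where each previously drawn label excludes itself and its 16 single-base variants (17 labels in total), and it does not attempt to reproduce the values in Table 1.

```python
def prob_similar_pair(r, n_labels=3**8, neighborhood=17):
    """P(at least one pair of similar molecular labels among r draws).

    Implements 1 - p_1 * p_2 * ... * p_r with
    p_k = (n_labels - (k - 1) * neighborhood) / n_labels.
    """
    p_none = 1.0
    for k in range(r):
        remaining = n_labels - k * neighborhood
        if remaining <= 0:
            return 1.0  # no dissimilar label is left to draw
        p_none *= remaining / n_labels
    return 1.0 - p_none


def min_sample_size(target, n_labels=3**8, neighborhood=17):
    """Smallest r for which the probability exceeds `target`."""
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    r = 1
    while prob_similar_pair(r, n_labels, neighborhood) <= target:
        r += 1
    return r


# Classical birthday problem as a sanity check: 365 days, exact matches only.
birthday_r = min_sample_size(0.5, n_labels=365, neighborhood=1)  # 23

# Generalized problem from the text: 3**8 labels, 17 excluded per drawn label.
barcode_r = min_sample_size(0.5)  # 24
```

Setting `neighborhood=1` recovers the classical birthday problem. For 4⁸ unique molecular labels with four possible bases per position, the corresponding neighborhood would be 3 × 8 + 1 = 25, which is an inference from the same counting argument rather than a value stated in the text.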

Table 1 shows the probability of having at least one similar pair among r molecular labels, assuming 3⁸ or 4⁸ unique molecular labels. With 3⁸ unique probability barcodes, if ≤7 probability barcodes are selected (≤17 with 4⁸ unique probability barcodes), the probability of observing a pair of probability barcodes with similar molecular labels is less than 0.05, which is negligible. Given this small probability, similar molecular labels observed in such cases are more likely to be artifacts than a genuine chance selection of similar probability barcodes, and can therefore be corrected.

However, when more than 7 to 24 probability barcodes are present, the probability of observing one or more pairs of probability barcodes with similar molecular labels becomes high (for example, 0.5). It therefore cannot be confidently ruled out that these probability barcodes are true rather than artifacts. By contrast, common intuition would wrongly conclude that, if only 24 probability barcodes were drawn from a large pool of 6561 unique possibilities, any one-base difference must be the result of a sequencing error rather than chance.

For example, if 115 probability barcodes are randomly sampled, the calculated probability is 1, so it is 100% certain that at least one pair of probability barcodes with similar molecular labels exists. Assuming 115 targets in the sample, after hybridization and reverse transcription, two pairs of probability barcodes with similar molecular labels and 111 probability barcodes with dissimilar molecular labels (115 probability barcodes in total) may be observable. However, if three pairs of probability barcodes with similar molecular labels and 110 probability barcodes with dissimilar molecular labels (116 probability barcodes in total) are observed in the sequencing data, only two of the pairs with similar molecular labels are true, and the third pair may have been generated by some error. This 100% probability indicates that the event of observing at least one pair of probability barcodes with similar molecular labels can occur when 115 probability barcodes are randomly sampled during the probability barcoding step; it does not mean that every observed pair of similar molecular labels is true. Probability barcodes with similar molecular labels can arise either from the probability barcoding step (real or true molecular labels) or from PCR errors, artifacts, or sequencing errors (erroneous or false molecular labels). Therefore, when similar molecular labels are observed, further evaluation is needed to determine whether a particular pair of molecular labels is true. In addition, when the total molecular label diversity is increased from 3⁸ to 4⁸, more probability barcodes are required, at each probability level, before a similar pair of molecular labels is expected.

Tables 2 and 3 show that, when ≤7 probability barcodes with unique molecular labels were observed, similar molecular labels were very unlikely to occur by chance, since the probability of such an occurrence was less than 0.05. Such similar molecular labels were therefore likely caused by PCR errors, artifacts, or sequencing errors, and should be removed from the molecular label count in order to correct or adjust it. Accordingly, the total number of true molecular labels in Tables 2 and 3 can be reduced from 5 to 1 and from 7 to 6, respectively. In Table 4, however, 23 unique barcodes were observed, for which the chance of having at least one pair of probability barcodes with similar molecular labels is predicted to be about 50%. The 16 pairs of probability barcodes with similar molecular labels may therefore be real, and each pair requires further evaluation to confirm whether it is real.

Overall, these data show that the number of observed probability barcodes with similar molecular labels was reduced by removal, because those probability barcodes were probably generated by PCR errors, artifacts, or sequencing errors.

Example 2
Determining the Quality Status of Targets in Sequencing Data
This example shows the process of determining whether the quality status of a target in sequencing data is complete sequencing, incomplete sequencing, or saturated sequencing. The quality status of a target depends on whether all of its true or real molecular labels were observed.

As shown in Example 1, complete counting of the probability barcodes with unique molecular labels present in a library can depend heavily on sequencing depth. The deeper the sequencing, the more likely it is that all true molecular labels are observed. Shallow sequencing is cheaper but can miss many molecular labels and may also compromise gene detection sensitivity. Complete sequencing means that all true molecular labels of the probability barcodes used to label the target molecules have been observed, whereas incomplete sequencing means that only some of the true molecular labels have been observed. It is also possible that more than 48568 target molecules were present in the starting sample (this is the lower bound of the number of molecules after Poisson correction or adjustment, based on 6561 − 2 × standard deviation distinguishable probability barcodes). In that case, saturated sequencing can occur, in which the number of target molecules is difficult to determine because of the limit on total molecular label diversity. However, when a small amount of RNA is used as input for probability barcoding, saturated sequencing is unlikely.

To define complete and incomplete sequencing mathematically, each was compared with a theoretical, error-free model. Under perfect experimental conditions, each copy of a target molecule in the starting sample can generate (1 + C)^j copies, assuming j PCR cycles and an efficiency C in each cycle. For each barcoded molecule in the starting sample, Illumina sequencing can be regarded as Poisson sampling from the (1 + C)^j clonal copies amplified from the original barcoded molecule. In theory, for the same target gene, sequencing of k probability-barcoded target molecules can be regarded as repeated Poisson sampling from (1 + C)^j copies, since all probability-barcoded molecules can be equally represented after PCR. A key assumption of the Poisson model is that the mean equals the variance, so the sequencing reads should be equidispersed. Dispersion can be defined as variance/mean.

In practice, complete sequencing is usually accompanied by errors clustered at much lower read frequencies. Unlike true molecular labels, errors are unlikely to have taken part in every PCR cycle and therefore have fewer copies, which makes the variation in read frequency much larger than under a Poisson model. FIGS. 19A-19B show examples of completely and incompletely sequenced genes. In FIG. 19A, the maximum sequencing read count was more than 350 times the minimum. Complete sequencing therefore tends to show a dispersion index larger than that of a Poisson distribution (>1).

In contrast, with incomplete sequencing only some of the probability barcodes with true molecular labels in the library are sequenced, so the variation in sequencing reads is smaller than under a Poisson model. In FIG. 19B, the maximum sequencing read count was only about three times the minimum. Incomplete sequencing therefore tends to show a dispersion index smaller than that of a Poisson distribution (<1).

Besides the dispersion index, the sequencing read count of the most abundant molecular label can be used to determine whether sequencing is complete. For example, if the most abundant molecular label has 25 reads and the dispersion index is 5, the sequencing status can be classified as complete; otherwise it can be classified as incomplete. The threshold of 25 reads can be used because sequencing is likely to be incomplete until sequencing errors begin to appear: once any molecular label has been observed more than 25 times, sequencing errors are likely to have been generated.
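The two criteria described above (dispersion index and the read count of the most abundant molecular label) can be combined in a small classifier. This is only an illustrative sketch: the function names are hypothetical, the thresholds are treated as "at least 25 reads" and "at least 5", which is an assumption about how the example values are meant to be applied, and the population variance is used for the dispersion index.

```python
from statistics import mean, pvariance


def dispersion_index(reads):
    """Variance / mean of the per-label read counts (population variance)."""
    return pvariance(reads) / mean(reads)


def classify_sequencing(reads, min_top_reads=25, min_dispersion=5.0):
    """Label a gene 'complete' or 'incomplete' from its per-label reads.

    Assumption: both thresholds are applied as lower bounds.
    """
    is_complete = (max(reads) >= min_top_reads
                   and dispersion_index(reads) >= min_dispersion)
    return "complete" if is_complete else "incomplete"


if __name__ == "__main__":
    # Made-up read counts: a few well-amplified labels plus low-read errors
    print(classify_sequencing([400, 350, 300, 2, 1, 1, 1]))
    # Made-up read counts: shallow sequencing, all labels seen a few times
    print(classify_sequencing([3, 2, 2, 1, 1, 1]))
```

Saturated sequencing is not handled here, since it depends on comparing the number of observed labels with the total label diversity.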

When the sequencing data for a highly abundant gene are saturated in probability barcodes, for example more than 6557 in the case of 3⁸ probability barcodes with unique molecular labels, the sequencing information of other, lower-expressed genes in the same well can be used to calculate the dispersion index and maximum sequencing read count for that gene. For example, if the second most abundant gene in the same well is not saturated in probability barcodes and is classified as incompletely sequenced, the saturation of the first gene can be considered real, and its number of molecules cannot be calculated. If instead the second most abundant gene is classified as completely sequenced, the saturation of the first gene may be artificial, and the appearance of all probability barcodes may be due to errors. A Poisson model-based thresholding algorithm can then be used to determine the number of true molecular labels.

Overall, these data show the process of determining whether the sequencing status is complete sequencing, incomplete sequencing, or saturated sequencing.

Example 3
Correction of PCR or Sequencing Errors from One-Base Substitutions in Completely Sequenced Genes
This example shows the process of correcting PCR or sequencing errors caused by one-base substitutions for completely sequenced genes, that is, genes with a quality status of complete sequencing in the sequencing data. This example also shows the process of thresholding the molecular labels of a target, for example a gene, in order to determine the true and false molecular labels associated with the target in the sequencing data.

The sequencing error rate per nucleotide can vary from 0.1% to 1%, and errors usually appear as low-frequency reads. As sequencing goes deeper, more sequencing errors can be generated. For example, if the true per-nucleotide sequencing error rate is 0.5% and a molecular label is sequenced 100 times, the expected number of sequencing errors associated with this molecular label is about 4 for a molecular label 8 nucleotides long, calculated as 100 × (1 − (1 − 0.5%)⁸). If the molecular label is sequenced 300 times, the expected number of sequencing errors is about 12. These sequencing errors can generate artificial molecular label sequences that inflate the count. Such molecular labels can be removed before further analysis.
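The arithmetic in this paragraph can be reproduced with a one-line function. This is a small worked example; the function name `expected_error_reads` is hypothetical, and the defaults are the illustrative values from the text (0.5% per-nucleotide error rate, 8-nucleotide label).

```python
def expected_error_reads(n_reads, error_rate=0.005, label_length=8):
    """Expected number of reads of a label that contain at least one error.

    Each read is error-free with probability (1 - error_rate) ** label_length.
    """
    p_read_has_error = 1 - (1 - error_rate) ** label_length
    return n_reads * p_read_has_error


if __name__ == "__main__":
    print(round(expected_error_reads(100), 2))  # about 4
    print(round(expected_error_reads(300), 2))  # about 12
```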

Among all sequencing errors, one-base errors can occur far more often than errors two or more bases apart. The probability of a one-base sequencing error can be derived from a binomial distribution with sample size 8 and a success probability equal to the per-base sequencing error rate. One goal was to correct one-base sequencing errors. A one-base sequencing error can be regarded as a child of the most abundant and closest (for example, in Hamming distance) molecular label, that is, its parent molecular label. Sequencing errors were detected by finding the true children of a parent molecular label (that is, child molecular labels one base away from the parent molecular label).

Selection of Parent and Child Molecular Labels
A parent molecular label can be required to have >25 sequencing reads, and a child molecular label to have 3 or fewer sequencing reads. These requirements are based on the following reasoning. Assume that the probability of a sequencing error per nucleotide is 0.5%. If a molecular label is sequenced 25 times, producing 200 nucleotides in total, one nucleotide is expected to be in error, since 200 × 0.005 = 1. Each molecular label with 25 sequencing reads is therefore expected to have at least one child molecular label, so a parent molecular label can be assumed to need 25 sequencing reads. A child molecular label with 4 sequencing reads is unlikely to be a sequencing error, because the probability of introducing the same error four times into one molecular label is 8 × 0.005⁴ = 5 × 10⁻⁹. If there were 10⁶ sequencing error reads in total, the expected number of sequencing errors repeated four times would be 5 × 10⁻⁹ × 10⁶ = 0.005, which is negligible. A child molecular label should therefore have ≤3 reads.

Given a parent molecular label and its associated child molecular labels one base away, how is it determined which child molecular labels are true sequencing errors of the parent?
Given a parent molecular label with R_par sequencing reads and a set of child molecular labels, each differing from the parent molecular label by one base, with sequencing reads (R_child1, R_child2, ..., R_childm), a multiple binomial test can be used to identify the true child molecular labels. Under the null hypothesis, the abundance of a true child molecular label should be no more than R_par × p (mathematically, H₀: p ≤ e/2); otherwise, the conclusion supports the alternative hypothesis that the abundance is greater than R_par × p (H_A: p > e/2), and the hypothesis that the molecular label was a true child molecular label can be rejected. The probability that a child molecular label differing from the parent molecular label by one base is observed once is p = e/2. Mathematically, the probability p_child of observing this child molecular label at least R_child times out of the total abundance (R_child + R_par) is then:
$$p_{child} = \sum_{i=R_{child}}^{R_{child}+R_{par}} \binom{R_{child}+R_{par}}{i}\, p^{i}\,(1-p)^{R_{child}+R_{par}-i}$$

If a child molecular label is in fact a sequencing error of its parent molecular label, the probability p_child should be greater than the 5% critical value. Because multiple hypotheses are tested simultaneously, the critical value used to reject the null hypothesis can be determined by controlling the false discovery rate (FDR) at the 5% level; if p_child is greater than the critical value at the 5% FDR level, the null hypothesis can be accepted. With the FDR controlled at 5%, the unadjusted p-values are sorted in ascending order, for example p₁ ≤ p₂ ≤ ... ≤ p_m, and each test is assigned its corresponding rank j. If p_child > j/m × 5%, the null hypothesis that this child molecular label was a one-base sequencing error of the parent molecular label can be accepted.
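The binomial tail probability and the FDR-controlled decision can be sketched as follows. This is a hedged illustration, not the disclosed implementation: the function names are hypothetical, p = e/2 is taken with an assumed per-nucleotide error rate e = 0.005, and the FDR step uses the standard Benjamini-Hochberg step-up rule, in which a null hypothesis is rejected when its p-value is at or below (j/m) × α for the largest qualifying rank. Children whose null hypothesis is not rejected are treated as one-base errors of the parent.

```python
from math import comb


def binomial_tail(r_child, r_par, p):
    """P(X >= r_child) for X ~ Binomial(r_child + r_par, p).

    Direct summation; intended for read counts up to roughly a thousand.
    """
    n = r_child + r_par
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(r_child, n + 1))


def bh_not_rejected(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns one boolean per input p-value: True where the null hypothesis
    is NOT rejected (i.e. the child is kept as a candidate error).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0  # largest rank j with p_(j) <= j / m * alpha
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank
    rejected = set(order[:largest_rank])
    return [i not in rejected for i in range(m)]


if __name__ == "__main__":
    p = 0.005 / 2  # assumed e = 0.005
    p_values = [binomial_tail(r, 100, p) for r in (1, 2, 3)]
    print(p_values, bh_not_rejected(p_values))
```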

Overall, these data demonstrate the process of correcting one-base sequencing errors for completely sequenced genes:
Step (1): select the molecular label with the most sequencing reads as the first parent molecular label, provided its sequencing reads exceed 25.
Step (2): select the molecular labels with ≤3 sequencing reads, identify those that differ from the first parent molecular label by one base, and call them child molecular labels; if no child molecular label one base away is found, go to step (5).
Step (3): perform the multiple binomial test on all child molecular labels against the parent molecular label, remove the child molecular labels for which the null hypothesis is accepted, and assign their sequencing reads to the parent molecular label. If none of the null hypotheses is accepted, none of the child molecular labels was a one-base sequencing error of the parent molecular label, and no read correction is needed.
Step (4): update the molecular label sequences and their sequencing reads.
Step (5): select the molecular label with the next most sequencing reads as the parent molecular label, and repeat the preceding steps until no eligible parent or eligible child molecular label remains.
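The five steps can be sketched as a loop over candidate parents. The following is an illustrative sketch under stated assumptions rather than the disclosed implementation: all names are hypothetical, labels are assumed to be equal-length strings, e = 0.005 is an assumed error rate, and the multiple-testing step uses the standard Benjamini-Hochberg rule (children whose null hypothesis is not rejected are merged into the parent).

```python
from math import comb


def hamming(a, b):
    """Number of mismatching positions between two equal-length labels."""
    return sum(x != y for x, y in zip(a, b))


def tail(r_child, r_par, p):
    """P(X >= r_child) for X ~ Binomial(r_child + r_par, p)."""
    n = r_child + r_par
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(r_child, n + 1))


def correct_one_base_errors(counts, e=0.005, min_parent=25, max_child=3,
                            alpha=0.05):
    """Merge one-base error labels into their parents; returns a new dict."""
    counts = dict(counts)
    p = e / 2
    visited = set()
    while True:
        # Steps (1)/(5): most abundant unvisited label with > min_parent reads
        parents = [l for l, r in counts.items()
                   if r > min_parent and l not in visited]
        if not parents:
            return counts
        parent = max(parents, key=counts.get)
        visited.add(parent)
        # Step (2): low-read labels exactly one base away from the parent
        children = [l for l, r in counts.items()
                    if r <= max_child and hamming(l, parent) == 1]
        if not children:
            continue
        # Step (3): binomial tests with Benjamini-Hochberg control
        pvals = {c: tail(counts[c], counts[parent], p) for c in children}
        ranked = sorted(children, key=pvals.get)
        m = len(ranked)
        cutoff = 0
        for j, c in enumerate(ranked, start=1):
            if pvals[c] <= j / m * alpha:
                cutoff = j
        # Step (4): children with non-rejected nulls are errors; merge reads
        for c in ranked[cutoff:]:
            counts[parent] += counts.pop(c)


if __name__ == "__main__":
    demo = {"ACGACGAC": 200, "ACGACGAA": 1, "ACGACGAG": 2,
            "CCCCCCCC": 40, "CCCCCCCA": 1, "GGGGAAAA": 2}  # made-up data
    print(correct_one_base_errors(demo))
```

In this made-up example the low-read labels one base away from an abundant label are folded into it, while `GGGGAAAA`, which has no eligible parent nearby, is left untouched.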

Table 5 shows the updated TFRC sequencing data after one-base sequencing errors were removed using the analysis described above. The number of unique molecular labels decreased from 23 (shown in Table 4) to 11.

Use of Poisson Models for Thresholding
Sequencing errors become more likely under complete sequencing. Some types of errors, such as one-base sequencing errors, can be corrected, but others, such as random incorporation of artificial molecular labels, cannot be corrected on the basis of sequence similarity. Instead, these types of errors can be identified by modeling. As noted above, complete sequencing tends to be overdispersed relative to a Poisson distribution. Two distinct Poisson models were therefore constructed to account for the overdispersion: one models the sequencing reads of true molecular labels (that is, molecular label sequences used to label target molecules during the probability barcoding step), and the second models the sequencing reads of error molecular labels (that is, molecular label sequences not used during the probability barcoding step that appear after sequencing because of errors). The sequencing error rate can be about 0.1% to 1%, and the per-cycle PCR error rate about 0.001%. PCR errors can occur more often in later PCR cycles, producing error molecular labels with few sequencing reads, yet can account for a large share of all observed molecular label sequences. Errors caused by PCR and sequencing therefore often have fewer sequencing reads than true molecular labels, so the Poisson mean of the sequencing reads of true molecular labels is larger than that of error molecular labels.

Suppose there are k distinguishable molecular labels in total, of which t are true molecular labels, BC₁, BC₂, ..., BC_t, and the rest are error molecular labels, BC_{t+1}, BC_{t+2}, ..., BC_k. The sequencing reads mapped to these true and error molecular labels are R₁, R₂, ..., R_t and R_{t+1}, R_{t+2}, ..., R_k. Assuming further that the Poisson means for true and error molecular labels are μ_t and μ_n (μ_t > μ_n), the probability of the whole process is:
$$L_t = \prod_{i=1}^{t} P(X_i = R_i \mid \mu_t)\,\prod_{i=t+1}^{k} P(X_i = R_i \mid \mu_n)$$
(where P(X_i = R_i | μ_t) denotes the probability of observing the i-th molecular label with abundance R_i under a Poisson process with mean μ_t).

To determine the number of true molecular labels t, a series of models was considered: starting with a model that assumes all molecular labels are true (so l = k); a second model that assumes the least abundant molecular label is an error and all others are true (so l = k − 1); and so on, down to a last model that assumes only the most abundant molecular label is true and all others are error molecular labels (so l = 1). The best model is the one with the highest likelihood among all models considered or, equivalently, the smallest Akaike information criterion (AIC). AIC can be used for model selection because it measures the relative quality of each candidate model for the given data. Mathematically, AIC is defined as AIC = −log L + 2p, where p is the number of parameters estimated in the model. Thus p = 1 for L_k and L₁, and p = 2 otherwise. The example in Table 6 shows that, of the eight candidate models compared, only the three molecular labels with the three highest sequencing read counts are considered true molecular labels. FIG. 20 also shows that the threshold derived from the selected model (the top three) clearly separated the true molecular labels from those likely to be errors.
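The model comparison can be sketched as a scan over the number of labels assumed true. The following is a minimal sketch under stated assumptions: the function names and read counts are made up for illustration, the two Poisson means are estimated as the sample means of each group, the parameter counts follow the convention stated in the text (p = 1 for l = k and l = 1, otherwise p = 2), and the AIC is computed in its standard form 2p − 2 log L, which weights the log-likelihood term differently from the expression as printed above.

```python
from math import lgamma, log


def poisson_logpmf(r, mu):
    """Log of the Poisson probability of observing r reads with mean mu."""
    if mu == 0:
        return 0.0 if r == 0 else float("-inf")
    return r * log(mu) - mu - lgamma(r + 1)


def select_true_label_count(reads):
    """Return l, the number of top-ranked labels treated as true,
    chosen by the smallest AIC over the models l = 1..k."""
    ordered = sorted(reads, reverse=True)
    k = len(ordered)
    best_l, best_aic = None, None
    for l in range(1, k + 1):
        true_reads, error_reads = ordered[:l], ordered[l:]
        mu_t = sum(true_reads) / l                      # mean of "true" group
        log_l = sum(poisson_logpmf(r, mu_t) for r in true_reads)
        if error_reads:
            mu_n = sum(error_reads) / len(error_reads)  # mean of "error" group
            log_l += sum(poisson_logpmf(r, mu_n) for r in error_reads)
        p = 1 if l in (1, k) else 2                     # convention from the text
        aic = 2 * p - 2 * log_l                         # standard AIC (assumption)
        if best_aic is None or aic < best_aic:
            best_l, best_aic = l, aic
    return best_l


if __name__ == "__main__":
    reads = [120, 110, 95, 3, 2, 2, 1, 1]  # made-up read counts
    print(select_true_label_count(reads))
```

The read count of the l-th ranked label then serves as the threshold separating true labels from likely errors.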

The data show the sequencing reads of completely sequenced genes corrected by removing one-base sequencing errors and by thresholding with Poisson models.

Example 4
Adjustment of Incompletely Sequenced Genes
This example shows the process of adjusting incompletely sequenced genes by removing noisy genes and by using a zero-truncated Poisson model to estimate the total number of molecular labels expected to be present in the library.

Removal of Noisy Genes
Besides considering the statistics of molecular labels and their sequencing reads, gene-level analysis can also be informative. If very few molecular labels are detected for a gene, and each has markedly fewer reads than those of completely sequenced genes, the gene can be considered noisy. This assumption rests on the reasoning that probability-barcoded molecules in the same library should be amplified and sequenced at roughly the same frequency. This expectation can be affected by PCR and sequencing biases arising from differences between molecules, but these biases were assumed to be small relative to the "noise" generated by events such as sample contamination or unwanted molecular recombination during PCR. A gene can be noisy if its amplification rate (mean reads per molecular label) is similar to the amplification rate of the errors derived from completely sequenced genes in the same library.

Specifically, suppose a completely sequenced gene g₁ consists of t₁ true molecular labels and e₁ error molecular labels in total, with R_{g1,1}, R_{g1,2}, ..., R_{g1,t1} the sequencing reads mapped to the true molecular labels and R*_{g1,1}, R*_{g1,2}, ..., R*_{g1,e1} the sequencing reads mapped to the error molecular labels. The amplification rate of the error molecular labels (EAMP) of g₁ is then
$$EAMP_{g_1} = \frac{1}{e_1}\sum_{i=1}^{e_1} R^{*}_{g_1,i}$$
EAMP can be calculated in the same way for all other completely sequenced genes g₂, g₃, ..., g_x. For a potentially noisy gene g'₁ with fewer than 5 observed molecular labels in total (the cutoff) and sequencing reads R_{g'1,1}, R_{g'1,2}, ..., R_{g'1,k} mapped to its molecular labels, the amplification rate is determined as
$$amp_{g'_1} = \frac{1}{k}\sum_{i=1}^{k} R_{g'_1,i}$$
If amp_{g'1} < median(EAMP_{g1}, EAMP_{g2}, ..., EAMP_{gx}), gene g'₁ was considered a noisy gene; otherwise it can be considered an incompletely sequenced gene. Other noisy genes were tested and removed in the same way. A cutoff of 5 molecular labels was chosen because it appears desirable to separate genes with low amplification rates into two distinct cases: artifacts (fewer than 5 molecular labels observed) and incomplete sequencing (≥5 molecular labels observed, owing to poor PCR or sequencing primer performance).
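The noisy-gene test can be sketched in a few lines. This is an illustrative sketch with hypothetical names and made-up data; it assumes that the amplification rate is the mean number of reads per molecular label, and that the comparison is against the median of the error amplification rates (EAMP) of the completely sequenced genes, which is one reading of the criterion described above.

```python
from statistics import mean, median


def is_noisy_gene(gene_reads, complete_gene_error_reads, max_labels=5):
    """Decide whether a low-label gene looks like noise.

    gene_reads: reads per molecular label of the candidate gene.
    complete_gene_error_reads: one list of error-label reads per
        completely sequenced gene in the same library.
    """
    if len(gene_reads) >= max_labels:
        return False  # treated as incompletely sequenced, not as an artifact
    eamps = [mean(reads) for reads in complete_gene_error_reads]
    return mean(gene_reads) < median(eamps)


if __name__ == "__main__":
    error_reads = [[1, 2, 3], [2, 2], [1, 1, 1, 5]]  # made-up error reads
    print(is_noisy_gene([1, 1, 2], error_reads))      # low amplification
    print(is_noisy_gene([10, 12], error_reads))       # well amplified
```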

Estimation Using a Zero-Truncated Poisson Model
When sequencing is incomplete, errors can still be present in the data but may be hard to identify because of the overall shortage of sequencing reads. When sequencing is shallow and not all molecular labels present in the library have been observed, some assumptions are needed for a meaningful analysis. It can be assumed that all observed molecular labels are true, and that the unobserved true molecular labels are truncated at zero, that is, they are truncated molecular labels observed zero times. Although not all probability-barcoded transcripts of a given gene are sampled for sequencing, the full diversity of molecular labels present in the whole library can be estimated by applying a zero-truncated Poisson model to the read frequencies of the detected molecular labels.

Suppose k distinguishable molecular labels were observed with reads (R₁, R₂, ..., R_k), while (S − k) molecular labels were not observed and had zero reads. One goal was to estimate S, the total number of molecular labels expected to be present in the library. Assuming that the frequencies of molecular labels with 1, 2, 3, or more sequencing reads follow a Poisson variate with mean μ truncated at zero, and that the sum of all sequencing reads is n, the likelihood can be expressed as:

$$L(S, \mu) \propto \frac{S!}{(S-k)!}\,\mu^{n}\exp(-S\mu) \qquad \text{(Equation 3)}$$

Traditional inference procedures can be applied to estimate μ, S, and their standard errors. The maximum likelihood estimate (MLE) of μ is n/S, and an approximation to the MLE of S is $k/(1 - e^{-n/S})$ or $k/(1 - (1 - 1/S)^{n})$. Because S appears on both sides, either expression is solved iteratively. FIG. 21 shows a zero-truncated Poisson model fitted to the number of molecular labels and their corresponding sequencing reads. As shown in FIG. 21, 33 unique molecular labels were observed across a total of 39 reads in the partially sequenced library. Based on the frequencies of molecular labels with 1, 2, 3, and 4 sequencing reads, the Poisson model was applied to estimate that a total of 113 molecular labels would be found in the whole library had sequencing proceeded to completion.
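The approximate MLE of S can be checked numerically against the FIG. 21 numbers. The following is a minimal sketch, not the patented implementation: the function name `estimate_total_labels` and the choice of bisection are assumptions made for illustration. It solves S·(1 − e^(−n/S)) = k, which is the first of the two approximations given above.

```python
import math

def estimate_total_labels(k, n, tol=1e-9):
    """Solve S * (1 - exp(-n / S)) = k for S by bisection (illustrative helper)."""
    if k <= 0 or n <= k:
        # With n == k every label was seen exactly once and no finite S exists.
        raise ValueError("need 0 < k < n for a finite estimate")

    def excess(s):
        # Expected number of observed labels for library size s, minus observed k.
        return s * (1.0 - math.exp(-n / s)) - k

    lo, hi = float(k), 2.0 * k
    while excess(hi) < 0:      # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Values reported for FIG. 21: 33 unique molecular labels across 39 reads.
S_hat = estimate_total_labels(k=33, n=39)
mu_hat = 39 / S_hat            # MLE of the Poisson mean, n / S
print(round(S_hat, 1), round(mu_hat, 3))
```

With k = 33 and n = 39 the solution is close to 113, in line with the estimate quoted for FIG. 21.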

Overall, these data show that the sequencing reads of incompletely sequenced genes were adjusted by removing noisy genes and by using a zero-truncated Poisson model to estimate the total number of molecular labels expected to be present in the library.

Example 5
Completely Sequenced Genes and Incompletely Sequenced Genes
This example shows an example of the output generated after adjusting the sequencing reads of completely sequenced genes and incompletely sequenced genes.

Table 7 provides an example of the output generated after adjusting the sequencing reads of completely and incompletely sequenced genes. The column headings are as follows. "Gene ID" gives the name of the detected gene. "Sequencing status" takes one of three possible outcomes (complete, incomplete, or saturated) and determines the analysis method; the classification was made according to the dispersion index and the sequencing reads mapped to the most abundant molecular label (ML). "Uncorrected ML" is the count of unique molecular labels observed for the gene ("0" for undetected genes). "Uncorrected reads" is the sum of the sequencing reads mapped to the uncorrected MLs ("0" for undetected genes). "Corrected ML" is the count of unique molecular labels regarded as true molecular labels after applying the algorithm (completely sequenced genes only; "NA" for incomplete genes; "0" for noisy and undetected genes). "Corrected reads" is the sum of the sequencing reads mapped to the corrected MLs (completely sequenced genes only; "NA" for incomplete genes; "0" for noisy and undetected genes). "Extrapolated ML" is the number of unique molecular labels estimated by the zero-truncated Poisson model (incompletely sequenced genes only; "NA" for complete genes; "0" for noisy and undetected genes). "Estimated Mol" is the number of molecules estimated from the corrected ML (for completely sequenced genes) or the extrapolated ML (for incompletely sequenced genes), and is "0" for noisy and undetected genes. "Estimated Mol LB" is the lower bound of the estimated number of molecules. "Estimated Mol UB" is the upper bound of the estimated number of molecules.

In Table 7, Estimated Mol (n), the estimated number of starting molecules, was calculated as follows:

$$n = -m\,\log\!\left(1 - \frac{k}{m}\right) \qquad \text{(Equation 4)}$$

where m is the total diversity of molecular labels ($3^{8}$) and k is the total number of unique molecular labels observed. The variance of n, var(n), was derived using a Taylor expansion: $\mathrm{var}(n) = \left(\frac{m}{m-k}\right)^{2}\mathrm{var}(k)$, where var(k) can be expressed as $m\left(1-(1-1/m)^{n}\right)(1-1/m)^{n} + m(m-1)\left((1-2/m)^{n} - (1-1/m)^{2n}\right)$. The lower and upper bounds of the estimated number of starting molecules (Estimated Mol LB and Estimated Mol UB) were calculated using an expression that is not reproduced in this text.
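Equation 4 and the variance expression can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, the logarithm is taken to be the natural logarithm (an assumption consistent with the occupancy model behind the formula), and k = 100 is a made-up count rather than a value from Table 7. The bounds are not computed because their formula is not available in the text.

```python
import math

M_LABELS = 3 ** 8  # total molecular label diversity stated in the text (6561)

def estimated_molecules(k, m=M_LABELS):
    """Equation 4: n = -m * log(1 - k/m), using the natural logarithm."""
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    return -m * math.log1p(-k / m)

def variance_of_k(n, m=M_LABELS):
    """var(k) as given in the text, for n starting molecules and m labels."""
    p1 = (1 - 1 / m) ** n   # probability that a given label is unused
    p2 = (1 - 2 / m) ** n   # probability that two given labels are both unused
    return m * (1 - p1) * p1 + m * (m - 1) * (p2 - p1 ** 2)

def variance_of_n(k, m=M_LABELS):
    """var(n) = (m / (m - k))^2 * var(k), evaluated at the estimated n."""
    n = estimated_molecules(k, m)
    return (m / (m - k)) ** 2 * variance_of_k(n, m)

# Hypothetical observation: 100 unique molecular labels for one gene.
n_hat = estimated_molecules(100)
var_hat = variance_of_n(100)
print(round(n_hat, 2), round(var_hat, 2))
```

For small k the estimate stays close to k itself, and the correction grows as k approaches m, where the logarithm diverges.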

Overall, these data illustrate the steps of adjusting completely sequenced genes and incompletely sequenced genes.

Example 6
Performance of Correction for Completely Sequenced Genes and Incompletely Sequenced Genes
This example shows the performance of correcting the sequencing reads of completely sequenced genes. Performance was assessed from the errors and noise removed from the uncorrected molecular label counts and from the sequencing reads.

Several completely sequenced genes were selected to test the performance of correcting their sequencing reads. Table 8 compares several measurements for these genes before and after the sequencing reads were corrected or adjusted. Uncorrected ML, uncorrected reads, corrected ML, and corrected reads were taken directly from the output table. Uncorrected amp (the amplification rate using uncorrected data) and filtered amp (the amplification rate using the true molecular label data after correction) were calculated as (uncorrected reads / uncorrected ML) and (corrected reads / corrected ML). The percentage of ML retained as true molecular labels, relative to the total number of molecular labels observed before correction, was defined as 100 × corrected ML / uncorrected ML, and the % retained reads was defined similarly as 100 × corrected reads / uncorrected reads. Table 8 shows example genes at various abundance levels, including GAPDH and ACTB, which present more molecular labels and total reads. The number of true molecular labels after correction accounted for less than 7% of the total molecular labels found in the uncorrected data, meaning that more than 93% of the molecular labels were regarded as error molecular labels and discarded. Although 93% of the uncorrected molecular labels were removed as noise, the true molecular labels contributed at least 72% of the reads, which also means that the discarded error molecular labels consisted of far fewer reads. Furthermore, the amplification rate after applying the algorithm ranged from 137 to 413, much higher than that obtained with the uncorrected data (6.1 to 29.4). The corrected amplification rate was a much more realistic measurement, and it correlated with a PCR efficiency of at least 75%.
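The four derived measurements defined above are simple ratios. A minimal sketch follows, in which the function name, the dictionary keys, and the input counts are all hypothetical and are not taken from Table 8.

```python
def correction_summary(raw_ml, raw_reads, corrected_ml, corrected_reads):
    """Derived per-gene measurements as defined in the text (illustrative names)."""
    return {
        "uncorrected_amp": raw_reads / raw_ml,              # reads per ML before correction
        "filtered_amp": corrected_reads / corrected_ml,     # reads per ML after correction
        "pct_ml_retained": 100 * corrected_ml / raw_ml,
        "pct_reads_retained": 100 * corrected_reads / raw_reads,
    }

# Hypothetical counts for one completely sequenced gene.
summary = correction_summary(raw_ml=1000, raw_reads=20000,
                             corrected_ml=60, corrected_reads=15000)
print(summary)
```

In this made-up case only 6% of the molecular labels survive correction yet they carry 75% of the reads, which is the pattern the example describes.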

Overall, these data show that correcting the sequencing reads of completely sequenced genes significantly reduced the errors and noise in the uncorrected molecular label counting data while still retaining the ability to use most of the sequencing reads.

Example 7
Tools for Summarizing and Visualizing Counting Data of Stochastically Barcoded Targets
This example presents tools for summarizing and visualizing the counting data of stochastically barcoded targets shown in the previous examples.

For the test data, two plates of single cells were prepared for processing with the Precise™ assay (Cellular Research, Inc. (Palo Alto, CA)). In this experiment, two different cell types were used at a 4:1 ratio, and the identity of the cell placed in each well was kept hidden from the researchers performing the experiment. The goal of this study was to identify the cell type in each well using the gene expression profiles obtained from the stochastic barcode counts.

To evaluate the overall sequencing data quality in the wells, the total sequencing reads per well were calculated. To evaluate the performance of the correction method, several statistical measurements taken before and after applying the correction method were then tabulated and compared. In addition, graphical plots provide a visual representation of the data, making anomalies or patterns easy to detect.

FIGS. 9 and 10 show the total sequencing reads per well for plate 1, including wells with fewer than 5000 sequencing reads (in italics). Wells with much lower read counts, such as fewer than 5000 reads, may indicate that no single cell was assigned to the well, so these wells should be excluded from further analysis.

Tables 10 and 11 compare several measurements before and after the correction method. These tables show large variation in "uncorrected reads" (total sequencing reads per well) and "uncorrected ML" (total molecular label count per well). This large variation is reflected in their standard deviations (SD) being greater than their means, which also indicates the presence of low-read wells. After using the method, about 47% of the genes present in each well were classified as completely sequenced genes. If most genes are classified as incompletely sequenced (for example, 0% complete), the method may not remove noise from the data. For each well, about 15% of the molecular labels were retained after correction of the complete genes, yet these molecular labels mapped to an average of 95% of the sequencing reads. The higher the % retained reads, the more effectively the correction method captures the signal (reads contributed by true molecular labels) while removing noise. In addition, the amplification rate of each molecular label retained as a true molecular label was 163.32, much higher than the 22.76 obtained before applying the correction method.

FIG. 22 shows a bar graph of total sequencing reads per well, giving a direct visualization of relative input across all 96 wells. The figure shows that wells C02 and F11 had higher read counts than the others, which may indicate multiple cells in these wells. Wells A12, B01, B07-B12, C03, C04, C07, C11, D07, D08, D11, E05, E08, F04-F10, F12, G03, G07, H03, H04, H07-H09, and H10-H11 had much lower read counts than the other wells, which may indicate that no cell was placed in them.

FIG. 23 shows bar graphs of the % completely sequenced genes, the % molecular labels (ML) retained as true molecular labels, and the % retained reads mapped to the retained MLs, for each well. It shows, per well, the percentage of genes classified as complete, to which the correction method can be applied to remove noise (lower row for each well); the level of noise in terms of molecular labels (the percentage of molecular labels regarded as true after applying the correction method, relative to the molecular labels observed before applying it; upper row for each well); and the level of noise in terms of sequencing reads (the percentage of reads mapped to true molecular labels, relative to all uncorrected reads; middle row for each well). As shown, the % completely sequenced genes varied by well but was much lower in wells A12, B01, B07-B12, C03, C04, C07, D07, D08, D11, E05, E08, F04-F10, F12, G03, G07, H03, H06, H07, and H10-H11, consistent with the wells that had much lower read counts. The % retained ML (upper row) was generally below 20% in all wells, whereas the % retained reads (middle row) exceeded 90% in all wells. This type of plot can give an idea of how effective the correction method is at removing noise while maximizing the signal in each well.

FIG. 24 shows box plots of the % retained reads, which varies by gene, for each well. Box plots at the gene level reveal detailed information, such as how well the correction method worked for each gene in a well, that bar graphs at the well level cannot show. The box plots of % retained reads for all completely sequenced genes per well in FIG. 24 showed that the variation between genes can be substantial, for example in wells D11, F4, F8, H3, and H8, whose whiskers exceed 0.6. However, these five wells corresponded to much lower total sequencing reads: 3357, 5457, 2874, 3414, and 4043.

Clustering can be used to analyze gene expression data. Principal component analysis (PCA) can be used for dimensionality reduction, converting possibly correlated variables into a smaller number of variables through an orthogonal transformation. The leading principal components from PCA can be used to search for clusters in the data.

FIGS. 25A-25B show PCA plots using uncorrected ML versus corrected ML after applying the algorithm, for the two plates. FIG. 25A shows the PCA plot using the uncorrected ML of each gene per well, for wells with total sequencing reads > 5000. This PCA plot was generated by, first, removing wells with total sequencing reads < 5000 (leaving 139 wells and 107 genes, excluding 3 control genes); second, removing genes with zero uncorrected ML across all 139 wells (leaving 85 genes); third, taking the logarithm of uncorrected ML plus one so that zeros could be kept in the data set; and then applying PCA to the log data after centering and scaling. The PCA plot clearly showed two clusters, but the cell type was difficult to determine for wells such as D02, D05, and F06, which were roughly equidistant from both clusters. Clustering results can be degraded by added noise, and even a small number of noisy variables can destroy a clear cluster structure. Clustering can therefore benefit from a preprocessing step of feature/variable selection, or from a filtering or denoising step. Applying the correction method to the completely sequenced data yielded a clear cluster structure, as shown in FIG. 25B. The PCA plot in FIG. 25B was generated in the same way as FIG. 25A, except that corrected ML (the count of true molecular labels of completely sequenced genes after applying the correction method) was used instead of uncorrected ML (the count of molecular labels of all genes detected before applying the algorithm); 75 genes across a total of 139 wells were used. Two distinguishable clusters were observed, well separated by the y-axis (PC2). Compared with FIG. 25A, the clusters in FIG. 25B were compact, and the cell in each well was clearly assigned to a cluster. In addition, the small cluster to the right of the y-axis in FIG. 25B consisted of 31 wells, about 22% of all wells, which is quite close to the expected 20%.
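The preprocessing steps that precede PCA (well filtering, gene filtering, log transform, centering and scaling) can be sketched with the standard library alone. Everything below is an illustrative assumption: the function name, the nested-dictionary input layout, the use of the population standard deviation for scaling, the treatment of wells with exactly 5000 reads, and the gene and well values. The resulting matrix would then be passed to any PCA routine.

```python
import math
import statistics

def preprocess_for_pca(counts, total_reads, min_reads=5000):
    """counts: {well: {gene: ML count}}; total_reads: {well: total reads}."""
    # Step 1: remove wells with total sequencing reads below the threshold.
    wells = sorted(w for w in counts if total_reads[w] >= min_reads)
    # Step 2: remove genes whose ML count is zero across all remaining wells.
    genes = sorted({g for w in wells for g, c in counts[w].items() if c > 0})
    # Step 3: take log(count + 1) so that zeros can stay in the data set.
    logged = {g: [math.log(counts[w].get(g, 0) + 1) for w in wells] for g in genes}
    # Step 4: center and scale each gene (assumption: population SD).
    scaled = {}
    for g, values in logged.items():
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        scaled[g] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    # Rows are wells and columns are genes.
    matrix = [[scaled[g][i] for g in genes] for i in range(len(wells))]
    return wells, genes, matrix

# Hypothetical ML counts for three wells and three genes.
counts = {
    "A01": {"GAPDH": 40, "ACTB": 0, "CD3E": 0},
    "A02": {"GAPDH": 10, "ACTB": 0, "CD3E": 5},
    "A03": {"GAPDH": 3, "ACTB": 7, "CD3E": 1},
}
total_reads = {"A01": 12000, "A02": 9000, "A03": 3000}
wells, genes, matrix = preprocess_for_pca(counts, total_reads)
print(wells, genes)
```

Here well A03 is dropped for low reads, after which ACTB has no counts left in any well and is dropped as well.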

Overall, these data present several tools that are useful for summarizing and visualizing the counting data of stochastically barcoded targets.

Example 8
ML Coverage of Each ML in a Plate for the Highly Expressed Gene ACTB
This example demonstrates that ML errors arising during sequencing or PCR generally have a distribution distinguishable from that of true MLs.

In addition to absolute gene expression counting and PCR bias correction, MLs can provide a better understanding of the library preparation method and of the statistical quality of the sequencing data. From the number of reads showing the same gene ML (referred to as ML coverage), it is possible to detect sequencing base-call errors or PCR errors generated during library preparation. For example, a gene ML from a given SL that is represented by multiple reads is more likely to be an accurate measurement than a gene ML from a given SL that is represented by only a single read. Low-ML-coverage barcodes in the presence of high ML coverage in the same library are often artifacts or errors generated during the sequencing run or during the PCR steps of library preparation. ML errors arising during sequencing or PCR generally have a distribution distinguishable from that of true MLs. FIG. 27 is an exemplary plot of the molecular label coverage of each molecular label in a microplate for the highly expressed gene ACTB, in which distinguishable distributions were observed for error molecular labels and real molecular labels. FIG. 28 is an exemplary plot showing the fit of two negative binomial distributions to the molecular label coverage of each molecular label in a microplate for the highly expressed gene ACTB. The fit of the two negative binomial distributions demonstrates that molecular label errors, which have low molecular label depth, and true molecular labels, which have higher molecular label depth, follow statistically distinguishable distributions. The x-axis is the molecular label depth.

Overall, these data demonstrate that ML errors arising during sequencing or PCR generally have a distribution distinguishable from that of true MLs.

Example 9
Correction of Molecular Labels Affected by PCR or Sequencing Errors
This example describes a method for correcting molecular labels affected by PCR and sequencing substitution errors. The method can be applied to whole transcriptome assays without assuming uniform coverage and without requiring the high sequencing coverage needed for a complete sequencing status.

Deduplication was performed on the first mapping coordinate and the unique molecular label (UMI) of each read; reads with the same start coordinate, UMI, and strand were assumed to be identical. After deduplication, the UMI with the highest count per cluster was retained (Table 13).

Molecular labels (MLs) were corrected gene by gene. For each gene, clusters of MLs were identified using directional adjacency. The directional adjacency method clustered MLs when they were within a Hamming distance of 1 and parent ML count ≥ 2 × (child ML count) − 1. All MLs within the same cluster were considered to originate from the same parent ML, and the child ML counts were collapsed into the parent ML. FIG. 29 illustrates molecular label correction, where pairwise Hamming distances of 1 accounted for a large proportion. After molecular label correction, molecular labels differing by a Hamming distance of 1 had been clustered and collapsed into the same parent molecular label. FIG. 30 shows a curve of the number of corrected MLs against read coverage. Because all reads were retained, this method can also be used to remove single-base PCR or sequencing errors.
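The directional adjacency rule above can be sketched as a small graph traversal. This is an assumed, simplified implementation rather than the method as actually coded for the example: the function names, the breadth-first traversal starting from the most abundant label, the tie-breaking by label sequence, and the counts are all hypothetical.

```python
from collections import deque

def hamming(a, b):
    """Number of mismatched positions between two equal-length labels."""
    return sum(x != y for x, y in zip(a, b))

def directional_collapse(counts):
    """Cluster MLs by directional adjacency and sum counts onto each parent."""
    # Visit labels from most to least abundant (ties broken alphabetically).
    order = sorted(counts, key=lambda ml: (-counts[ml], ml))
    assigned = set()
    collapsed = {}
    for parent in order:
        if parent in assigned:
            continue
        assigned.add(parent)
        total = counts[parent]
        queue = deque([parent])
        while queue:
            node = queue.popleft()
            for child in order:
                # Edge rule from the text: Hamming distance 1 and
                # parent count >= 2 * child count - 1.
                if (child not in assigned
                        and hamming(node, child) == 1
                        and counts[node] >= 2 * counts[child] - 1):
                    assigned.add(child)
                    total += counts[child]
                    queue.append(child)
        collapsed[parent] = total
    return collapsed

# Hypothetical ML counts for one gene.
raw = {"ACGT": 10, "ACGA": 3, "ACAA": 1, "TTTT": 4, "TTTA": 4}
print(directional_collapse(raw))
```

In this made-up input, ACGA and ACAA fold into ACGT through a chain of single-base differences, while TTTT and TTTA stay separate because neither count is sufficiently larger than the other. The total read count is unchanged by the collapse.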

Overall, these data demonstrate a correction method that retains all reads and can therefore be applied to correct or adjust the data of whole transcriptome assays.

Example 10
Molecular Label Counting for High-Input Samples
This example describes how unique molecular labels are used as the number of input molecules increases.

The BD Precise™ Targeted Assay is considered most suitable for small sample inputs (for example, single cells), so as to allow stochastic and unique labeling of mRNAs. As the number of transcripts increases relative to the barcode pool in high RNA/cell input experiments, the percentage of MLs reused to label the same gene increases, as calculated theoretically using the Poisson distribution (FIG. 26). Under these circumstances, quantifying gene expression using MLs without statistical correction, that is, without Poisson correction or correction based on two negative binomial distributions, would underestimate the number of molecules initially present.

For very high-input samples in which the number of mRNAs per gene exceeds the entire collection of 6561 barcodes, neither Poisson correction nor correction based on two negative binomial distributions is possible any longer. For example, whether there are 65,000 or 100,000 input molecules, at most 6561 barcodes, all saturated, are expected in either case. Genes and samples that appear to have high sample input can therefore be adjusted, as their ML counts are likely to be underestimated.
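The saturation effect described above can be illustrated with a Poisson approximation of random labeling. The sketch below is an assumption-laden illustration: the function names are hypothetical, the Poisson approximation with mean n/m stands in for whatever calculation underlies FIG. 26, and "reused" is taken to mean a label attached to two or more molecules of the same gene.

```python
import math

M_LABELS = 6561  # size of the barcode collection stated in the text

def expected_distinct_labels(n_molecules, m=M_LABELS):
    """Expected number of distinct labels seen when n molecules are labeled."""
    return m * (1 - math.exp(-n_molecules / m))

def reused_label_fraction(n_molecules, m=M_LABELS):
    """Among labels used at least once, the fraction used two or more times."""
    lam = n_molecules / m                      # mean molecules per label (n > 0)
    p_used = 1 - math.exp(-lam)
    p_reused = 1 - math.exp(-lam) * (1 + lam)
    return p_reused / p_used

for n in (100, 1000, 65000, 100000):
    print(n, round(expected_distinct_labels(n), 1),
          round(reused_label_fraction(n), 3))
```

Under this approximation both 65,000 and 100,000 input molecules yield essentially all 6561 labels, so the observed ML count can no longer distinguish between them.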

Overall, these data demonstrate the need to adjust uncorrected data when quantifying gene expression using MLs.

Example 11
Recursive Substitution Error Correction (RSEC)
This example describes recursive substitution error correction.

Two complementary methods can be used in the BD Precise™ Targeted Assay analysis pipeline to remove ML errors. Briefly, ML errors arising from sequencing base-call substitution errors are identified and adjusted to the true ML barcodes using recursive substitution error correction (RSEC). Subsequently, ML errors arising from the library preparation steps or from sequencing base deletion errors are adjusted using distribution-based error correction (DBEC).

The RSEC algorithm can adjust ML errors arising from PCR or sequencing substitutions. These rare error events are observed when examining ML coverage. For example, the ML coverage of error MLs can be significantly lower than that of true MLs in an adequately sequenced sample (FIG. 27); however, if two very similar MLs were used during the initial Molecular Indexing™ (reverse transcription) step, they generally have similar ML coverage and need not be removed. Because more ML errors appear as sequencing depth increases, RSEC can be important for adjusting the ML counts of deeply sequenced barcoded libraries.

Briefly, RSEC considers two factors in error correction: 1) the similarity of the ML sequences; and 2) their ML coverage. For each target gene, MLs are connected if their ML sequences are within one base of each other (Hamming distance = 1). For each connection between MLs x and y, if

$$\text{coverage}(y) > 2 \times \text{coverage}(x) + 1 \qquad \text{(Equation 5)}$$

then y is designated the "parent ML" and x the "child ML".

Based on this assignment, the child ML can be collapsed into its parent. This process is recursive, continuing until no identifiable parent/child MLs remain for the gene.
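The recursive collapse under Equation 5 can be sketched as a loop that repeats until no parent/child pair is left. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function names are hypothetical, the order in which children are processed (lowest coverage first) and the choice of the highest-coverage eligible parent are assumptions, and the sequences and coverages are invented.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length labels."""
    return sum(x != y for x, y in zip(a, b))

def rsec_collapse(coverage):
    """Collapse child MLs into parents until no parent/child pair remains."""
    cov = dict(coverage)
    while True:
        merge = None
        # Assumption: visit candidate children from lowest to highest coverage.
        for child in sorted(cov, key=lambda ml: (cov[ml], ml)):
            parents = [p for p in cov
                       if hamming(p, child) == 1           # one base apart
                       and cov[p] > 2 * cov[child] + 1]    # Equation 5
            if parents:
                # Assumption: choose the eligible parent with the highest coverage.
                merge = (child, max(parents, key=lambda ml: (cov[ml], ml)))
                break
        if merge is None:
            return cov
        child, parent = merge
        cov[parent] += cov.pop(child)

# Hypothetical ML coverages for one gene.
raw = {"AAAA": 100, "AAAT": 10, "AATT": 2, "AAAC": 60, "CCCC": 50}
print(rsec_collapse(raw))
```

With these invented values, AATT folds into AAAT, which in turn folds into AAAA, giving the chain-like behavior the text calls recursive. AAAC is one base away from AAAA but has comparable coverage, so it is kept as a separate ML, and CCCC has no neighbor within one base.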

FIG. 31 shows a schematic of an example of the recursive substitution error correction outlined above. The MLs in the uncorrected data before RSEC correction comprise nine unique MLs: GTCAAATT, GTCAAAAT, GTCAAAAA, TTCAAAAA, TTCAGAAA, CTCAAAAA, TTCAAACT, TTCAAAAT, and TTCAAACA. By applying RSEC, ML GTCAAATT can be collapsed into ML GTCAAAAT, because the two MLs differ by one nucleotide (underlined in the figure) and ML GTCAAATT has a lower ML count than GTCAAAAT. Next, that ML can in turn be collapsed into an ML having a higher ML count than GTCAAAAT (the differences in the ML sequences are underlined in the figure). Similarly, MLs TTCAGAAA and CTCAAAAA can be collapsed into ML TTCAAAAA. ML TTCAAACT can be collapsed into ML TTCAAAAT, which in turn can be collapsed into ML TTCAAAAA. ML TTCAAACA is described as differing from all the other MLs by two or more nucleotides, so it is not collapsed into any of the other eight MLs. Before RSEC correction, the uncorrected ML count was 9. After RSEC correction, the ML count was two: MLs TTCAAAAA and TTCAAACA.

全体として、これらのデータは、未補正ＭＬカウントを訂正するためにＲＳＥＣを使用する工程を実証する。 Overall, these data demonstrate the process of using RSEC to correct uncorrected ML counts.

実施例１２
ＭＬカバー率計算
この実施例は、ＭＬカバー率計算を説明する。
Example 12
ML Coverage Calculation
This example illustrates ML coverage calculation.

ＲＳＥＣの後、ウェル当たりの遺伝子ＭＬカウントを評価して、さらなる訂正についてそれらの適合性を判定する。低ＭＬカバー率（＜ＭＬ当たり４リード）を有する遺伝子は、次の訂正工程を迂回し、最終ＭＬデータ表に報告されて、バイオインフォマティクスパイプラインに「低深度」であると記録される。考えられる６５６１のバーコードのうち少なくとも６５５７が観察されるといった、極めて高い入力を有する遺伝子の場合、バーコード多様性のために分子の数を決定するのは困難となり、遺伝子は、「飽和」として表示される。２つの決定地点のいずれも満たさない遺伝子ＭＬについては、次のＤＢＥＣアルゴリズムに進み、出力ログファイル内で「合格」と表示される。さらに、ウェル当たり平均６５０ＭＬより高いＭＬを有する遺伝子は、これらのＭＬの＞５％は、ポアソン分布に基づいて再利用されるため、「高入力」であると記録される（図２７）。 After RSEC, the per-well ML counts of each gene are evaluated to determine their suitability for further correction. Genes with low ML coverage (< 4 reads per ML) bypass the next correction step, are reported in the final ML data table, and are recorded as "low depth" by the bioinformatics pipeline. For genes with extremely high input, for example when at least 6557 of the 6561 possible barcodes are observed, barcode diversity makes it difficult to determine the number of molecules, and the gene is labeled "saturated". Gene MLs that meet neither of these two decision points proceed to the subsequent DBEC algorithm and are labeled "pass" in the output log file. In addition, genes with more than an average of 650 MLs per well are recorded as "high input", because > 5% of these MLs are reused according to the Poisson distribution (FIG. 27).
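The decision points described above can be sketched as a small Python function. This is only an illustration under stated assumptions: the name `classify_gene`, its parameters, and the order in which the checks are applied are hypothetical; ML coverage is taken here as the mean number of reads per ML; and the "high input" flag is applied to a single well rather than to a per-well average.

```python
def classify_gene(read_depths, saturation_threshold=6557,
                  min_coverage=4.0, high_input_mls=650):
    """Return (status, high_input) for one gene in one well.

    read_depths: list of read counts, one entry per unique ML after RSEC.
    status is "low depth", "saturated", or "pass".
    """
    n_ml = len(read_depths)
    if n_ml == 0:
        raise ValueError("at least one ML is required")
    coverage = sum(read_depths) / n_ml  # assumption: mean reads per ML
    if coverage < min_coverage:
        status = "low depth"    # bypasses DBEC
    elif n_ml >= saturation_threshold:
        status = "saturated"    # nearly all 6561 barcodes observed
    else:
        status = "pass"         # proceeds to DBEC
    # Assumption: the high-input flag is evaluated independently of status.
    return status, n_ml > high_input_mls


print(classify_gene([10] * 100))   # moderately expressed, well covered
print(classify_gene([2] * 100))    # fewer than 4 reads per ML
print(classify_gene([10] * 6560))  # almost every barcode observed
```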

全体として、この実施例は、ＭＬカバー率計算を説明する。 Overall, this example illustrates ML coverage calculations.

実施例１３
分布ベースのエラー訂正（ＤＢＥＣ）
この実施例は、分布ベースのエラー訂正を説明する。
Example 13
Distribution-Based Error Correction (DBEC)
This example illustrates distribution-based error correction.

ＲＳＥＣとは異なり、ＤＢＥＣアルゴリズムは、ＭＬが、そのＭＬ配列にかかわらず、エラーまたは真のシグナルであるかを識別するための方法である。ＲＳＥＣは、エラーを訂正するために、ＭＬ配列およびＭＬカバー率情報の両方に依存するが、ＤＢＥＣは、非置換エラー訂正について訂正するために、主としてＭＬカバー率だけに依存する。前述したように、エラーバーコードは、一般に、真のバーコードＭＬカバー率とは異なる低いＭＬカバー率を有し；このＭＬカバー率の差は、異なる分布として、ＭＬカバー率のヒストグラムプロットで認めることができる（図２７）。この差を仮定して、ＤＢＥＣは、ＭＬエラー（より低いＭＬカバー率を有する）と、より高いＭＬカバー率を有する真のシグナルのものとを統計学的に識別するために、２つのネガティブ二項分布を当てはめる。 Unlike RSEC, the DBEC algorithm identifies whether an ML is an error or a true signal regardless of its ML sequence. RSEC relies on both ML sequence and ML coverage information to correct errors, whereas DBEC relies primarily on ML coverage alone to correct non-substitution errors. As described above, error barcodes generally have a low ML coverage that differs from the ML coverage of true barcodes; this difference in ML coverage can be seen as distinct distributions in a histogram plot of ML coverage (FIG. 27). Given this difference, DBEC fits two negative binomial distributions to statistically distinguish ML errors (with lower ML coverage) from true signals (with higher ML coverage).

最適分布当てはめのための再使用ＭＬの除去
所与の遺伝子について、検出されたＭＬが増加するにつれて、再使用されるＭＬ（すなわち、同じ遺伝子由来する２つ以上のｍＲＮＡを標識するために同じＭＬが使用される）のパーセンテージは、増加することから、推定することができる。ポアソン分布（γ_non-unique）を用いて、ウェルｉの再使用ＭＬの数（ｎ_non-unique,i）をＭＬ再使用率方程式（方程式（６））から推定する。推定再使用ＭＬが、ウェルｉにおける所与の遺伝子の総ＭＬの５％より大きければ、ウェルｉにおけるこの遺伝子は、「高入力」と表示される。これらの「高入力」データの場合、より優れた二項分布を取得するために、最大ＭＬカバー率ＭＬは、分布当てはめから除外される（しかし、後のカウント工程のために保存される）。
Ｐ（Ｘ＞１│λ_non-unique），λ_non-unique＝Ｎｕｍｂｅｒ ｏｆ ＭＬ／６５６１　式（６）

Removal of Reused MLs for Optimal Distribution Fitting
For a given gene, the percentage of reused MLs (i.e., the same ML used to label two or more mRNAs from the same gene) increases as the number of detected MLs increases, and it can therefore be estimated. Using a Poisson distribution (λ_non-unique), the number of reused MLs in well i (n_non-unique,i) is estimated from the ML reuse rate equation (Equation (6)). If the estimated number of reused MLs is greater than 5% of the total MLs of a given gene in well i, the gene in well i is labeled "high input". For such "high input" data, the MLs with the highest ML coverage are excluded from the distribution fitting (but retained for the later counting step) in order to obtain a better distribution fit.
P(X > 1 | λ_non-unique), where λ_non-unique = (number of MLs) / 6561    Equation (6)
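As a worked illustration of Equation (6), the sketch below evaluates the Poisson tail probability P(X > 1) = 1 − e^(−λ)(1 + λ) with λ = (number of MLs) / 6561. How this probability is converted into a count of reused MLs is not spelled out in the text, so the conversion used here (multiplying by the 6561 possible ML sequences) is an assumption, as are the function names `estimate_reused_mls` and `is_high_input`.

```python
import math


def estimate_reused_mls(n_ml, barcode_space=6561):
    """Estimate how many ML sequences label more than one molecule.

    n_ml: number of MLs detected for the gene in one well.
    """
    lam = n_ml / barcode_space                    # lambda_non-unique
    p_reuse = 1.0 - math.exp(-lam) * (1.0 + lam)  # Poisson P(X > 1)
    # Assumption: expected reused MLs = P(X > 1) * number of possible MLs.
    return barcode_space * p_reuse


def is_high_input(n_ml, barcode_space=6561, max_fraction=0.05):
    """Flag a gene/well whose estimated reuse exceeds 5% of its MLs."""
    if n_ml == 0:
        return False
    return estimate_reused_mls(n_ml, barcode_space) / n_ml > max_fraction


for n in (350, 1000):
    fraction = estimate_reused_mls(n) / n
    print(n, round(fraction, 4), is_high_input(n))
```

Under these assumptions the estimated reuse fraction grows roughly as λ/2, so it rises steadily with the number of detected MLs.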

低発現遺伝子のための擬似点の追加
ＭＬの固有の数が１０未満である場合、往々にして、データの希薄さのために分布を当てはめるのが難しくなる。この問題を改善するために、ＤＢＥＣは、分布当てはめを補助するために用いられる１％シグナルカウントの擬似点を追加するが、それでもなおデータに影響を与えない。
Addition of Pseudopoints for Low-Expression Genes
When the number of unique MLs is less than 10, it is often difficult to fit the distributions because the data are sparse. To mitigate this problem, DBEC adds 1% signal count pseudopoints, which are used to aid distribution fitting but do not affect the data.

パラメータの推定
２つのネガティブ二項分布を当てはめて、シグナルＭＬからエラーを区別するために、パラメータ推定のための２組の出発数値を概算する。エラー分布は、平均および１の散布を有するネガティブ二項分布であると想定される。
Parameter Estimation
To fit the two negative binomial distributions and distinguish errors from signal MLs, two sets of starting values for parameter estimation are approximated. The error distribution is assumed to be a negative binomial distribution with a mean and a dispersion of 1.

エラー／シグナル確率推定
シグナルおよびエラー分布をそれぞれＮｅｇａｔｉｖｅＢｉｎｏｍｉａｌ（μ_signal，ｓｉｚｅ_signal）およびＮｅｇａｔｉｖｅＢｉｎｏｍｉａｌ（μ_error，ｓｉｚｅ_error）として想定する。シグナルＭＬの数を小さい順に決定するために、所与のＭＬからのリードの数が、シグナルおよびエラー分布に由来する確率を、方程式（８）が満たされるまで計算し、ここで、先行するＭＬはすべて、エラーＭＬとみなされる。
Ｐ（Ｘ＝ｒ│μ＝μ_error，ｓｉｚｅ＝ｓｉｚｅ_error）＜Ｐ（Ｘ＝ｒ│μ＝μ_signal，ｓｉｚｅ＝ｓｉｚｅ_signal）　式（８）
Error/Signal Probability Estimation
The signal and error distributions are assumed to be NegativeBinomial(μ_signal, size_signal) and NegativeBinomial(μ_error, size_error), respectively. To determine the number of signal MLs, proceeding in ascending order, the probabilities that the number of reads from a given ML derives from the signal distribution and from the error distribution are calculated until Equation (8) is satisfied, at which point all preceding MLs are considered error MLs.
P(X = r | μ = μ_error, size = size_error) < P(X = r | μ = μ_signal, size = size_signal)    Equation (8)
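The comparison in Equation (8) can be sketched in Python once the two sets of negative binomial parameters are available. The sketch below does not fit the distributions; it takes (μ, size) for each distribution as given inputs. The function names `nb_pmf` and `split_error_signal`, the interpretation of "ascending order" as ascending read depth, and all numeric values are hypothetical. The probability mass function uses the mean/size parameterization, P(X = r) = Γ(r + size) / (Γ(size) r!) · (size / (size + μ))^size · (μ / (size + μ))^r.

```python
import math


def nb_pmf(r, mu, size):
    """Negative binomial P(X = r) with mean mu > 0 and dispersion size > 0."""
    log_p = (
        math.lgamma(r + size) - math.lgamma(size) - math.lgamma(r + 1)
        + size * math.log(size / (size + mu))
        + r * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def split_error_signal(read_depths, mu_error, size_error, mu_signal, size_signal):
    """Split MLs into (error depths, signal depths) using Equation (8).

    MLs are scanned in ascending read depth (an assumption); the first ML
    whose depth is more probable under the signal distribution starts the
    signal group, and all preceding MLs are treated as errors.
    """
    ordered = sorted(read_depths)
    for i, r in enumerate(ordered):
        if nb_pmf(r, mu_error, size_error) < nb_pmf(r, mu_signal, size_signal):
            return ordered[:i], ordered[i:]
    return ordered, []  # Equation (8) never satisfied: no signal MLs


# Hypothetical read depths and parameters, for illustration only.
depths = [1, 1, 2, 2, 3, 20, 25, 28, 30]
errors, signals = split_error_signal(depths, mu_error=1.5, size_error=1.0,
                                     mu_signal=25.0, size_signal=5.0)
print(len(errors), "error MLs;", len(signals), "signal MLs")
```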

全体として、この実施例は、分布ベースのエラー訂正を実施するための計算を示す。 Overall, this example shows a calculation for performing distribution-based error correction.

実施例１４
二次導関数に基づくＳＬエラーの調節
この実施例は、二次導関数に基づくＳＬエラーの調節を示す。
Example 14
Adjustment of SL Errors Based on the Second Derivative
This example shows adjustment of SL errors based on the second derivative.

図３２、パネル（ａ）～（ｅ）は、分子標識深度変化の二次導関数に基づくＰＣＲおよびシーケンシングエラーの訂正の例示的な結果を示す。図３２、パネル（ａ）は、ＳＬエラーおよびシグナルＭＬが、十分に分離されうることを示す。図３２、パネル（ｂ）および（ｄ）は、それぞれ、図３２、パネル（ｃ）および（ｅ）に示すＭＬカウントからの分子標識カウントの累積和を示す。図３２、パネル（ｂ）および（ｄ）中の縦線は、二次導関数の最大値の位置を示す。図３２、パネル（ｂ）および（ｄ）中の点線は、二次導関数の最大値の位置が、ＭＬカウント対ＭＬリード深度のプロットにおいてＭＬを分離し得ることを示す。 FIG. 32, panels (a)-(e), shows exemplary results of correcting PCR and sequencing errors based on the second derivative of the change in molecular label depth. FIG. 32, panel (a) shows that SL errors and signal MLs can be well separated. FIG. 32, panels (b) and (d) show the cumulative sums of molecular label counts derived from the ML counts shown in FIG. 32, panels (c) and (e), respectively. The vertical lines in FIG. 32, panels (b) and (d) indicate the position of the maximum of the second derivative. The dotted lines in FIG. 32, panels (b) and (d) indicate that the position of the maximum of the second derivative can separate the MLs in the plot of ML count versus ML read depth.

全体として、これらのデータは、ＭＬシグナルからＳＬエラーを分離するために、分子標識の二次導関数の最大値を用いることができることを明らかにする。 Overall, these data show that the maximum of the second derivative for the molecular labels can be used to separate SL errors from ML signals.

実施例１５
ＤＢＥＣに基づくＰＣＲおよびシーケンシングエラーの訂正
この実施例は、２つのネガティブ二項分布に基づくＰＣＲおよびシーケンシングエラーの訂正を示す。
Example 15
Correction of PCR and Sequencing Errors Based on DBEC
This example shows correction of PCR and sequencing errors based on two negative binomial distributions.

図３３、パネル（ａ）～（ｃ）は、ＣＤ６９について２つのネガティブ二項分布に基づくＰＣＲおよびシーケンシングエラーの訂正の例示的な結果を示す。図３３、パネル（ａ）は、図３３、パネル（ｂ）のＭＬ深度のヒストグラムに示すＭＬカウントデータでのＣＤ６９について２つのネガティブ二項分布（ノイズネガティブ二項分布のＤ_nと、シグナル二項分布のＤ_s）の当てはめを示す。図３３、パネル（ｂ）の点線は、図３３、パネル（ａ）に示す２つのネガティブ二項分布により決定されたＭＬシグナルおよびＳＬエラーの分離を示す。図３３、パネル（ｃ）の縦線は、リードの累積和プロットに基づいて決定される二次導関数の局所的最大値を示す。図３３と同様に、図３４、パネル（ａ）～（ｃ）は、ＣＤ３Ｅについての２つのネガティブ二項分布に基づくＰＣＲおよびシーケンシングエラーの訂正の例示的な結果を示す。 FIG. 33, panels (a)-(c), shows exemplary results of correcting PCR and sequencing errors for CD69 based on two negative binomial distributions. FIG. 33, panel (a) shows the fit of two negative binomial distributions (D_n, the noise negative binomial distribution, and D_s, the signal binomial distribution) to the CD69 ML count data shown in the histogram of ML depth in FIG. 33, panel (b). The dotted line in FIG. 33, panel (b) shows the separation of ML signals and SL errors determined by the two negative binomial distributions shown in FIG. 33, panel (a). The vertical line in FIG. 33, panel (c) shows the local maximum of the second derivative determined from the cumulative sum plot of reads. Similar to FIG. 33, FIG. 34, panels (a)-(c), shows exemplary results of correcting PCR and sequencing errors for CD3E based on two negative binomial distributions.

全体として、これらのデータは、ＤＢＥＣを用いて、ＰＣＲおよびシーケンシングエラーを訂正することができることを明らかにする。 Overall, these data reveal that DBEC can be used to correct PCR and sequencing errors.

実施例１６
ＭＬ再使用
この実施例は、高度発現遺伝子のためのＭＬ再使用、ならびに分布当てはめ前に高度発現遺伝子の入力データを調節する必要性を明らかにする。
Example 16
ML Reuse
This example demonstrates ML reuse for highly expressed genes and the need to adjust the input data for highly expressed genes before distribution fitting.

図３５、パネル（ａ）～（ｃ）は、高度発現遺伝子ＡＣＴＢについての２つのネガティブ二項分布に基づくＰＣＲおよびシーケンシングエラーの訂正の例示的な結果を示す。高度発現遺伝子は、過剰シーケンシングステータス（たとえば、１００以上のＭＬカバー率を有する）を有しうる。いくつかの実施形態では、高度発現遺伝子は、他の基準を用いて決定してもよい。図３５、パネル（ａ）において、縦線右側の分子標識は、高い深度に基づいて恐らく再使用されたＭＬに対応する。図３５、パネル（ｂ）は、分子標識を３つのカテゴリー（ＭＬエラー以外に）：ＳＬエラー、シグナルＭＬ、および恐らく再使用されたＭＬに区分することができることを概略的に示す。図３５、パネル（ｃ）は、恐らく再使用されたＭＬを調節せずに、当てはめられた２つのネガティブ二項分布は、理想的ではなかったことを実証する。 35, panels (a)-(c) show exemplary results of PCR and sequencing error correction based on two negative binomial distributions for the highly expressed gene ACTB. Highly expressed genes can have over-sequencing status (eg, have an ML coverage of 100 or greater). In some embodiments, the highly expressed gene may be determined using other criteria. In FIG. 35, panel (a), the molecular label to the right of the vertical line corresponds to the ML that was probably reused based on the high depth. FIG. 35, panel (b) schematically shows that molecular labeling can be divided into three categories (other than ML errors): SL errors, signal MLs, and possibly reused MLs. FIG. 35, panel (c) demonstrates that the fitted two negative binomial distributions were not ideal, probably without adjusting for the reused ML.

図３６は、高度発現遺伝子についてＧリッチ分子標識の再使用の例示的な結果を示す。図３６は、高度発現遺伝子ＧＡＰＤＨ、ＡＣＴＢ、およびＨＳＰ９０ＡＢ１について上位２０の高い深度ＭＬを示す。これらの高い深度ＭＬは、多数のＧおよびＴを有し、これらは、再使用される可能性が高く、バーコード付けは確率論的ではなかった。ＭＬ二重項は、確率標識を想定する理論計算値より早く起こった。ＡＣＴＢについては、ウェル当たり３５０ＭＬが存在した場合、理論上、２．７％の二重項があるはずであったが、実際の二重項は、４パーセント前後であった。 FIG. 36 shows exemplary results of the reuse of G-rich molecular labels for highly expressed genes. FIG. 36 shows the top 20 highest-depth MLs for the highly expressed genes GAPDH, ACTB, and HSP90AB1. These high-depth MLs contained many Gs and Ts; they were likely to be reused, and barcoding was not stochastic. ML doublets occurred earlier than predicted by theoretical calculations that assume stochastic labeling. For ACTB, with 350 MLs per well, there should theoretically have been 2.7% doublets, but the actual doublet rate was around 4%.

図３７、パネル（ａ）～（ｂ）は、２つのネガティブ二項分布を当てはめる前の、高度発現遺伝子についての入力データの調節の例示的な結果を示す。図３７、パネル（ａ）は、高度発現遺伝子について調節された、図３５、パネル（ａ）における入力データを示す。図３５、パネル（ｃ）における非理想的な分布当てはめとは対照的に、図３７、パネル（ｂ）は、当てはめられた２つのネガティブ二項分布を示す。 37, panels (a)-(b) show exemplary results of regulation of input data for highly expressed genes prior to fitting the two negative binomial distributions. 37, panel (a) shows the input data in FIG. 35, panel (a), regulated for highly expressed genes. In contrast to the non-ideal distribution fit in FIG. 35, panel (c), FIG. 37, panel (b) shows the two negative binomial distributions fitted.

全体として、これらのデータは、２つのネガティブ二項分布の当てはめの前に、高度発現遺伝子についてのシーケンシングデータから、再使用されたＭＬを除去する必要がありうることを示す。 Overall, these data indicate that reused MLs may need to be removed from the sequencing data for highly expressed genes before the two negative binomial distributions are fitted.

実施例１７
２つのネガティブ二項分布を用いたＭＬカウントの訂正
この実施例は、２つのネガティブ二項分布を用いて訂正された１０の標的のＭＬカウントを示す。
Example 17
Correction of ML Counts Using Two Negative Binomial Distributions
This example shows the ML counts of 10 targets corrected using two negative binomial distributions.

図３８、パネル（ａ）～（ｊ）は、２つのネガティブ二項分布を用いて訂正されたデータセットの非限定的な例示的検証を示す。図３８に示すように、１０の標的のＭＬカウントが訂正された。図３８の各パネルの縦線は、２つのネガティブ二項分布を用いて決定された、標的のＭＬシグナルおよびＳＬエラーの分離を示す。 38, panels (a)-(j), show a non-limiting exemplary validation of a data set corrected using two negative binomial distributions. As shown in FIG. 38, the ML counts for 10 targets have been corrected. The vertical lines in each panel of FIG. 38 show the separation of the target ML signal and SL error as determined using the two negative binomial distributions.

全体として、これらのデータは、２つのネガティブ二項分布を用いたＭＬカウントの訂正を検証するものである。 Overall, these data validate ML count corrections using two negative binomial distributions.

実施例１８
混合されたＪｕｒｋａｔおよび乳癌（ＢｒＣａ）単一細胞の９６ウェルからのＢＤＰｒｅｃｉｓｅ（商標）ＴａｒｇｅｔｅｄＡｓｓａｙのｔ－確率的近傍埋込み視覚化
この実施例は、混合されたＪｕｒｋａｔおよび乳癌（ＢｒＣａ）単一細胞についての再帰的置換エラー訂正および分布ベースのエラー訂正に基づいてＰＣＲおよびシーケンシングエラーを訂正する方法を示す。
Example 18
t-Distributed Stochastic Neighbor Embedding Visualization of a BD Precise™ Targeted Assay from 96 Wells of Mixed Jurkat and Breast Cancer (BrCa) Single Cells
This example shows how to correct PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction for mixed Jurkat and breast cancer (BrCa) single cells.

図３９、パネル（ａ）～（ｄ）は、混合されたＪｕｒｋａｔおよび乳癌（ＢｒＣａ）単一細胞の９６ウェルからのＢＤＰｒｅｃｉｓｅ（商標）ＴａｒｇｅｔｅｄＡｓｓａｙの例示的なｔ－確率的近傍埋込み（ｔ－ＳＮＥ）視覚化を示す（８６の被検遺伝子）。図３９、パネル（ａ）は、ＭＬ調節前および後の同じパラメータを有するＤＢＳｃａｎを用いて、細胞クラスターを同定したことを示す。図３９、パネル（ｂ）～（ｄ）は、色および点サイズの両方により評価される個々のマーカー発現を示す。図３９、パネル（ｂ）は、ＰＳＭＢ４、すなわち、両細胞型中に、およびＭＬ調節後に存在するハウスキーピング遺伝子を示し、ＰＳＭＢ４シグナルの欠如は、「低シグナル」クラスター中でさらに強調される。図３９、パネル（ｃ）は、ＣＤ３Ｅ、すなわち、Ｊｕｒｋａｔ細胞クラスターを強調するリンパ球マーカーを示す。図３９、パネル（ｄ）は、ＣＤＨ１、すなわち、ＢｒＣａクラスターを強調する上皮細胞マーカーを示す。 39, panels (a)-(d) are exemplary t-probabilistic neighborhood implants of BD Precise ™ Targeted Assay from 96 wells of mixed Jurkat and breast cancer (BrCa) single cells. SNE) Shows visualization (86 test genes). FIG. 39, panel (a) shows that cell clusters were identified using DBScan with the same parameters before and after ML regulation. 39, panels (b)-(d) show individual marker expression assessed by both color and point size. FIG. 39, panel (b) shows PSMB4, a housekeeping gene present in both cell types and after ML regulation, and the lack of PSMB4 signal is further accentuated in the "low signal" cluster. FIG. 39, panel (c) shows CD3E, a lymphocyte marker that emphasizes Jurkat cell clusters. FIG. 39, panel (d), shows CDH1, an epithelial cell marker that emphasizes BrCa clusters.

全体として、これらのデータは、ＭＬ調節によってＭＬノイズが除去され、これにより、細胞クラスター間の遺伝子発現の明瞭な区別が可能になったことを実証するものである。 Overall, these data demonstrate that ML adjustment removed ML noise, which allowed a clear distinction of gene expression between cell clusters.

実施例１９
細胞クラスター間の差異発現分析
この実施例は、低シグナル細胞および乳癌（ＢｒＣａ）細胞についての再帰的置換エラー訂正および分布ベースのエラー訂正に基づいてＰＣＲおよびシーケンシングエラーを訂正する方法を示す。
Example 19
Differential Expression Analysis Between Cell Clusters
This example shows how to correct PCR and sequencing errors based on recursive substitution error correction and distribution-based error correction for low-signal cells and breast cancer (BrCa) cells.

図４０、パネル（ａ）～（ｂ）は、各々のクラスターでＤＢＳｃａｎにより計算され、かつ遺伝子マーカーレベルによって決定された、両方の選択クラスターにおいて＞０ＭＬを有する遺伝子について細胞クラスター間の差異発現分析を示す非限定的な例示的プロットである。図４０、パネル（ａ）は、残りの細胞と比較した「低シグナル」クラスター遺伝子発現を示す。図４０、パネル（ａ）の上部は、未補正ＭＬ比較を示し、これによって、他の細胞において高い平均発現を有する遺伝子ほど、ＭＬノイズが概して高いことがわかる。図４０、パネル（ａ）の下部は、ＲＳＥＣおよびＤＢＥＣを用いたＭＬ調節後に、「低シグナル」クラスター中に検出されたＭＬノイズが低減し、クラスター間の遺伝子発現の明瞭な識別を可能にすることを示す。図４０、パネル（ｂ）は、残りの細胞と比較した「ＢｒＣａ」クラスター遺伝子発を示す。図４０、パネル（ｂ）の上部は、非ＢｒＣａ細胞中の未補正ＭＬも、ＫＲＴ１、ＭＵＣ１などのＢｒＣａマーカーの有意なＭＬカウントを有したことを示す。図４０、パネル（ｂ）の下部は、ＢｒＣａマーカーの調節されたＭＬが、ＢｒＣａクラスター中で、残りの細胞よりも極めて豊富であったことを示す。 40, panels (a)-(b) show the difference expression analysis between cell clusters for genes with> 0 ML in both selected clusters, calculated by DBScan in each cluster and determined by the genetic marker level. It is a non-limiting exemplary plot shown. FIG. 40, panel (a), shows "low signal" cluster gene expression compared to the remaining cells. In FIG. 40, the upper part of the panel (a) shows an uncorrected ML comparison, which shows that genes with higher average expression in other cells generally have higher ML noise. FIG. 40, lower part of panel (a) reduces ML noise detected in "low signal" clusters after ML regulation with RSEC and DBEC, allowing clear identification of gene expression between clusters. Show that. FIG. 40, panel (b), shows the development of the "BrCa" cluster gene compared to the remaining cells. FIG. 40, upper part of panel (b) shows that uncorrected MLs in non-BrCa cells also had significant ML counts for BrCa markers such as KRT1, MUC1. In FIG. 40, the lower part of the panel (b) shows that the regulated ML of the BrCa marker was much more abundant in the BrCa cluster than the remaining cells.

全体として、これらのデータは、低シグナル細胞および乳癌細胞などの細胞の場合、再帰的置換エラー訂正および分布ベースのエラー訂正に基づいてＰＣＲおよびシーケンシングエラーを訂正することができることを示す。 Overall, these data show that, for cells such as low-signal cells and breast cancer cells, PCR and sequencing errors can be corrected based on recursive substitution error correction and distribution-based error correction.

実施例２０
混合ＪｕｒｋａｔおよびＴ４７Ｄ細胞の分子標識の調節
この実施例は、混合ＪｕｒｋａｔおよびＴ４７Ｄ細胞の分子標識を調節する方法を示す。
Example 20
Adjustment of Molecular Labels of Mixed Jurkat and T47D Cells
This example shows a method of adjusting the molecular labels of mixed Jurkat and T47D cells.

図４１、パネル（ａ）～（ｄ）は、８６の被検遺伝子を含む混合Ｊｕｒｋａｔおよび乳癌（Ｔ４７Ｄ）単一細胞の９６ウェルからのＢＤＰｒｅｃｉｓｅ（商標）ＴａｒｇｅｔｅｄＡｓｓａｙのｔ－確率的近傍埋込み視覚化を示す非限定的な例示的プロットである。図４１、パネル（ａ）は、ＭＬ調節前および後に同じパラメータを有するＤＢＳｃａｎを用いて、細胞クラスターを同定したことを示す。図４１、パネル（ｂ）～（ｄ）は、色および点サイズの両方によって評価される個々のマーカー発現を示す。図４１、パネル（ｂ）は、ＰＳＭＢ４、すなわち、両細胞型中に、およびＭＬ調節後に存在するハウスキーピング遺伝子の評価を示す。ＰＳＭＢ４シグナルの欠如は、テンプレートなし対照（ＮＴＣ）クラスターにおいてさらに強調される。図４１、パネル（ｃ）は、ＣＤ３Ｅ、すなわち、Ｊｕｒｋａｔ細胞クラスターを強調するリンパ球マーカーの評価を示す。図４１、パネル（ｄ）は、ＣＤＨ１、すなわち、Ｔ４７Ｄクラスターを強調する上皮細胞マーカーの評価を示す。 41, panels (a)-(d) show t-stochastic near-implantation visuals of BD Precise ™ Targeted Assay from 96 wells of mixed Jurkat and breast cancer (T47D) single cells containing 86 test genes. It is a non-limiting exemplary plot showing the transformation. FIG. 41, panel (a) shows that cell clusters were identified using DBScan with the same parameters before and after ML regulation. 41, panels (b)-(d), show individual marker expression assessed by both color and point size. FIG. 41, panel (b) shows the evaluation of PSMB4, a housekeeping gene present in both cell types and after ML regulation. The lack of PSMB4 signals is further accentuated in untemplated control (NTC) clusters. FIG. 41, panel (c), shows the evaluation of CD3E, a lymphocyte marker that emphasizes Jurkat cell clusters. FIG. 41, panel (d), shows the evaluation of CDH1, an epithelial cell marker that emphasizes T47D clusters.

図４２、パネル（ａ）～（ｂ）は、エラー訂正工程前（図４２、パネル（ａ）に示す未補正ＭＬ）ならびにＲＳＥＣおよびＤＢＥＣ訂正後（図４２、パネル（ｂ）に示す調節ＭＬ）に、図４１で同定されたさまざまな細胞クラスター間の分子標識カウントによる差異遺伝子発現を表示する非限定的な例示的ヒートマップである。発現の低かった遺伝子は青色で、発現が高かった遺伝子はオレンジ色である。これらの細胞型の間で遺伝子発現が類似する遺伝子は、互いにクラスター化する。エラー訂正がない場合、ＮＴＣは、ＣＤ３ＥおよびＫＲＴ１８（それぞれ、ＪｕｒｋａｔおよびＴ４７Ｄマーカーである）などの高度発現遺伝子に由来するノイズを有した。さらに、エラー訂正は、ＪｕｒｋａｔとＴ４７Ｄとの間で識別可能な遺伝子発現パターンを明らかにした。 42, panels (a) to (b) are before the error correction step (uncorrected ML shown in FIG. 42, panel (a)) and after RSEC and DBEC correction (adjustment ML shown in FIG. 42, panel (b)). In addition, it is a non-limiting exemplary heat map showing the difference gene expression by molecular labeling count between the various cell clusters identified in FIG. 41. Genes with low expression are blue and genes with high expression are orange. Genes with similar gene expression among these cell types cluster together. In the absence of error correction, NTC had noise from highly expressed genes such as CD3E and KRT18 (Jurkat and T47D markers, respectively). In addition, error correction revealed a gene expression pattern that could be distinguished between Jurkat and T47D.

全体として、これらのデータは、ＭＬ調節が、ＭＩノイズを除去することができ、これによって、細胞クラスター間の遺伝子発現の明瞭な区別が可能になることを実証するものである。 Overall, these data demonstrate that ML adjustment can remove MI noise, thereby allowing a clear distinction of gene expression between cell clusters.

以上に記載の実施形態の少なくともいくつかでは、実施形態で使用される１つ以上のエレメントは、他の実施形態で互換的に使用可能である。ただし、かかる交換が技術的に実現可能である場合に限る。特許請求された主題の範囲から逸脱することなく、以上に記載の方法および構造に種々の他の省略、追加、および変更を行いうることは、当業者であれば分かるであろう。かかる変更および変化はすべて、添付の特許請求の範囲に規定される主題の範囲内に含まれることが意図される。 In at least some of the embodiments described above, the one or more elements used in the embodiment can be used interchangeably with other embodiments. However, only if such replacement is technically feasible. Those skilled in the art will appreciate that various other omissions, additions, and modifications to the methods and structures described above may be made without departing from the scope of the claimed subject matter. All such changes and changes are intended to be within the scope of the subject matter set forth in the appended claims.

本明細書に記載の実質的に任意の複数形および／または単数形の用語の使用に関連して、文脈上および／または適用上適切であれば、当業者は複数形から単数形へおよび／または単数形から複数形への変換が可能である。明確にするために種々の単数形／複数形の入替えを本明細書に明示的に記述しうる。本明細書および添付の特許請求の範囲で用いられる場合、特に文脈上明確に規定されていない限り、単数形の「ａ」、「ａｎ」、および「ｔｈｅ」には、複数の参照語が包含される。本明細書での「ｏｒ（または）」の意味はいずれも、特に明記されていない限り、「ａｎｄ／ｏｒ（および／または）」を包含することが意図される。 As appropriate in the context and / or application in connection with the use of substantially any plural and / or singular term described herein, those skilled in the art may from plural to singular and /. Alternatively, conversion from the singular to the plural is possible. Various singular / plural interchanges may be explicitly described herein for clarity. As used herein and in the appended claims, the singular "a," "an," and "the" include multiple references, unless expressly specified in the context. Will be done. Any meaning of "or (or)" herein is intended to include "and / or (and / or)" unless otherwise stated.

一般的には、本明細書特に添付の特許請求の範囲（たとえば添付の特許請求の範囲の本文）で用いられる用語は「オープン」用語であることが一般に意図されることは当業者であれば理解されよう（たとえば、「ｉｎｃｌｕｄｉｎｇ（～を含む）」という用語は「～を含むがこれらに限定されるものではない」と解釈すべきであり、「ｈａｖｉｎｇ（～を有する）」という用語は「少なくとも～を有する」と解釈すべきであり、「ｉｎｃｌｕｄｅｓ（～を含む）」という用語は「～を含むがこれらに限定されるものではない」と解釈すべきであるなど）。さらに、導入クレームレシテーションの特定数が意図される場合、かかる意図は請求項で明示的にリサイトされ、かかるレシテーションの不在下ではかかる意図は存在しないことは当業者であれば理解されよう。たとえば、理解の一助として、以下の添付の特許請求の範囲は、クレームレシテーションを導入するために導入語句「ａｔｌｅａｓｔｏｎｅ（少なくとも１つ）」および「ｏｎｅｏｒｍｏｒｅ（１つ以上）」の使用を含みうる。しかしながら、かかる語句が用いられたとしても、不定冠詞「ａ」または「ａｎ」によるクレームレシテーションの導入が、かかる導入クレームレシテーションを含む任意の特定の請求項を、一方のかかるレシテーションを含む実施形態のみに限定することを意味するものと解釈すべきでない。たとえ同一の請求項が導入語句「ｏｎｅｏｒｍｏｒｅ（１つ以上）」または「ａｔｌｅａｓｔｏｎｅ（少なくとも１つ）」と不定冠詞たとえば「ａ」または「ａｎ」とを含む場合でさえも、そのように解釈すべきでない（たとえば、「ａ」および／または「ａｎ」は「ａｔｌｅａｓｔｏｎｅ（少なくとも１つ）」または「ｏｎｅｏｒｍｏｒｅ（１つ以上）」を意味するものと解釈すべきである）。定冠詞を用いてクレームレシテーションを導入する場合にも、同じことが当てはまる。そのほかに、たとえ特定数の導入クレームレシテーションが明示的にリサイトされたとしても、かかるレシテーションは少なくともリサイトされた数を意味すると解釈すべきであることは当業者であれば分かるであろう（たとえば、「２つのレシテーション」という他の修飾語を含まないベアのレシテーションは、少なくとも２つのレシテーションまたは２つ以上レシテーションを意味する）。さらに、「Ａ、Ｂ、およびＣの少なくとも１つ」に類似した条件が用いられる場合、一般的には、かかる構成は当業者がその条件を理解する意味であることが意図される（たとえば、「Ａ、Ｂ、およびＣの少なくとも１つを有する系」は、限定されるものではないが、Ａ単独、Ｂ単独、Ｃ単独、ＡとＢの両方、ＡとＣの両方、ＢとＣの両方、および／またはＡとＢとＣの全部などを有する系を含であろう）。「Ａ、Ｂ、またはＣの少なくとも１つなど」に類似した条件が用いられる場合、一般的には、かかる構成は当業者がその条件を理解する意味であることが意図される（たとえば、「Ａ、Ｂ、またはＣの少なくとも１つを有する系」は、限定されるものではないが、Ａ単独、Ｂ単独、Ｃ単独、ＡとＢの両方、ＡとＣの両方、ＢとＣの両方、および／またはＡとＢとＣの全部などを有する系を含であろう）。さらに、２つ以上の代替用語を表す実質上任意の選言的な語および／または語句は、明細書、請求項、または図面にかかわらず、用語の１つ、用語のいずれか、または用語の両方を含む可能性が企図されると理解すべきであることは当業者であれば理解されよう。たとえば、「ＡまたはＢ」という語句は「Ａ」または「Ｂ」または「ＡおよびＢ」の可能性を含むものと理解されよう。 In general, it is generally intended by those skilled in the art that the terms used herein, in particular in the appended claims (eg, the text of the appended claims), are "open" terms. As will be understood (for example, the term "inclusion" should be interpreted as "including but not limited to" and the term "having" should be interpreted as "having". It should be interpreted as "having at least", and the term "includes" should be interpreted as "including but not limited to"). 
Further, those skilled in the art will understand that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases "at least one" and "one or more" to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim recitation to embodiments containing only one such recitation, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an" (e.g., "a" and/or "an" should be interpreted to mean "at least one" or "one or more"). The same holds true for the use of definite articles to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number (e.g., the bare recitation of "two recitations", without other modifiers, means at least two recitations, or two or more recitations). Furthermore, where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense in which one skilled in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include, but not be limited to, systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense in which one skilled in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include, but not be limited to, systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Those skilled in the art will further understand that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase "A or B" will be understood to include the possibilities of "A" or "B" or "A and B".

そのほかに、本開示の特徴または態様がマーカッシュグループにより記述される場合、それにより、本開示は、マーカッシュグループの任意の個別のメンバーまたはメンバーのサブグループにより記述されることは当業者であれば分かるであろう。 In addition, if the features or aspects of the disclosure are described by a Markush group, it will be appreciated by those skilled in the art that the disclosure will be described by any individual member or subgroup of members of the Markush group. Will.

当業者であれば理解されるであろうが、あらゆる目的で、たとえば、明細書の提供に関して、本明細書に開示された範囲はすべて、あらゆる可能なサブ範囲およびそのサブ範囲の組合せをも包含する。いずれの列挙された範囲も、十分に記述されたものとしてかつその範囲が少なくとも２等分、３等分、４等分、５等分、１０等分などされうるものとして容易に認識可能である。たとえば、限定されるものではないが、本明細書で考察した各範囲は、下３分の１、中３分の１、上３分の１に容易に分解可能である。同様に、当業者であれば理解されるであろうが、「～まで」、「少なくとも～」、「～超」、「～未満」などの表現はすべて、リサイトされた数を含み、以上で考察したように後続的にサブ範囲に分解可能な範囲を意味する。最終的に、当業者であれば理解されるであろうが、範囲は各個別のメンバーを含む。したがって、たとえば、１～３個の物品を有するグループは、１、２、または３個の物品を有するグループを意味する。同様に、１～５個の物品を有するグループは、１、２、３、４、または５個の物品を有するグループを意味し、他も同様である。 As will be appreciated by those skilled in the art, all scopes disclosed herein include all possible subranges and combinations of subranges thereof for any purpose, eg, with respect to the provision of the specification. do. Any of the listed ranges can be easily recognized as being well described and having at least two equal parts, three equal parts, four equal parts, five equal parts, ten equal parts and the like. .. For example, without limitation, each range considered herein can be easily decomposed into a lower third, a middle third, and an upper third. Similarly, as those skilled in the art will understand, expressions such as "to", "at least", "super", "less than" all include the number of sites that have been resited. As discussed, it means a range that can be subsequently decomposed into sub-ranges. Ultimately, as will be appreciated by those skilled in the art, the scope includes each individual member. Thus, for example, a group with 1 to 3 articles means a group with 1, 2 or 3 articles. Similarly, a group with 1 to 5 articles means a group with 1, 2, 3, 4, or 5 articles, and so on.

種々の態様および実施形態を本明細書に開示してきたが、他の態様および実施形態は当業者には自明であろう。本明細書に開示される種々の態様および実施形態は、例示を目的としたものであり、限定を意図したものではなく、真の範囲および趣旨は、以下の特許請求の範囲により示される。
なお、本発明としては、以下の態様も好ましい。
〔１〕標的の数を決定する方法であって、
（ａ）複数の確率バーコードを用いて、複数の標的に確率バーコードを付けて、複数の確率バーコード付き標的を生成する工程であって、前記複数の確率バーコードの各々が分子標識を含む工程と；
（ｂ）前記確率バーコード付き標的のシーケンシングデータを取得する工程と；
（ｃ）前記複数の標的の１つ以上について：
（ｉ）前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数をカウントする工程と；
（ｉｉ）方向近接性を用いて、前記標的の分子標識のクラスターを同定する工程と；
（ｉｉｉ）（ｉｉ）で同定された前記標的の分子標識の前記クラスターを用いて、（ｂ）で得られた前記シーケンシングデータを折りたたむ工程と；
（ｉｖ）前記標的の数を推定する工程であって、推定された前記標的の数が、（ｉｉ）の前記シーケンシングデータの折りたたみ後に、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数と相関する工程と、
を含む、方法。
〔２〕前記複数の標的が、細胞の全トランスクリプトームの標的を含む、〔１〕に記載の方法。
〔３〕クラスター内の前記標的の分子標識が、互いの所定の方向近接性閾値内にある、〔１〕～〔２〕のいずれか一項に記載の方法。
〔４〕前記方向近接性閾値が、１のハミング距離である、〔３〕に記載の方法。
〔５〕前記クラスター内の前記標的の前記分子標識が、１つ以上の親分子標識と、前記１つ以上の親分子標識の子供分子標識とを含み、前記親分子標識の発生数が、所定の方向近接性発生数閾値以上である、〔１〕～〔４〕のいずれか一項に記載の方法。
〔６〕前記所定の方向近接性発生数閾値が、２×（子供分子標識の発生数）－１である、〔５〕に記載の方法。
〔７〕（ｉｉ）で同定された前記標的の分子標識の前記クラスターを用いて、（ｂ）で得られた前記シーケンシングデータを折りたたむ工程が、
前記子供分子標識の発生数を前記親分子標識に帰属させる工程
を含む、〔１〕～〔６〕のいずれか一項に記載の方法。
〔８〕前記標的のシーケンシング深度を決定する工程をさらに含む、〔１〕～〔７〕のいずれか一項に記載の方法。
〔９〕前記標的の前記シーケンシング深度が所定のシーケンシング深度閾値を超える場合、前記標的の数を推定する工程が、（ｉ）でカウントされた前記シーケンシングデータを調節する工程を含む、〔８〕に記載の方法。
〔１０〕前記所定のシーケンシング深度閾値が、１５～２０である、〔９〕に記載の方法。
〔１１〕（ｉ）でカウントされた前記シーケンシングデータを調節する工程が、
前記標的の分子標識を閾値化して、（ｂ）で得られた前記シーケンシングデータ中の前記標的に関連付けられた真の分子標識および偽の分子標識を決定する工程
を含む、〔９〕～〔１０〕のいずれか一項に記載の方法。
〔１２〕前記標的の前記分子標識を閾値化する工程が、前記標的の前記分子標識について統計解析を実施する工程を含む、〔１１〕に記載の方法。
〔１３〕前記統計解析を実施する工程が、
前記標的の前記分子標識の分布およびそれらの発生数を２つのネガティブ二項分布に当てはめる工程と；
前記２つのネガティブ二項分布を用いて真の分子標識の数ｎを決定する工程と；
（ｂ）で得られた前記シーケンシングデータから前記偽の分子標識を除去する工程と、
を含み、
前記偽の分子標識が、ｎ番目に豊富な分子標識の発生数よりも低い発生数を有する分子標識を含み、前記真の分子標識が、ｎ番目に豊富な分子標識の発生数以上の発生数を有する分子標識を含む、〔１２〕に記載の方法。
〔１４〕前記ネガティブ二項分布が、前記真の分子標識に対応する第１のネガティブ二項分布と、前記偽の分子標識に対応する第２のネガティブ二項分布を含む、〔１３〕に記載の方法。
〔１５〕標的の数を決定する方法であって、
（ａ）複数の確率バーコードを用いて、複数の標的に確率バーコードを付けて、複数の確率バーコード付き標的を生成する工程であって、前記複数の確率バーコードの各々が分子標識を含む工程と；
（ｂ）前記確率バーコード付き標的のシーケンシングデータを取得する工程と；
（ｃ）前記複数の標的の１つ以上について：
（ｉ）前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数をカウントする工程と；
（ｉｉ）前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有するノイズ分子標識の数を決定する工程と；
（ｉｉｉ）前記標的の数を推定する工程と、
を含み、
推定された前記標的の数が、（ｉｉ）で決定された前記ノイズ分子標識の数に応じて調節された、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた前記識別可能な配列を有する分子標識の数と相関する、方法。
〔１６〕前記シーケンシングデータ中の前記標的のシーケンシングステータスを決定する工程をさらに含む、〔１５〕に記載の方法。
〔１７〕前記シーケンシングデータ中の前記標的の前記シーケンシングステータスが、飽和シーケンシング、過少シーケンシング、または過剰シーケンシングである、〔１６〕に記載の方法。
〔１８〕前記飽和シーケンシングステータスが、所定の飽和閾値よりも大きい、識別可能な配列を含む分子標識の数を有する前記標的によって決定される、〔１７〕に記載の方法。
〔１９〕前記確率バーコードが、識別可能な配列を有する約６５６１の分子標識を含む場合、前記所定の飽和閾値が、約６５５７である、〔１８〕に記載の方法。
〔２０〕前記確率バーコードが、識別可能な配列を有する約６５５３６の分子標識を含む場合、前記所定の飽和閾値が、約６５５３２である、〔１８〕～〔１９〕のいずれか一項に記載の方法。
〔２１〕前記シーケンシングデータ中の前記標的の前記シーケンシグステータスが、前記飽和シーケンシングステータスである場合、（ｉｉ）で決定された前記ノイズ分子標識の数が、ゼロである、〔１７〕～〔２０〕のいずれか一項に記載の方法。
〔２２〕前記過少シーケンシングステータスが、所定の過少シーケンシング閾値より小さい深度を有する前記標的によって決定され、前記対象の前記深度が、前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の、平均、最小、または最大深度を含む、〔１７〕～〔２１〕のいずれか一項に記載の方法。
〔２３〕前記過少シーケンシング閾値が約４である、〔２２〕に記載の方法。
〔２４〕前記過少シーケンシング閾値は、識別可能な配列を有する前記分子標識の数とは無関係である、〔２３〕に記載の方法。
〔２５〕前記シーケンシングデータ中の前記標的の前記シーケンシグステータスが、前記過少シーケンシングステータスである場合、（ｉｉ）で決定された前記ノイズ分子標識の数が、ゼロである、〔１７〕～〔２４〕のいずれか一項に記載の方法。
〔２６〕前記過剰シーケンシングステータスが、所定の過剰シーケンシング閾値より大きい深度を有する前記標的によって決定され、前記対象の前記深度が、前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の、平均、最小、または最大深度を含む、〔１７〕～〔２５〕のいずれか一項に記載の方法。
〔２７〕前記確率バーコードが、識別可能な配列を有する約６５６１の分子標識を含む場合、前記過剰シーケンシング閾値が、約２５０である、〔２６〕に記載の方法。
〔２８〕前記シーケンシングデータ中の前記標的の前記シーケンシングテータスが、前記過剰シーケンシングステータスである場合、
前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数を、前記所定の過剰シーケンシング閾値にサブサンプリングする工程
をさらに含む、〔２６〕～〔２７〕のいずれか一項に記載の方法。
〔２９〕前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記ノイズ分子標識の数を決定する工程が、
ネガティブ二項分布当てはめ条件が満たされる場合、
（ｉｖ）シグナルネガティブ二項分布を、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数に当てはめる工程であって、前記シグナルネガティブ二項分布が、シグナル分子標識である、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数に対応するステップと；
（ｖ）ノイズネガティブ二項分布を、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数に当てはめる工程であって、前記ノイズネガティブ二項分布が、ノイズ分子標識である、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数に対応する工程と；
（ｖｉ）（ｖ）で当てはめた前記シグナルネガティブ二項分布および（ｖｉ）で当てはめた前記ノイズネガティブ二項分布を用いて、前記ノイズ分子標識の数を決定する工程と、を含む、
〔１７〕～〔２８〕のいずれか一項に記載の方法。
〔３０〕前記ネガティブ二項分布当てはめ条件が、前記シーケンシングデータ中の前記標的の前記シーケンシングステータスが、前記過少シーケンシングステータスまたは前記過剰シーケンシングステータスではないことを含む、〔２９〕に記載の方法。
〔３１〕（ｖ）で当てはめた前記シグナルネガティブ二項分布および（ｖｉ）で当てはめた前記ノイズネガティブ二項分布を用いて、前記ノイズ分子標識の数を決定する工程が、
前記シーケンシングデータ中の前記標的に関連付けられた前記識別可能な配列の各々について、
前記識別可能な配列のシグナル確率が、前記シグナルネガティブ二項分布であることを決定する工程と；
前記識別可能な配列のノイズ確率が、前記ノイズネガティブ二項分布であることを決定する工程と；
前記シグナル確率が前記ノイズ確率より小さければ、前記識別可能な配列がノイズ分子標識であることを決定する工程と、
を含む、〔２９〕～〔３０〕のいずれか一項に記載の方法。
〔３２〕前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記ノイズ分子標識の数を決定する工程が、
前記シーケンシングデータ中の前記標的の前記シーケンシングステータスが、前記過少シーケンシングステータスまたは前記過剰シーケンシングステータスではなく、かつ、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数が、擬似点閾値より少ない場合、（ｉｉ）で前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記ノイズ分子標識の数を決定する前に、前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数に擬似点を加える工程を含む、
〔１７〕～〔３１〕のいずれか一項に記載の方法。
〔３３〕前記擬似点閾値が１０である、〔３２〕に記載の方法。
〔３４〕前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記ノイズ分子標識の数を決定する工程が、
前記シーケンシングデータ中の前記標的の前記シーケンシングステータスが、前記過少シーケンシングステータスまたは前記過剰シーケンシングステータスではなく、かつ、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数が、擬似点閾値以上である場合、（ｉｉ）で前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記ノイズ分子標識の数を決定する際に、非ユニーク分子標識を除去する工程を含む、
〔１７〕～〔３３〕のいずれか一項に記載の方法。
〔３５〕前記非ユニーク分子標識を除去する工程が、前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数が、所定の再使用分子標識閾値より大きい場合、（ｉｉ）で前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記ノイズ分子標識の数を決定する際に、前記非ユニーク分子標識を除去する工程を含む、〔３４〕に記載の方法。
〔３６〕前記確率バーコードが、識別可能な配列を有する約６５６１の分子標識を含む場合、前記再使用分子標識閾値が、約６５０である、〔３５〕に記載の方法。
〔３７〕前記非ユニーク分子標識を除去する工程が、
前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数について非ユニーク分子標識の理論上の数を決定する工程と；
前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有するｎ番目に豊富な前記分子標識よりも大きい発生数を有する分子標識を除去する工程と、
を含み、
ｎが、非ユニーク分子標識の理論数である、〔３４〕～〔３６〕のいずれか一項に記載の方法。
〔３８〕ハードウェアプロセッサーと、
前記ハードウェアプロセッサーによって実行される場合、前記プロセッサーに〔１〕～〔３７〕のいずれか一項に記載の方法を実行させる命令を記憶した非一過性メモリーと、
を含む、ターゲットの数を決定するためのコンピュータシステム。
〔３９〕〔１〕～〔３７〕のいずれか一項に記載の方法を実行するためのコードを含むソフトウェアプログラムを含む、コンピュータ読取り媒体。
〔４０〕標的の数を決定する方法であって、
（ａ）複数の確率バーコードを用いて、複数の標的に確率バーコードを付けて、複数の確率バーコード付き標的を生成する工程であって、前記複数の確率バーコードの各々が分子標識を含む工程と；
（ｂ）前記確率バーコード付き標的のシーケンシングデータを取得する工程と；
（ｃ）前記複数の標的の１つ以上について：
（ｉ）前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数をカウントする工程と；
（ｉｉ）（ｂ）で得られた前記シーケンシングデータ中の前記標的のクオリティステータスを決定する工程と；
（ｉｉｉ）（ｂ）で得られた前記シーケンシングデータ中の１つ以上のシーケンシングデータエラーを決定する工程であって、前記シーケンシングデータ中の前記１つ以上のシーケンシングデータエラーを決定する工程が、以下：前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数、前記シーケンシングデータ中の前記標的の前記クオリティステータス、および前記複数の確率バーコード中の識別可能な配列を有する前記分子標識の数のうち１つ以上を決定することを含む工程と；
（ｉｖ）前記標的の数を推定する工程であって、推定された前記標的の数が、（ｉｉｉ）で決定された前記１つ以上のシーケンシングデータエラーに応じて調節された、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数と相関する工程と、
を含む、方法。
〔４１〕前記１つ以上のシーケンシングデータエラーを決定する前に、（ｂ）で得られた前記シーケンシングデータを折りたたむ工程
をさらに含む、〔４０〕に記載の方法。
〔４２〕（ｂ）で得られた前記シーケンシングデータを折りたたむ工程が、
類似した分子標識を有し、かつ、所定の折りたたみ発生数閾値よりも少ない発生数を有する標的のコピーを、前記複数の標的について同じ分子標識を有するものとして帰属させる工程を含み、前記標的の２つのコピーは、前記標的の前記２つのコピーの分子標識の配列が少なくとも１塩基相違する場合、類似の分子標識を有する、
〔４１〕に記載の方法。
〔４３〕前記確率バーコードが、識別可能な配列を有する約６５６１の分子標識を含む場合、前記所定の折りたたみ発生数閾値が７である、〔４２〕に記載の方法。
〔４４〕前記確率バーコードが、識別可能な配列を有する約６５５３６の分子標識を含む場合、前記所定の折りたたみ発生数閾値が１７である、〔４２〕に記載の方法。
〔４５〕前記標的の２つのコピーが、前記標的の前記２つのコピーの分子標識の配列が少なくとも１塩基相違する場合、類似の分子標識を有する、〔４２〕～〔４４〕のいずれか一項に記載の方法。
〔４６〕前記分子標識が、５～２０個のヌクレオチドを含む、〔４０〕～〔４５〕のいずれか一項に記載の方法。
〔４７〕異なる確率バーコードの前記分子標識が、互いに異なっている、〔４０〕～〔４６〕のいずれか一項に記載の方法。
〔４８〕前記複数の確率バーコードが、識別可能な配列を有する約６５６１の分子標識を含む、〔４０〕～〔４７〕のいずれか一項に記載の方法。
〔４９〕前記複数の確率バーコードが、識別可能な配列を有する約６５５３６の分子標識を含む、〔４０〕～〔４７〕のいずれか一項に記載の方法。
〔５０〕前記シーケンシングデータが、５０ヌクレオチド以上のリード長を有する前記複数の標的の配列を含む、〔４０〕～〔４９〕のいずれか一項に記載の方法。
〔５１〕前記シーケンシングデータが、７５ヌクレオチド以上のリード長を有する前記複数の標的の配列を含む、〔４０〕～〔４９〕のいずれか一項に記載の方法。
〔５２〕前記シーケンシングデータが、１００ヌクレオチド以上のリード長を有する前記複数の標的の配列を含む、〔４０〕～〔４９〕のいずれか一項に記載の方法。
〔５３〕（ｂ）で得られた前記シーケンシングデータが、前記複数の確率バーコード付き標的に対してポリメラーゼ連鎖反応（ＰＣＲ）増幅を実施することによって生成することができる、〔４０〕～〔５２〕のいずれか一項に記載の方法。
〔５４〕前記１つ以上のシーケンシングデータエラーが、ＰＣＲ導入エラー、シーケンシング導入エラー、バーコード混入に起因するエラー、ライブラリー作製エラー、またはそれらの任意の組合せである、〔４０〕～〔５３〕のいずれか一項に記載の方法。
〔５５〕前記ＰＣＲ導入エラーが、ＰＣＲ増幅エラー、ＰＣＲ増幅バイアス、不十分なＰＣＲ増幅、またはそれらの任意の組合せの結果である、〔５４〕に記載の方法。
〔５６〕前記シーケンシング導入エラーが、不正確なベースコーリング、不十分なシーケンシング、またはそれらの任意の組合せの結果である、〔５４〕～〔５５〕のいずれか一項に記載の方法。
〔５７〕工程（ｉ）、（ｉｉ）、（ｉｉｉ）、および（ｉｖ）が、前記複数の標的の各々について実施される、〔４０〕～〔５６〕のいずれか一項に記載の方法。
〔５８〕前記シーケンシングデータ中の前記標的の前記クオリティステータスが、完全シーケンシング、不完全シーケンシング、または飽和シーケンシングである、〔４０〕～〔５７〕のいずれか一項に記載の方法。
〔５９〕前記シーケンシングデータ中の標的のクオリティステータスが、前記複数の確率バーコード中に識別可能な配列を有する前記分子標識の数と、カウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数とによって決定される、〔５８〕に記載の方法。
〔６０〕前記完全シーケンシングクオリティステータスが、所定の完全シーケンシング散布閾値以上の前記ポアソン分布と比較した散布指数によって決定され、前記所定の完全シーケンシング散布閾値が、０．９である、〔５８〕～〔５９〕のいずれか一項に記載の方法。
〔６１〕前記所定の完全シーケンシング散布閾値が、１である、〔６０〕に記載の方法。
〔６２〕前記所定の完全シーケンシング散布閾値が、４である、〔６０〕に記載の方法。
〔６３〕前記完全シーケンシングクオリティステータスが、（ｂ）で得られた前記シーケンシングデータ中の所定の完全シーケンシング発生数閾値以上の発生数を有する分子標識によってさらに決定され、前記所定の完全シーケンシング発生数閾値が、１０である、〔６０〕～〔６２〕のいずれか一項に記載の方法。
〔６４〕前記所定の完全シーケンシング発生数閾値が、１８である、〔６３〕に記載の方法。
〔６５〕前記飽和シーケンシングクオリティステータスが、所定の飽和閾値よりも大きい、識別可能な配列を含む分子標識の数を有する前記標的によって決定される、〔５８〕～〔６４〕のいずれか一項に記載の方法。
〔６６〕前記飽和シーケンシングクオリティステータスが、前記所定の飽和閾値よりも大きい、識別可能な配列を含む分子標識の数を有する前記複数の標的のうちの１つの他の標的によって、さらに決定される、〔６５〕に記載の方法。
〔６７〕前記確率バーコードが、識別可能な配列を有する約６５６１の分子標識を含む場合、前記所定の飽和閾値が、６５５７である、〔６５〕に記載の方法。
〔６８〕前記確率バーコードが、識別可能な配列を有する約６５５３６の分子標識を含む場合、前記所定の飽和閾値が、６５５３２である、〔６５〕に記載の方法。
〔６９〕前記シーケンシングデータ中の前記標的の前記クオリティステータスは、（ｂ）で得られた前記シーケンシングデータ中の前記標的の前記クオリティステータスが、完全シーケンシングではなく、かつ、飽和シーケンシングではない場合に、不完全シーケンシングとして分類される、〔４０〕～〔６８〕のいずれか一項に記載の方法。
〔７０〕（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数が、（ｉｖ）において、
前記標的が前記完全シーケンシングクオリティステータスを有している場合、
１つ以上の親分子標識についてすべての子供分子標識を決定する工程と；
少なくとも１つの子供分子標識および前記親分子標識について第１の統計解析を実施する工程と；
前記第１の統計解析の帰無仮説が容認される場合、前記子供分子標識の前記発生数を前記親分子標識に帰属させる工程と、
によって調節される、〔５０〕～〔６９〕のいずれか一項に記載の方法。
〔７１〕前記１つ以上の親分子標識が、所定の完全シーケンシング親閾値以上の発生数を有する分子標識を含み、前記所定の完全シーケンシング親閾値が、前記所定の完全シーケンシング発生数閾値と等しい、〔７０〕に記載の方法。
〔７２〕前記子供分子標識が、前記親分子標識と１塩基相違し、かつ、所定の完全シーケンシング子供閾値以下の発生数を有する分子標識を含み、前記所定の完全シーケンシング子供閾値が、３である、〔７０〕～〔７１〕のいずれか一項に記載の方法。
〔７３〕前記所定の完全シーケンシング子供閾値が、５である、〔７２〕に記載の方法。
〔７４〕前記帰無仮説が真である確率が偽発見率を下回る場合、前記第１の統計解析の前記帰無仮説が容認され、前記偽発見率が、５％である、〔７０〕～〔７３〕のいずれか一項に記載の方法。
〔７５〕前記偽発見率が１０％である、〔７４〕に記載の方法。
〔７６〕前記第１の統計解析が、多重二項検定である、〔７０〕～〔７５〕のいずれか一項に記載の方法。
〔７７〕（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数は、（ｉｖ）において、
前記標的が前記完全シーケンシングクオリティステータスを有する場合、
前記標的の分子標識を閾値化して、（ｂ）で得られた前記シーケンシングデータ中の前記標的に関連付けられた真の分子標識および偽の分子標識を決定する工程
によって調節される、〔５０〕～〔７６〕のいずれか一項に記載の方法。
〔７８〕前記標的の前記分子標識を閾値化する工程が、前記標的の前記分子標識について第２の統計解析を実施する工程を含む、〔７７〕に記載の方法。
〔７９〕前記第２の統計解析を実施する工程が、
前記標的の前記分子標識の分布およびそれらの発生数を２つのポアソン分布に当てはめる工程と；
前記２つのポアソン分布を用いて真の分子標識の数ｎを決定する工程と；
（ｂ）で得られた前記シーケンシングデータから前記偽の分子標識を除去する工程と、
を含み、
前記偽の分子標識が、ｎ番目に豊富な分子標識の前記発生数よりも低い発生数を有する分子標識を含み、前記真の分子標識が、ｎ番目に豊富な分子標識の前記発生数以上の発生数を有する分子標識を含む、〔７８〕に記載の方法。
〔８０〕前記２つのポアソン分布が、前記真の分子標識に対応する第１のポアソン分布と、前記偽の分子標識に対応する第２のポアソン分布を含む、〔７９〕に記載の方法。
〔８１〕（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数が、（ｉｖ）において、
（ｂ）で得られた前記シーケンシングデータ中の前記標的の前記クオリティステータスが、前記不完全シーケンシングクオリティステータスである場合、
前記標的が、（ｂ）で得られたシーケンシングデータにおいてノイジーであるか否かを決定する工程と；
（ｂ）で得られた前記シーケンシングデータから前記ノイジー標的を除去する工程と、
によって調節される、〔５８〕～〔８０〕のいずれか一項に記載の方法。
〔８２〕前記ノイジー標的の前記分子標識の前記発生数が、不完全シーケンシングクノイジー標的閾値以下であれば、前記標的はノイジーであり、前記不完全シーケンシングノイジー遺伝子閾値が、５である、〔８１〕に記載の方法。
〔８３〕前記不完全シーケンシングノイジー標的閾値が、完全シーケンシングのクオリティステータスを有する前記複数の標的の前記分子標識の前記中央発生数と等しい、〔８２〕に記載の方法。
〔８４〕前記不完全シーケンシングノイジー標的閾値が、完全シーケンシングのクオリティステータスを有する前記複数の標的の前記分子標識の前記平均発生数と等しい、〔８２〕に記載の方法。
〔８５〕（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する前記分子標識の数が、（ｉｖ）において、
（ｂ）で得られた前記シーケンシングデータ中の前記標的の前記クオリティステータスが前記不完全シーケンシングクオリティステータスである場合、
前記標的の前記分子標識を閾値化して、（ｂ）で得られた前記シーケンシングデータ中の真の分子標識および偽の分子標識を決定する工程
によって調節される、〔５０〕～〔８４〕のいずれか一項に記載の方法。
〔８６〕前記標的の前記分子標識を閾値化する工程が、前記分子標識について第３の統計解析を実施する工程を含む、〔８５〕に記載の方法。
〔８７〕前記分子標識について前記第３の統計解析を実施する工程が、
ゼロ切断ポアソンモデルを用いて、真の分子標識の数ｎを決定する工程と；
（ｂ）で得られた前記シーケンシングデータから前記偽の分子標識を除去する工程と、
を含み、
前記偽の分子標識が、ｎ番目に豊富な分子標識の発生数よりも低い発生数を有する分子標識を含み、前記真の分子標識が、ｎ番目に豊富な分子標識の前記発生数以上の発生数を有する分子標識を含む、〔８６〕に記載の方法。
〔８８〕（ｉ）でカウントされた前記シーケンシングデータが、（ｉｉｉ）で決定された前記１つ以上のシーケンシングデータエラーに応じて調節された後、（ｂ）で得られた前記シーケンシングデータ中の前記分子標識の少なくとも５０％が保持される、〔４０〕～〔８７〕のいずれか一項に記載の方法。
〔８９〕（ｉ）でカウントされた前記シーケンシングデータが、（ｉｉｉ）で決定された前記１つ以上のシーケンシングデータエラーに応じて調節された後、（ｂ）ｂ）で得られた前記シーケンシングデータ中の前記分子標識の少なくとも８０％が保持される、〔４０〕～〔８７〕のいずれか一項に記載の方法。
〔９０〕前記複数の標的に確率バーコードを付ける工程が、前記複数の確率バーコードを前記複数の標的とハイブリダイズさせて、前記確率バーコード付き標的を生成する工程を含む、〔４０〕～〔８７〕のいずれか一項に記載の方法。
〔９１〕前記複数の標的に確率バーコードを付ける工程が、前記確率バーコード付き標的のインデックス付きライブラリーを作製する工程を含む、〔８９〕に記載の方法。
〔９２〕前記確率バーコード付き標的のインデックス付きライブラリーを作製する工程が、前記複数の確率バーコードを含む固体担体を用いて実施される、〔８９〕～〔９１〕のいずれか一項に記載の方法。
〔９３〕前記固体担体が、前記複数の確率バーコードと結合した複数の合成粒子を含む、〔９２〕に記載の方法。
〔９４〕前記複数の確率バーコードの各々が、サンプル標識、ユニバーサル標識および細胞標識の１つ以上を含み、前記サンプル標識が、前記固体担体上の前記複数の確率バーコードに対するものと同じであり、ユニバーサル標識が、前記固体担体上の前記複数の確率バーコードに対するものと同じであり、細胞標識が、前記固体担体上の前記複数の確率バーコードに対するものと同じである、〔９２〕～〔９３〕のいずれか一項に記載の方法。
〔９５〕前記サンプル標識が、５～２０ヌクレオチドを含む、〔９４〕に記載の方法。
〔９６〕前記ユニバーサル標識が、５～２０ヌクレオチドを含む、〔９４〕～〔９５〕のいずれか一項に記載の方法。
〔９７〕前記細胞標識が、５～２０ヌクレオチドを含む、〔９４〕～〔９６〕のいずれか一項に記載の方法。
〔９８〕前記固体担体が、２次元または３次元の前記複数の確率バーコードを含む、〔９２〕～〔９５〕のいずれか一項に記載の方法。
〔９９〕前記合成粒子がビーズである、〔９３〕～〔９８〕のいずれか一項に記載の方法。
〔１００〕前記ビーズが、シリカゲルビーズ、調節多孔性ガラスビーズ、磁気ビーズ、ダイナビーズ、セファデックス／セファロースビーズ、セルロースビーズ、ポリスチレンビーズ、またはそれらの任意の組合せである、〔９９〕に記載の方法。
〔１０１〕前記固体担体が、ポリマー、マトリックス、ヒドロゲル、ニードルアレイデバイス、抗体、またはそれらの任意の組合せを含む、〔４０〕～〔１００〕に記載の方法。
〔１０２〕前記複数の標的がサンプル中に含まれる、〔４０〕～〔１０１〕のいずれか一項に記載の方法。
〔１０３〕前記サンプルが、１つ以上の細胞を含む、〔１０２〕に記載の方法。
〔１０４〕前記サンプルが単一細胞である、〔１０２〕に記載の方法。
〔１０５〕前記１つ以上の細胞を溶解する工程をさらに含む、〔１０２〕に記載の方法。
〔１０６〕前記１つ以上の細胞を溶解する工程が、前記サンプルを加熱する工程、前記サンプルを洗剤と接触させる工程、前記サンプルのｐＨを変える工程、またはそれらの任意の組合せを含む、〔１０５〕に記載の方法。
〔１０７〕前記１つ以上の細胞が、１つ以上の細胞型を含む、〔１０２〕に記載の方法。
〔１０８〕前記１つ以上の細胞型の少なくとも１つが、脳細胞、心細胞、癌細胞、循環腫瘍細胞、臓器細胞、上皮細胞、転移細胞、良性細胞、一次細胞、循環細胞、またはそれらの任意の組合せである、〔１０７〕に記載の方法。
〔１０９〕前記複数の標的が、リボ核酸（ＲＮＡ）、メッセンジャーＲＮＡ（ｍＲＮＡ）、ｍｉｃｒｏＲＮＡ、低分子干渉ＲＮＡ（ｓｉＲＮＡ）、ＲＮＡ分解産物、ポリ（Ａ）テールを各々含むＲＮＡ、またはそれらの任意の組合せを含む、〔４０〕～〔１０８〕のいずれか一項に記載の方法。
〔１１０〕前記方法が多重化される、〔４０〕～〔１０９〕のいずれか一項に記載の方法。
〔１１１〕標的の数を決定する方法であって、
（ａ）複数の確率バーコードを用いて、複数の標的に確率バーコードを付けて、複数の確率バーコード付き標的を生成する工程であって、前記複数の確率バーコードの各々が分子標識を含む工程と；
（ｂ）前記確率バーコード付き標的のシーケンシングデータを取得する工程と；
（ｃ）前記複数の標的の１つ以上について：
（ｉ）前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数をカウントする工程と；
（ｉｉ）方向近接性を用いて、前記標的の分子標識のクラスターを同定する工程と；
（ｉｉｉ）（ｉｉ）で同定された前記標的の分子標識の前記クラスターを用いて、（ｂ）で得られた前記シーケンシングデータを折りたたむ工程と；
（ｉｖ）前記標的の数を推定する工程であって、推定された前記標的の数が、（ｉｉ）の前記シーケンシングデータの折りたたみ後に、（ｉ）でカウントされた前記シーケンシングデータ中の前記標的に関連付けられた識別可能な配列を有する分子標識の数と相関する工程と、
を含む、方法。
〔１１２〕前記複数の標的が、細胞の全トランスクリプトームの標的を含む、〔１１１〕に記載の方法。
〔１１３〕クラスター内の前記標的の分子標識が、互いの所定の方向近接性閾値内にある、〔１１１〕～〔１１２〕のいずれか一項に記載の方法。
〔１１４〕前記方向近接性閾値が、１のハミング距離である、〔１１３〕に記載の方法。
〔１１５〕前記クラスター内の前記標的の前記分子標識が、１つ以上の親分子標識と、前記１つ以上の親分子標識の子供分子標識とを含み、前記親分子標識の発生数が、所定の方向近接性発生数閾値以上である、〔１１２〕～〔１１４〕のいずれか一項に記載の方法。
〔１１６〕前記所定の方向近接性発生数閾値が、２×（子供分子標識の発生数）－１である、〔１１５〕に記載の方法。
〔１１７〕（ｉｉ）で同定された前記標的の分子標識の前記クラスターを用いて、（ｂ）で得られたシーケンシングデータを折りたたむ工程が、
前記子供分子標識の前記発生数を前記親分子標識に帰属させる工程
を含む、〔１１１〕～〔１１６〕のいずれか一項に記載の方法。
〔１１８〕前記標的のシーケンシング深度を決定する工程をさらに含む、〔１１１〕～〔１１７〕のいずれか一項に記載の方法。
〔１１９〕前記標的の前記シーケンシング深度が所定のシーケンシング深度閾値を超える場合、前記標的の数を推定する工程が、（ｉ）でカウントされた前記シーケンシングデータを調節する工程を含む、〔１１８〕に記載の方法。
〔１２０〕前記所定のシーケンシング深度閾値が、１５～２０である、〔１１９〕に記載の方法。
〔１２１〕（ｉ）でカウントされた前記シーケンシングデータを調節する工程が、
前記標的の分子標識を閾値化して、（ｂ）で得られた前記シーケンシングデータ中の前記標的に関連付けられた真の分子標識および偽の分子標識を決定する工程
を含む、〔１１９〕～〔１２０〕のいずれか一項に記載の方法。
〔１２２〕前記標的の前記分子標識を閾値化する工程が、前記標的の前記分子標識について統計解析を実施する工程を含む、〔１２１〕に記載の方法。
〔１２３〕前記統計解析を実施する工程が、
前記標的の前記分子標識の分布およびそれらの発生数を２つのポアソン分布に当てはめる工程と；
前記２つのポアソン分布を用いて真の分子標識の数ｎを決定する工程と；
（ｂ）で得られた前記シーケンシングデータから前記偽の分子標識を除去する工程と、
を含み、
前記偽の分子標識が、ｎ番目に豊富な分子標識の前記発生数よりも低い発生数を有する分子標識を含み、前記真の分子標識が、ｎ番目に豊富な分子標識の前記発生数以上の発生数を有する分子標識を含む、〔１２２〕に記載の方法。
〔１２４〕前記２つのポアソン分布が、前記真の分子標識に対応する第１のポアソン分布と、前記偽の分子標識に対応する第２のポアソン分布を含む、〔１２３〕に記載の方法。
〔１２５〕ハードウェアプロセッサーと、
前記ハードウェアプロセッサーによって実行される場合、前記プロセッサーに〔４０〕～〔１２４〕のいずれか一項に記載の方法を実行させる命令を記憶した非一過性メモリーと、を含む、ターゲットの数を決定するためのコンピュータシステム。
〔１２６〕〔４０〕～〔１２４〕のいずれか一項に記載の方法を実行するためのコードを含むソフトウェアプログラムを含む、コンピュータ読取り媒体。 Various embodiments and embodiments have been disclosed herein, but other embodiments and embodiments will be obvious to those of skill in the art. The various embodiments and embodiments disclosed herein are for purposes of illustration only and are not intended to be limiting, and the true scope and intent are set forth in the following claims.
The following aspects of the present invention are also preferable.
[1] A method for determining the number of targets, comprising:
(a) attaching probability barcodes to a plurality of targets using a plurality of probability barcodes to generate a plurality of probability-barcoded targets, wherein each of the plurality of probability barcodes comprises a molecular label;
(b) obtaining sequencing data of the probability-barcoded targets; and
(c) for one or more of the plurality of targets:
(i) counting the number of molecular labels having an identifiable sequence associated with the target in the sequencing data;
(ii) identifying clusters of molecular labels of the target using directional proximity;
(iii) collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii); and
(iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i) after collapsing the sequencing data in (ii).
[2] The method according to [1], wherein the plurality of targets include targets of the whole transcriptome of a cell.
[3] The method according to any one of [1] to [2], wherein the molecular labels of the targets in the cluster are within a predetermined directional proximity threshold value of each other.
[4] The method according to [3], wherein the directional proximity threshold is a Hamming distance of 1.
[5] The method according to any one of [1] to [4], wherein the molecular labels of the target in the cluster include one or more parent molecule labels and child molecule labels of the one or more parent molecule labels, and the number of occurrences of the parent molecule label is greater than or equal to a predetermined directional proximity occurrence number threshold.
[6] The method according to [5], wherein the predetermined directional proximity occurrence number threshold is 2 × (number of occurrences of the child molecule label) - 1.
[7] The method according to any one of [1] to [6], wherein collapsing the sequencing data obtained in (b) using the clusters of molecular labels of the target identified in (ii) comprises assigning the number of occurrences of the child molecule label to the parent molecule label.
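The directional-proximity collapse described in items [3] to [7] can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names `hamming` and `collapse_directional` are hypothetical, labels are assumed to have equal length, and the single-level greedy pass (most abundant labels become parents first) is an assumption made for brevity.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("labels must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_directional(counts: dict[str, int]) -> dict[str, int]:
    """Assign child occurrences to parents (hypothetical helper).

    A label is treated as a child of a parent when the two differ by a
    Hamming distance of 1 and
    count(parent) >= 2 * count(child) - 1.
    """
    # Visit labels from most to least abundant so parents are seen first.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    assigned: set[str] = set()
    collapsed: dict[str, int] = {}
    for parent in ordered:
        if parent in assigned:
            continue
        assigned.add(parent)
        total = counts[parent]
        for child in ordered:
            if child in assigned:
                continue
            if (hamming(parent, child) == 1
                    and counts[parent] >= 2 * counts[child] - 1):
                assigned.add(child)
                total += counts[child]
        collapsed[parent] = total
    return collapsed
```

With counts `{"ACGT": 10, "ACGA": 2, "ACGG": 8, "TTTT": 5}`, the label `ACGA` is merged into `ACGT`, while `ACGG` stays separate because 10 is less than 2 × 8 - 1.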
[8] The method according to any one of [1] to [7], further comprising a step of determining the sequencing depth of the target.
[9] The method according to [8], wherein estimating the number of the target comprises adjusting the sequencing data counted in (i) if the sequencing depth of the target exceeds a predetermined sequencing depth threshold.
[10] The method according to [9], wherein the predetermined sequencing depth threshold is 15 to 20.
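[11] The method according to any one of [9] to [10], wherein adjusting the sequencing data counted in (i) comprises thresholding the molecular labels of the target to determine true molecular labels and false molecular labels associated with the target in the sequencing data obtained in (b).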
[12] The method according to [11], wherein the step of thresholding the molecular label of the target includes a step of performing statistical analysis on the molecular label of the target.
[13] The method according to [12], wherein performing the statistical analysis comprises:
fitting the distribution of the molecular labels of the target and their numbers of occurrences to two negative binomial distributions;
determining the number n of true molecular labels using the two negative binomial distributions; and
removing the false molecular labels from the sequencing data obtained in (b),
wherein the false molecular labels include molecular labels having a number of occurrences lower than that of the n-th most abundant molecular label, and the true molecular labels include molecular labels having a number of occurrences greater than or equal to that of the n-th most abundant molecular label.
[14] The method according to [13], wherein the two negative binomial distributions comprise a first negative binomial distribution corresponding to the true molecular labels and a second negative binomial distribution corresponding to the false molecular labels.
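Once the number n of true molecular labels is known, the split into true and false labels described in item [13] reduces to a cutoff at the n-th most abundant label. The sketch below assumes n has already been determined (the fitting step is not shown), and `split_true_false` is a hypothetical name.

```python
def split_true_false(counts: dict[str, int], n: int):
    """Split labels at the occurrence count of the n-th most abundant label.

    Labels with occurrences >= the cutoff are kept as true; labels with
    fewer occurrences are treated as false. Ties at the cutoff are all
    kept, so more than n labels may be returned as true.
    """
    if not 1 <= n <= len(counts):
        raise ValueError("n must be between 1 and the number of labels")
    ranked = sorted(counts.values(), reverse=True)
    cutoff = ranked[n - 1]
    true_labels = {k: v for k, v in counts.items() if v >= cutoff}
    false_labels = {k: v for k, v in counts.items() if v < cutoff}
    return true_labels, false_labels
```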
[15] A method for determining the number of targets, comprising:
(a) attaching probability barcodes to a plurality of targets using a plurality of probability barcodes to generate a plurality of probability-barcoded targets, wherein each of the plurality of probability barcodes comprises a molecular label;
(b) obtaining sequencing data of the probability-barcoded targets; and
(c) for one or more of the plurality of targets:
(i) counting the number of molecular labels having an identifiable sequence associated with the target in the sequencing data;
(ii) determining the number of noise molecule labels having an identifiable sequence associated with the target in the sequencing data; and
(iii) estimating the number of the target,
wherein the estimated number of the target correlates with the number of molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i), adjusted according to the number of the noise molecule labels determined in (ii).
[16] The method of [15], further comprising determining the sequencing status of the target in the sequencing data.
[17] The method of [16], wherein the sequencing status of the target in the sequencing data is saturated sequencing, undersequencing, or oversequencing.
[18] The method of [17], wherein the saturation sequencing status is determined by said target having a number of molecular labels containing identifiable sequences greater than a predetermined saturation threshold.
[19] The method of [18], wherein the predetermined saturation threshold is about 6557 when the probability barcodes comprise about 6561 molecular labels having an identifiable sequence.
[20] The method according to any one of [18] to [19], wherein the predetermined saturation threshold is about 65532 when the probability barcodes comprise about 65536 molecular labels having an identifiable sequence.
[21] The method according to any one of [17] to [20], wherein the number of the noise molecule labels determined in (ii) is zero when the sequencing status of the target in the sequencing data is the saturated sequencing status.
[22] The method according to any one of [17] to [21], wherein the undersequencing status is determined by the target having a depth less than a predetermined undersequencing threshold, and the depth of the target comprises the average, minimum, or maximum depth of the molecular labels having an identifiable sequence associated with the target in the sequencing data.
[23] The method according to [22], wherein the undersequencing threshold is about 4.
[24] The method of [23], wherein the undersequencing threshold is independent of the number of the molecular labels having an identifiable sequence.
[25] The method according to any one of [17] to [24], wherein the number of the noise molecule labels determined in (ii) is zero when the sequencing status of the target in the sequencing data is the undersequencing status.
[26] The method according to any one of [17] to [25], wherein the oversequencing status is determined by the target having a depth greater than a predetermined oversequencing threshold, and the depth of the target comprises the average, minimum, or maximum depth of the molecular labels having an identifiable sequence associated with the target in the sequencing data.
[27] The method of [26], wherein the oversequencing threshold is about 250 when the probability barcodes comprise about 6561 molecular labels having an identifiable sequence.
[28] The method according to any one of [26] to [27], further comprising, when the sequencing status of the target in the sequencing data is the oversequencing status, subsampling the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data to the predetermined oversequencing threshold.
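The sequencing status rules in items [17] to [28] can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name `sequencing_status` and the returned strings are hypothetical, the depth is taken as the average occurrence count (the items also allow minimum or maximum), the order in which the checks are applied is an assumption, and the default thresholds are the example values quoted in items [19], [23], and [27].

```python
from statistics import mean


def sequencing_status(
    label_counts: dict[str, int],
    saturation_threshold: int = 6557,
    under_threshold: float = 4,
    over_threshold: float = 250,
) -> str:
    """Classify one target from the occurrence counts of its molecular labels."""
    if not label_counts:
        raise ValueError("at least one molecular label is required")
    # Saturated: more distinct labels than the saturation threshold.
    if len(label_counts) > saturation_threshold:
        return "saturated"
    # Depth is assumed here to be the average occurrences per label.
    depth = mean(label_counts.values())
    if depth < under_threshold:
        return "undersequenced"
    if depth > over_threshold:
        return "oversequenced"
    return "normal"
```

For a target classified as saturated or undersequenced, items [21] and [25] set the number of noise molecule labels to zero, so no distribution fitting would be needed for it.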
[29] The method according to any one of [17] to [28], wherein determining the number of the noise molecule labels having an identifiable sequence associated with the target in the sequencing data comprises, if a negative binomial distribution fitting condition is met:
(iv) fitting a signal negative binomial distribution to the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i), wherein the signal negative binomial distribution corresponds to the number of molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i) that are signal molecular labels;
(v) fitting a noise negative binomial distribution to the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i), wherein the noise negative binomial distribution corresponds to the number of molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i) that are noise molecule labels; and
(vi) determining the number of the noise molecule labels using the signal negative binomial distribution fitted in (v) and the noise negative binomial distribution fitted in (vi).
[30] The method according to [29], wherein the negative binomial distribution fitting condition comprises the sequencing status of the target in the sequencing data being neither the undersequencing status nor the oversequencing status.
[31] The method according to any one of [29] to [30], wherein determining the number of the noise molecule labels using the signal negative binomial distribution fitted in (v) and the noise negative binomial distribution fitted in (vi) comprises, for each of the identifiable sequences associated with the target in the sequencing data:
determining a signal probability that the identifiable sequence belongs to the signal negative binomial distribution;
determining a noise probability that the identifiable sequence belongs to the noise negative binomial distribution; and
determining that the identifiable sequence is a noise molecule label if the signal probability is less than the noise probability.
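The comparison in item [31] can be sketched by evaluating each label's occurrence count under two already-fitted negative binomial distributions. The fitting itself is not shown; the parameterization by size r and success probability p, the helper names `nb_pmf`, `params_from_mean`, and `count_noise_labels`, and the numeric parameter values in the usage note are all assumptions for illustration.

```python
from math import exp, lgamma, log


def nb_pmf(k: int, r: float, p: float) -> float:
    """P(K = k) for a negative binomial with size r > 0 and 0 < p < 1."""
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1.0 - p))


def params_from_mean(mean: float, r: float) -> tuple[float, float]:
    """Convert a (mean, size) description into (r, p)."""
    return r, r / (r + mean)


def count_noise_labels(
    label_counts: dict[str, int],
    signal: tuple[float, float],
    noise: tuple[float, float],
) -> int:
    """Count labels whose signal probability is less than their noise probability."""
    return sum(
        1
        for c in label_counts.values()
        if nb_pmf(c, *signal) < nb_pmf(c, *noise)
    )
```

For example, with a hypothetical signal component of mean 30 (r = 5) and a noise component of mean 1.5 (r = 2), labels seen once or twice are classified as noise, while labels seen 25 or 30 times are classified as signal.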
[32] The method according to any one of [17] to [31], wherein determining the number of the noise molecule labels having an identifiable sequence associated with the target in the sequencing data comprises:
if the sequencing status of the target in the sequencing data is neither the undersequencing status nor the oversequencing status, and the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i) is less than a pseudo-point threshold, adding pseudo-points to the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data before determining, in (ii), the number of the noise molecule labels having an identifiable sequence associated with the target in the sequencing data.
[33] The method according to [32], wherein the pseudo-point threshold is 10.
[34] The method according to any one of [17] to [33], wherein determining the number of the noise molecule labels having an identifiable sequence associated with the target in the sequencing data comprises:
if the sequencing status of the target in the sequencing data is neither the undersequencing status nor the oversequencing status, and the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i) is greater than or equal to the pseudo-point threshold, removing non-unique molecular labels when determining, in (ii), the number of the noise molecule labels having an identifiable sequence associated with the target in the sequencing data.
[35] The method according to [34], wherein removing the non-unique molecular labels comprises removing the non-unique molecular labels, when determining in (ii) the number of the noise molecule labels having an identifiable sequence associated with the target in the sequencing data, if the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data is greater than a predetermined reused molecular label threshold.
[36] The method of [35], wherein the reused molecular label threshold is about 650 when the probability barcodes comprise about 6561 molecular labels having an identifiable sequence.
[37] The method according to any one of [34] to [36], wherein removing the non-unique molecular labels comprises:
determining a theoretical number of non-unique molecular labels for the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data; and
removing molecular labels having a number of occurrences greater than that of the n-th most abundant molecular label having an identifiable sequence associated with the target in the sequencing data,
wherein n is the theoretical number of non-unique molecular labels.
[38] A computer system for determining the number of targets, comprising:
a hardware processor; and
a non-transitory memory storing instructions that, when executed by the hardware processor, cause the processor to perform the method according to any one of [1] to [37].
[39] A computer-readable medium comprising a software program containing code for executing the method according to any one of [1] to [37].
[40] A method for determining the number of targets, comprising:
(a) attaching probability barcodes to a plurality of targets using a plurality of probability barcodes to generate a plurality of probability-barcoded targets, wherein each of the plurality of probability barcodes comprises a molecular label;
(b) obtaining sequencing data of the probability-barcoded targets; and
(c) for one or more of the plurality of targets:
(i) counting the number of molecular labels having an identifiable sequence associated with the target in the sequencing data;
(ii) determining the quality status of the target in the sequencing data obtained in (b);
(iii) determining one or more sequencing data errors in the sequencing data obtained in (b), wherein determining the one or more sequencing data errors in the sequencing data comprises determining one or more of: the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data, the quality status of the target in the sequencing data, and the number of the molecular labels having an identifiable sequence in the plurality of probability barcodes; and
(iv) estimating the number of the target, wherein the estimated number of the target correlates with the number of the molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i), adjusted according to the one or more sequencing data errors determined in (iii).
[41] The method according to [40], further comprising a step of collapsing the sequencing data obtained in (b) before determining the one or more sequencing data errors.
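[42] The method according to [41], wherein collapsing the sequencing data obtained in (b) comprises:
assigning copies of a target that have similar molecular labels and a number of occurrences less than a predetermined folding occurrence number threshold as having the same molecular label for the plurality of targets, wherein two copies of the target have similar molecular labels if the sequences of the molecular labels of the two copies of the target differ by at least one base.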
[43] The method according to [42], wherein the predetermined folding occurrence number threshold is 7 when the probability barcodes comprise about 6561 molecular labels having an identifiable sequence.
[44] The method according to [42], wherein the predetermined folding occurrence number threshold is 17 when the probability barcode contains about 65536 molecular labels having an identifiable sequence.
[45] The method according to any one of [42] to [44], wherein the two copies of the target have similar molecular labels if the sequences of the molecular labels of the two copies of the target differ by at least one base.
[46] The method according to any one of [40] to [45], wherein the molecular label contains 5 to 20 nucleotides.
[47] The method according to any one of [40] to [46], wherein the molecular labels of different probability barcodes differ from each other.
[48] The method according to any one of [40] to [47], wherein the plurality of probability barcodes comprise about 6561 molecular labels having an identifiable sequence.
[49] The method according to any one of [40] to [47], wherein the plurality of probability barcodes comprise about 65536 molecular labels having an identifiable sequence.
[50] The method according to any one of [40] to [49], wherein the sequencing data comprises a sequence of the plurality of targets having a read length of 50 nucleotides or more.
[51] The method according to any one of [40] to [49], wherein the sequencing data comprises a sequence of the plurality of targets having a read length of 75 nucleotides or more.
[52] The method according to any one of [40] to [49], wherein the sequencing data comprises a sequence of the plurality of targets having a read length of 100 nucleotides or more.
[53] The method according to any one of [40] to [52], wherein the sequencing data obtained in (b) can be generated by performing polymerase chain reaction (PCR) amplification on the plurality of probability-barcoded targets.
[54] The method according to any one of [40] to [53], wherein the one or more sequencing data errors are PCR introduction errors, sequencing introduction errors, errors due to barcode contamination, library preparation errors, or any combination thereof.
[55] The method of [54], wherein the PCR introduction error is the result of a PCR amplification error, a PCR amplification bias, inadequate PCR amplification, or any combination thereof.
[56] The method of any one of [54]-[55], wherein the sequencing introduction error is the result of inaccurate base calling, inadequate sequencing, or any combination thereof.
[57] The method according to any one of [40] to [56], wherein steps (i), (ii), (iii), and (iv) are carried out for each of the plurality of targets.
[58] The method according to any one of [40] to [57], wherein the quality status of the target in the sequencing data is complete sequencing, incomplete sequencing, or saturated sequencing.
[59] The method according to [58], wherein the quality status of the target in the sequencing data is determined by the number of the molecular labels having an identifiable sequence in the plurality of probability barcodes and the counted number of the molecular labels having an identifiable sequence associated with the target in the sequencing data.
[60] The method according to any one of [58] to [59], wherein the complete sequencing quality status is determined by a dispersion index relative to the Poisson distribution that is greater than or equal to a predetermined complete sequencing dispersion threshold, and the predetermined complete sequencing dispersion threshold is 0.9.
[61] The method according to [60], wherein the predetermined complete sequencing dispersion threshold is 1.
[62] The method according to [60], wherein the predetermined complete sequencing dispersion threshold is 4.
[63] The method according to any one of [60] to [62], wherein the complete sequencing quality status is further determined by molecular labels having a number of occurrences greater than or equal to a predetermined complete sequencing occurrence number threshold in the sequencing data obtained in (b), and the predetermined complete sequencing occurrence number threshold is 10.
[64] The method according to [63], wherein the predetermined complete sequencing occurrence number threshold value is 18.
[65] The method according to any one of [58] to [64], wherein the saturated sequencing quality status is determined by the target having a number of molecular labels with an identifiable sequence greater than a predetermined saturation threshold.
[66] The method according to [65], wherein the saturated sequencing quality status is further determined by one other target of the plurality of targets having a number of molecular labels with an identifiable sequence greater than the predetermined saturation threshold.
[67] The method according to [65], wherein the predetermined saturation threshold is 6557 when the probability barcode comprises about 6651 molecular labels having an identifiable sequence.
[68] The method according to [65], wherein the predetermined saturation threshold is 65532 when the probability barcode contains about 65536 molecular labels having an identifiable sequence.
[69] The method according to any one of [40] to [68], wherein the quality status of the target in the sequencing data obtained in (b) is classified as incomplete sequencing if it is neither complete sequencing nor saturated sequencing.
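As a non-limiting illustration, the classification of [58] to [69] might be sketched in Python as follows. The function names, the dictionary input (molecular label sequence mapped to its number of occurrences), the use of the variance-to-mean ratio as the dispersion index, and the order in which the statuses are tested are assumptions of this sketch and are not taken from the disclosure; the default thresholds are the example values recited in [60], [63], and [68].

```python
from statistics import mean, pvariance

def dispersion_index(counts):
    """Variance-to-mean ratio of the occurrence counts (equal to 1 for a Poisson distribution)."""
    m = mean(counts)
    return pvariance(counts) / m if m > 0 else 0.0

def quality_status(label_counts, saturation_threshold=65532,
                   dispersion_threshold=0.9, occurrence_threshold=10):
    """Classify one target as 'saturated', 'complete', or 'incomplete'.

    label_counts maps each identifiable molecular label sequence of the
    target to its number of occurrences in the sequencing data.
    """
    counts = list(label_counts.values())
    if not counts:
        return "incomplete"
    # Assumption: saturation is tested first.
    if len(counts) > saturation_threshold:
        return "saturated"
    # Complete: over-dispersed relative to Poisson and at least one
    # label reaches the occurrence number threshold.
    if (dispersion_index(counts) > dispersion_threshold
            and max(counts) >= occurrence_threshold):
        return "complete"
    # Neither complete nor saturated.
    return "incomplete"
```

A target whose labels all occur once has a dispersion index of 0 and is therefore classified as incomplete under this sketch.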
[70] The method according to any one of [50] to [69], wherein, in (iv), if the target has the complete sequencing quality status, the number of the molecular labels having the identifiable sequence associated with the target in the sequencing data counted in (i) is adjusted by:
a step of determining all child molecule labels for one or more parent molecule labels;
a step of performing a first statistical analysis on at least one child molecule label and the parent molecule label; and
a step of assigning the number of occurrences of the child molecule label to the parent molecule label if the null hypothesis of the first statistical analysis is accepted.
[71] The method according to [70], wherein the one or more parent molecule labels include a molecular label having a number of occurrences equal to or greater than a predetermined complete sequencing parent threshold, and the predetermined complete sequencing parent threshold is equal to the predetermined complete sequencing occurrence number threshold.
[72] The method according to any one of [70] to [71], wherein the child molecule label comprises a molecular label that differs from the parent molecule label by one base and has a number of occurrences less than or equal to a predetermined complete sequencing child threshold, and the predetermined complete sequencing child threshold is 3.
[73] The method according to [72], wherein the predetermined complete sequencing child threshold is 5.
[74] The method according to any one of [70] to [73], wherein the null hypothesis of the first statistical analysis is accepted when the probability that the null hypothesis is true is less than a false discovery rate, and the false discovery rate is 5%.
[75] The method according to [74], wherein the false discovery rate is 10%.
[76] The method according to any one of [70] to [75], wherein the first statistical analysis is a multiplex binomial test.
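As a non-limiting illustration, the selection of parent and child molecule labels in [70] to [73] and the tail probability of a single binomial test might be sketched in Python as follows. The function names and the dictionary input are assumptions of this sketch. The disclosure does not specify here how the binomial test is parameterized, so `binomial_upper_tail` is only a generic helper; the choice of trial number, success probability, and acceptance rule is left to the practitioner.

```python
from math import comb

def hamming(a, b):
    """Number of positions at which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))

def find_children(label_counts, parent_threshold=10, child_threshold=3):
    """Map each parent label to its candidate child labels.

    Parents have at least parent_threshold occurrences; children differ
    from the parent by one base and have at most child_threshold
    occurrences (example values from [63] and [72]).
    """
    parents = [s for s, c in label_counts.items() if c >= parent_threshold]
    return {
        p: [s for s, c in label_counts.items()
            if c <= child_threshold and hamming(p, s) == 1]
        for p in parents
    }

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

One hypothetical use is to call `binomial_upper_tail` with the child's number of occurrences as `k`, the combined parent and child occurrences as `n`, and an assumed per-read error rate as `p`, once per parent and child pair.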
[77] The method according to any one of [50] to [76], wherein, in (iv), if the target has the complete sequencing quality status, the number of the molecular labels having the identifiable sequence associated with the target in the sequencing data counted in (i) is adjusted by a step of thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels associated with the target in the sequencing data obtained in (b).
[78] The method according to [77], wherein the step of thresholding the molecular label of the target includes a step of performing a second statistical analysis on the molecular label of the target.
[79] The method according to [78], wherein the step of performing the second statistical analysis includes:
a step of fitting the distribution of the molecular labels of the target and their numbers of occurrences to two Poisson distributions;
a step of determining the number n of true molecular labels using the two Poisson distributions; and
a step of removing the false molecular labels from the sequencing data obtained in (b),
wherein the false molecular labels comprise molecular labels having a lower number of occurrences than the nth most abundant molecular label, and the true molecular labels comprise molecular labels having a number of occurrences greater than or equal to that of the nth most abundant molecular label.
[80] The method according to [79], wherein the two Poisson distributions include a first Poisson distribution corresponding to the true molecular label and a second Poisson distribution corresponding to the false molecular label.
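As a non-limiting illustration, the two-Poisson thresholding of [78] to [80] might be sketched in Python as follows. The disclosure does not state how the two distributions are fitted; this sketch assumes a two-component Poisson mixture fitted by expectation-maximization, takes n as the number of labels more likely to belong to the higher-mean (true) component, and assumes all occurrence counts are positive. All function names are hypothetical.

```python
import math

def poisson_logpmf(k, lam):
    """Log of the Poisson probability of k occurrences with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def true_posterior(k, lam_noise, lam_true, w_true):
    """Posterior probability that a count k comes from the 'true' component."""
    a = math.log(w_true) + poisson_logpmf(k, lam_true)
    b = math.log(1.0 - w_true) + poisson_logpmf(k, lam_noise)
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

def fit_two_poisson(counts, iterations=100):
    """Fit a two-component Poisson mixture by EM (assumed fitting method)."""
    lam_noise = max(float(min(counts)), 0.5)
    lam_true = max(float(max(counts)), lam_noise + 1.0)
    w_true = 0.5
    n = len(counts)
    for _ in range(iterations):
        resp = [true_posterior(k, lam_noise, lam_true, w_true) for k in counts]
        s = sum(resp)
        if s < 1e-9 or n - s < 1e-9:
            break  # one component is empty; keep the last valid parameters
        lam_true = sum(r * k for r, k in zip(resp, counts)) / s
        lam_noise = max(sum((1 - r) * k for r, k in zip(resp, counts)) / (n - s), 1e-9)
        w_true = s / n
    return lam_noise, lam_true, w_true

def split_true_false(label_counts):
    """Return (true_labels, false_labels) using the nth most abundant label as cutoff."""
    counts = list(label_counts.values())
    params = fit_two_poisson(counts)
    n_true = sum(true_posterior(k, *params) > 0.5 for k in counts)
    if n_true == 0:
        return {}, dict(label_counts)
    cutoff = sorted(counts, reverse=True)[n_true - 1]
    true = {s: c for s, c in label_counts.items() if c >= cutoff}
    false = {s: c for s, c in label_counts.items() if c < cutoff}
    return true, false
```

The adjusted molecular label count for the target is then the number of entries in the first returned dictionary.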
[81] The method according to any one of [58] to [80], wherein, in (iv), when the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, the number of the molecular labels having the identifiable sequence associated with the target in the sequencing data counted in (i) is adjusted by:
a step of determining whether the target is noisy in the sequencing data obtained in (b); and
a step of removing the noisy target from the sequencing data obtained in (b).
[82] The method according to [81], wherein the target is noisy if the number of occurrences of the molecular labels of the target is less than or equal to an incomplete sequencing noisy target threshold, and the incomplete sequencing noisy target threshold is 5.
[83] The method of [82], wherein the incomplete sequencing noisy target threshold is equal to the median number of occurrences of the molecular labels of the plurality of targets having a quality status of complete sequencing.
[84] The method according to [82], wherein the incomplete sequencing noisy target threshold is equal to the average number of occurrences of the molecular label of the plurality of targets having a quality status of complete sequencing.
[85] The method according to any one of [50] to [84], wherein, in (iv), when the quality status of the target in the sequencing data obtained in (b) is the incomplete sequencing quality status, the number of the molecular labels having the identifiable sequence associated with the target in the sequencing data counted in (i) is adjusted by a step of thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels in the sequencing data obtained in (b).
[86] The method according to [85], wherein the step of thresholding the molecular label of the target comprises a step of performing a third statistical analysis on the molecular label.
[87] The method according to [86], wherein the step of performing the third statistical analysis on the molecular labels includes:
a step of determining the number n of true molecular labels using a zero-truncated Poisson model; and
a step of removing the false molecular labels from the sequencing data obtained in (b),
wherein the false molecular labels comprise molecular labels having a lower number of occurrences than the nth most abundant molecular label, and the true molecular labels comprise molecular labels having a number of occurrences greater than or equal to that of the nth most abundant molecular label.
[88] The method according to any one of [40] to [87], wherein at least 50% of the molecular labels in the sequencing data obtained in (b) are retained after the sequencing data counted in (i) is adjusted according to the one or more sequencing data errors determined in (iii).
[89] The method according to any one of [40] to [87], wherein at least 80% of the molecular labels in the sequencing data obtained in (b) are retained after the sequencing data counted in (i) is adjusted according to the one or more sequencing data errors determined in (iii).
[90] The method according to any one of [40] to [87], wherein the step of attaching a probability barcode to the plurality of targets includes a step of hybridizing the plurality of probability barcodes with the plurality of targets to generate the targets with the probability barcodes.
[91] The method according to [89], wherein the step of attaching a probability barcode to the plurality of targets includes a step of creating an indexed library of the target with the probability barcode.
[92] The method according to any one of [89] to [91], wherein the step of creating the indexed library of the targets with the probability barcodes is carried out using a solid carrier containing the plurality of probability barcodes.
[93] The method according to [92], wherein the solid carrier comprises a plurality of synthetic particles bound to the plurality of probability barcodes.
[94] The method according to any one of [92] to [93], wherein each of the plurality of probability barcodes comprises one or more of a sample label, a universal label, and a cell label, wherein the sample label is the same for the plurality of probability barcodes on the solid carrier, the universal label is the same for the plurality of probability barcodes on the solid carrier, and the cell label is the same for the plurality of probability barcodes on the solid carrier.
[95] The method according to [94], wherein the sample label comprises 5 to 20 nucleotides.
[96] The method according to any one of [94] to [95], wherein the universal label comprises 5 to 20 nucleotides.
[97] The method according to any one of [94] to [96], wherein the cell label comprises 5 to 20 nucleotides.
[98] The method according to any one of [92] to [95], wherein the solid carrier comprises the plurality of probability barcodes in two dimensions or three dimensions.
[99] The method according to any one of [93] to [98], wherein the synthetic particles are beads.
[100] The method according to [99], wherein the beads are silica gel beads, controlled pore glass beads, magnetic beads, Dynabeads, Sephadex/Sepharose beads, cellulose beads, polystyrene beads, or any combination thereof.
[101] The method according to any one of [40] to [100], wherein the solid carrier comprises a polymer, a matrix, a hydrogel, a needle array device, an antibody, or any combination thereof.
[102] The method according to any one of [40] to [101], wherein the plurality of targets are contained in the sample.
[103] The method of [102], wherein the sample comprises one or more cells.
[104] The method according to [102], wherein the sample is a single cell.
[105] The method according to [102], further comprising the step of lysing the one or more cells.
[106] The method according to [105], wherein the step of lysing the one or more cells comprises heating the sample, contacting the sample with a detergent, changing the pH of the sample, or any combination thereof.
[107] The method according to [102], wherein the one or more cells contain one or more cell types.
[108] The method according to [107], wherein at least one of the one or more cell types is a brain cell, a heart cell, a cancer cell, a circulating tumor cell, an organ cell, an epithelial cell, a metastatic cell, a benign cell, a primary cell, a circulating cell, or any combination thereof.
[109] The method according to any one of [40] to [108], wherein the plurality of targets comprise ribonucleic acid (RNA), messenger RNA (mRNA), microRNA, small interfering RNA (siRNA), RNA degradation products, RNA containing a poly(A) tail, or any combination thereof.
[110] The method according to any one of [40] to [109], wherein the method is multiplexed.
[111] A method of determining the number of targets, comprising:
(a) a step of attaching probability barcodes to a plurality of targets using a plurality of probability barcodes to generate a plurality of targets with probability barcodes, wherein each of the plurality of probability barcodes comprises a molecular label;
(b) a step of acquiring sequencing data of the targets with the probability barcodes; and
(c) for one or more of the plurality of targets:
(i) a step of counting the number of molecular labels having an identifiable sequence associated with the target in the sequencing data;
(ii) a step of identifying a cluster of the molecular labels of the target using directional proximity;
(iii) a step of collapsing the sequencing data obtained in (b) using the cluster of the molecular labels of the target identified in (ii); and
(iv) a step of estimating the number of the target, wherein the estimated number of the target correlates with the number of molecular labels having an identifiable sequence associated with the target in the sequencing data counted in (i) after collapsing the sequencing data of (ii).
[112] The method of [111], wherein the plurality of targets include targets for the entire transcriptome of the cell.
[113] The method according to any one of [111] to [112], wherein the molecular labels of the targets in the cluster are within a predetermined directional proximity threshold of each other.
[114] The method according to [113], wherein the directional proximity threshold is a Hamming distance of 1.
[115] The method according to any one of [112] to [114], wherein the molecular labels of the target in the cluster include one or more parent molecule labels and child molecule labels of the one or more parent molecule labels, and the number of occurrences of the parent molecule label is equal to or greater than a predetermined directional proximity occurrence number threshold.
[116] The method according to [115], wherein the predetermined directional proximity occurrence number threshold is 2 × (number of occurrences of the child molecule label) - 1.
[117] The method according to any one of [111] to [116], wherein the step of collapsing the sequencing data obtained in (b) using the cluster of the molecular labels of the target identified in (ii) comprises a step of assigning the number of occurrences of the child molecule label to the parent molecule label.
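As a non-limiting illustration, the directional proximity clustering and collapsing of [111] to [117] might be sketched in Python as follows. The sketch uses the example values recited above: a Hamming distance of 1 as the proximity threshold and 2 × (child occurrences) - 1 as the parent occurrence number threshold. The function names, the dictionary input, the breadth-first traversal, and the tie-breaking order are assumptions of this sketch; labels are assumed to be of equal length.

```python
from collections import deque

def hamming(a, b):
    """Number of positions at which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))

def directional_clusters(label_counts):
    """Group molecular labels into clusters keyed by their most abundant label.

    A label joins a cluster as a child when it is within Hamming distance 1
    of a label already in the cluster (its parent) and
    parent_count >= 2 * child_count - 1.
    """
    # Visit labels from most to least abundant; ties are broken alphabetically.
    ordered = sorted(label_counts, key=lambda s: (-label_counts[s], s))
    assigned = set()
    clusters = {}
    for root in ordered:
        if root in assigned:
            continue
        assigned.add(root)
        members = [root]
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in ordered:
                if child in assigned:
                    continue
                if (hamming(parent, child) == 1
                        and label_counts[parent] >= 2 * label_counts[child] - 1):
                    assigned.add(child)
                    members.append(child)
                    queue.append(child)  # a child may itself have children
        clusters[root] = members
    return clusters

def collapse(label_counts):
    """Assign the occurrences of each child label to its cluster's root label."""
    return {root: sum(label_counts[m] for m in members)
            for root, members in directional_clusters(label_counts).items()}
```

For example, with occurrences {"ACGT": 100, "ACGA": 3, "ACGG": 60, "TTTT": 5}, the label ACGA is merged into ACGT (100 ≥ 2 × 3 - 1), whereas ACGG is not (100 < 2 × 60 - 1), leaving three molecular labels after collapsing.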
[118] The method according to any one of [111] to [117], further comprising a step of determining the sequencing depth of the target.
[119] The method according to [118], wherein, when the sequencing depth of the target exceeds a predetermined sequencing depth threshold, the step of estimating the number of the target includes a step of adjusting the sequencing data counted in (i).
[120] The method according to [119], wherein the predetermined sequencing depth threshold is 15 to 20.
[121] The method according to any one of [119] to [120], wherein the step of adjusting the sequencing data counted in (i) comprises a step of thresholding the molecular labels of the target to determine the true molecular labels and the false molecular labels associated with the target in the sequencing data obtained in (b).
[122] The method according to [121], wherein the step of thresholding the molecular label of the target includes a step of performing statistical analysis on the molecular label of the target.
[123] The method according to [122], wherein the step of performing the statistical analysis includes:
a step of fitting the distribution of the molecular labels of the target and their numbers of occurrences to two Poisson distributions;
a step of determining the number n of true molecular labels using the two Poisson distributions; and
a step of removing the false molecular labels from the sequencing data obtained in (b),
wherein the false molecular labels comprise molecular labels having a lower number of occurrences than the nth most abundant molecular label, and the true molecular labels comprise molecular labels having a number of occurrences greater than or equal to that of the nth most abundant molecular label.
[124] The method according to [123], wherein the two Poisson distributions include a first Poisson distribution corresponding to the true molecular label and a second Poisson distribution corresponding to the false molecular label.
[125] A computer system for determining the number of targets, comprising:
a hardware processor; and a non-transitory memory storing instructions that, when executed by the hardware processor, cause the processor to perform the method according to any one of [40] to [124].
[126] A computer-readable medium comprising a software program containing code for performing the method according to any one of [40] to [124].

Claims

A method of determining the number of nucleic acid targets in a sample, comprising:
(a) a step of attaching probability barcodes to a plurality of nucleic acid targets using a plurality of probability barcodes to generate a plurality of nucleic acid targets with probability barcodes, wherein each of the plurality of probability barcodes comprises a molecular label;
(b) a step of acquiring sequencing data of the nucleic acid targets with the probability barcodes; and
(c) for one or more of the plurality of nucleic acid targets, the following steps (i) to (iv):
(i) a step of counting the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data;
(ii) a step of identifying a cluster of molecular labels of the nucleic acid target using directional proximity, wherein the identifying step comprises retrospectively determining, for all molecular labels having an identifiable sequence, whether a child molecule label belongs to a cluster containing one or more parent molecule labels, the molecular labels of the nucleic acid target within the cluster comprise one or more parent molecule labels and child molecule labels of the one or more parent molecule labels, and the number of occurrences of the parent molecule label is equal to or greater than a predetermined directional proximity occurrence number threshold;
(iii) a step of collapsing the sequencing data obtained in (b) using the cluster of molecular labels of the nucleic acid target identified in (ii); and
(iv) a step of estimating the number of the nucleic acid target, wherein the estimated number of the nucleic acid target correlates with the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i) after the sequencing data is collapsed in (iii).

The method of claim 1, wherein the molecular labels of the nucleic acid targets in the cluster are within a predetermined directional proximity threshold of each other.

The method of claim 2, wherein the directional proximity threshold is a Hamming distance of 1.

The method according to claim 1, wherein the predetermined directional proximity occurrence number threshold is 2 × (number of occurrences of the child molecule label) - 1.

The method according to any one of claims 1 to 4, wherein the step of collapsing the sequencing data obtained in (b) using the cluster of the molecular labels of the nucleic acid target identified in (ii) comprises a step of assigning the number of occurrences of the child molecule label to the parent molecule label.

The method according to any one of claims 1 to 5, further comprising a step of determining the sequencing depth of the nucleic acid target.

The method according to claim 6, wherein, when the sequencing depth of the nucleic acid target exceeds a predetermined sequencing depth threshold, the step of estimating the number of the nucleic acid target comprises adjusting the sequencing data counted in (i).

The method according to claim 7, wherein the step of adjusting the sequencing data counted in (i) comprises a step of thresholding the molecular labels of the nucleic acid target to determine the true molecular labels and the false molecular labels associated with the nucleic acid target in the sequencing data obtained in (b).

The method of claim 8, wherein the step of thresholding the molecular labels of the nucleic acid target comprises performing a statistical analysis on the molecular labels of the nucleic acid target.

The method of claim 9, wherein the step of performing the statistical analysis comprises:
a step of fitting the distribution of the molecular labels of the nucleic acid target and their numbers of occurrences to two negative binomial distributions;
a step of determining the number n of true molecular labels using the two negative binomial distributions; and
a step of removing the false molecular labels from the sequencing data obtained in (b), wherein the false molecular labels comprise molecular labels having a lower number of occurrences than the nth most abundant molecular label, and the true molecular labels comprise molecular labels having a number of occurrences equal to or greater than that of the nth most abundant molecular label.

The method of claim 10, wherein the two negative binomial distributions comprise a first negative binomial distribution corresponding to the true molecular labels and a second negative binomial distribution corresponding to the false molecular labels.

A method of determining the number of nucleic acid targets, comprising:
(a) a step of attaching probability barcodes to a plurality of nucleic acid targets using a plurality of probability barcodes to generate a plurality of nucleic acid targets with probability barcodes, wherein each of the plurality of probability barcodes comprises a molecular label;
(b) a step of acquiring sequencing data of the nucleic acid targets with the probability barcodes;
(c) a step of determining the sequencing status of the nucleic acid target in the sequencing data; and
(d) for one or more of the plurality of nucleic acid targets, the following steps (i) to (iii):
(i) a step of counting the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data;
(ii) a step of determining the number of noise molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data, wherein the step comprises, if a negative binomial distribution fitting condition is met based on the sequencing status:
(1) fitting a signal negative binomial distribution to the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i), wherein the signal negative binomial distribution corresponds to the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i) that are signal molecular labels;
(2) fitting a noise negative binomial distribution to the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i), wherein the noise negative binomial distribution corresponds to the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i) that are noise molecular labels; and
(3) determining the number of the noise molecular labels using the signal negative binomial distribution fitted in (1) and the noise negative binomial distribution fitted in (2); and
(iii) a step of estimating the number of the nucleic acid target, wherein the estimated number of the nucleic acid target corresponds to the number of the molecular labels having the identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i), adjusted according to the number of the noise molecular labels determined in (ii).

The method of claim 12, wherein the sequencing status of the nucleic acid target in the sequencing data is saturated sequencing, undersequencing, or oversequencing.

The method of claim 13, wherein the saturated sequencing status is determined by the nucleic acid target having a number of molecular labels with an identifiable sequence that is greater than a predetermined saturation threshold.

13 . The method described.

The method according to any one of claims 13 to 15, wherein the undersequencing status is determined by the nucleic acid target having a depth less than a predetermined undersequencing threshold, and the depth of the nucleic acid target comprises the average, minimum, or maximum depth of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data.

The method of claim 16, wherein the undersequencing threshold is independent of the number of the molecular labels having an identifiable sequence.

The method according to any one of claims 13 to 17, wherein the number of the noise molecular labels determined in (ii) is zero when the sequencing status of the nucleic acid target in the sequencing data is the undersequencing status.

The method according to any one of claims 13 to 18, wherein the oversequencing status is determined by the nucleic acid target having a depth greater than a predetermined oversequencing threshold, and the depth of the nucleic acid target comprises the average, minimum, or maximum depth of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data.

The method of claim 19, further comprising, when the sequencing status of the nucleic acid target in the sequencing data is the oversequencing status, subsampling the number of molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data to approximately the predetermined oversequencing threshold.

The method of claim 12, wherein the negative binomial distribution fitting condition comprises that the sequencing status of the nucleic acid target in the sequencing data is neither the undersequencing status nor the oversequencing status.

The method according to any one of claims 12 to 21, wherein the step of determining the number of the noise molecular labels using the signal negative binomial distribution fitted in (1) and the noise negative binomial distribution fitted in (2) comprises, for each of the identifiable sequences associated with the nucleic acid target in the sequencing data:
a step of determining a signal probability that the identifiable sequence is in the signal negative binomial distribution;
a step of determining a noise probability that the identifiable sequence is in the noise negative binomial distribution; and
a step of determining that the identifiable sequence is a noise molecular label if the signal probability is less than the noise probability.
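As a non-limiting illustration, the per-sequence comparison recited above might be sketched in Python as follows. The sketch assumes that the signal and noise negative binomial distributions have already been fitted and are supplied as (r, p) parameter pairs, with probability mass P(k) = Γ(k + r) / (k! Γ(r)) · p^r · (1 - p)^k; this parameterization, the function names, and the dictionary input are assumptions of the sketch and are not taken from the claims.

```python
import math

def nb_logpmf(k, r, p):
    """Log probability of k occurrences under a negative binomial with parameters (r, p)."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))

def noise_labels(label_counts, signal_params, noise_params):
    """Return the labels whose signal probability is less than their noise probability.

    signal_params and noise_params are hypothetical, pre-fitted (r, p) pairs.
    """
    return [label for label, k in label_counts.items()
            if nb_logpmf(k, *signal_params) < nb_logpmf(k, *noise_params)]

def adjusted_label_count(label_counts, signal_params, noise_params):
    """Number of counted labels minus the number classified as noise."""
    return len(label_counts) - len(noise_labels(label_counts, signal_params, noise_params))
```

With a signal distribution of mean 20 (r = 20, p = 0.5) and a noise distribution of mean 1 (r = 1, p = 0.5), labels observed once or twice are classified as noise while labels observed about 20 times are retained as signal.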

The method according to any one of claims 13 to 22, wherein the step of determining the number of the noise molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data comprises, if the sequencing status of the nucleic acid target in the sequencing data is neither the undersequencing status nor the oversequencing status and the number of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i) is less than a pseudo-point threshold, adding pseudo-points to the number of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data before determining, in (ii), the number of the noise molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data.

The method according to any one of claims 13 to 23, wherein the step of determining the number of the noise molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data comprises, if the sequencing status of the nucleic acid target in the sequencing data is neither the undersequencing status nor the oversequencing status and the number of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data counted in (i) is greater than or equal to the pseudo-point threshold, a step of removing non-unique molecular labels in determining, in (ii), the number of the noise molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data.

The method of claim 24, wherein the step of removing the non-unique molecular labels in determining, in (ii), the number of the noise molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data is performed if the number of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data is greater than a predetermined reused molecular label threshold.

The method of claim 24 or 25, wherein the step of removing the non-unique molecular labels comprises:
a step of determining the theoretical number of non-unique molecular labels for the number of the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data; and
a step of removing molecular labels having a larger number of occurrences than the nth most abundant molecular label among the molecular labels having an identifiable sequence associated with the nucleic acid target in the sequencing data, wherein n is the theoretical number of non-unique molecular labels.

A computer system for determining the number of nucleic acid targets, comprising:
a hardware processor; and
a non-transitory memory storing instructions that, when executed by the hardware processor, cause the processor to perform the method according to any one of claims 1 to 26.

A computer-readable medium comprising a software program comprising code for performing the method according to any one of claims 1 to 26.