WO2016057947A1 - Random nucleotide mutation for nucleotide template counting and assembly - Google Patents

Random nucleotide mutation for nucleotide template counting and assembly

Publication number
WO2016057947A1
Authority
WO
WIPO (PCT)
Prior art keywords
nams
group
template
read
mnams
Application number
PCT/US2015/054981
Other languages
French (fr)
Inventor
Michael Wigler
Dan Levy
Original Assignee
Cold Spring Harbor Laboratory
Application filed by Cold Spring Harbor Laboratory
Priority to EP15849374.2A (EP3204521B1)
Priority to AU2015330685A (AU2015330685B2)
Priority to US15/515,913 (US11008606B2)
Priority to EP21176777.7A (EP3957742A1)
Priority to CA2964169A (CA2964169C)
Publication of WO2016057947A1
Priority to IL251509A (IL251509B)
Priority to US17/320,634 (US20210340604A1)

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6827 Hybridisation assays for detection of mutation or polymorphism
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07H SUGARS; DERIVATIVES THEREOF; NUCLEOSIDES; NUCLEOTIDES; NUCLEIC ACIDS
    • C07H21/00 Compounds containing two or more mononucleotide units having separate phosphate or polyphosphate groups linked by saccharide radicals of nucleoside groups, e.g. nucleic acids
    • C07H21/04 Compounds containing two or more mononucleotide units having separate phosphate or polyphosphate groups linked by saccharide radicals of nucleoside groups, e.g. nucleic acids with deoxyribosyl as saccharide radical
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/102 Mutagenizing nucleic acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/102 Mutagenizing nucleic acids
    • C12N15/1031 Mutagenizing nucleic acids mutagenesis by gene assembly, e.g. assembly by oligonucleotide extension PCR
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1058 Directional evolution of libraries, e.g. evolution of libraries is achieved by mutagenesis and screening or selection of mixed population of organisms
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6851 Quantitative amplification
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2521/00 Reaction characterised by the enzymatic activity
    • C12Q2521/50 Other enzymatic activities
    • C12Q2521/539 Deaminase
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2525/00 Reactions involving modified oligonucleotides, nucleic acids, or nucleotides
    • C12Q2525/10 Modifications characterised by
    • C12Q2525/161 Modifications characterised by incorporating target specific and non-target specific sites
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2531/00 Reactions of nucleic acids characterised by
    • C12Q2531/10 Reactions of nucleic acids characterised by the purpose being amplify/increase the copy number of target nucleic acid
    • C12Q2531/113 PCR
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2535/00 Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
    • C12Q2535/122 Massive parallel sequencing

Definitions

  • the present invention provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
  • step (iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs
  • the present invention also provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
  • mNAMs: mutagenized NAMs
  • counting the number of different sequences obtained in step (iii) to determine the number of unique mNAMs in the group of mNAMs,
  • the present invention also provides a method for determining the number of different sequences in a group of nucleic acid molecules (NAMs) that have been mutagenized and then amplified comprising
  • counting the number of different sequences obtained in step (ii),
  • the present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
  • subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid bases in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM);
  • the present invention also provides a method for determining genomic copy number information from genomic material, comprising,
  • the present invention also provides a method for profiling RNA transcripts, comprising i) obtaining a group of RNA transcripts;
  • determining the proportionate number of a plurality of RNA transcripts having the same sequence to a second, different plurality of RNA transcripts that have the same sequence, thereby determining the RNA transcript profile.
  • the present invention also provides a method for determining allelic imbalance, comprising
  • the present invention also provides a method for determining genome assembly, comprising
  • step (iii) aligning the sequences obtained in step (ii) according to matching mutation patterns in overlaps of the sequences
  • the present invention also provides a method for determining haplotype assembly, comprising
  • step (iii) comparing the sequences obtained in step (ii) , thereby determining haplotype assembly.
  • the present invention also provides a kit for determining the number of NAMs in a group of NAMs comprising:
  • the mutagen is a bisulfite or a salt thereof, or a deamination agent.
  • the present invention also provides a composition of matter comprising a plurality of mutagenized nucleic acid molecules (mNAMs), wherein selected mutable nucleic acid positions in the plurality of mutagenized NAMs (mNAMs) are mutated at a rate of 10% to 90%.
  • mNAMs: mutagenized nucleic acid molecules
  • the present invention also provides a composition of matter derived from sequencing primers that has the sequence:
  • ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine.
  • the present invention also provides a composition of matter derived from sequencing primers that has the sequence: 5Phos-
  • the present invention also provides a composition of matter comprising two or more copies of a nucleic acid molecule (NAM) comprising two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, wherein each copy of the NAM has a unique primer at its 5' end and another unique primer at its 3' end, and is subjected to a mutagenesis that mutates each mutable position in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) , wherein the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
  • NAM: nucleic acid molecule
  • the present invention also provides sequence information produced by a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the present invention, or the group of mcNAMs in the method of the present invention.
  • the present invention also provides a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the present invention, or the group of mcNAMs in the method of the present invention.
  • Cytidine deamination is a mutational method that converts cytidine (underlined) to a uridine (bold). Upon amplification, uridine becomes a thymidine (bold) (panel C). The sequenced nucleotide strings are aligned and aggregated by their mutational patterns (panel D), and the number of distinct patterns is counted.
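The counting idea in this figure description can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `deaminate` and `count_distinct_patterns`, the example template, and the 50% conversion rate are all chosen here for demonstration, and sequencing and PCR errors are ignored.

```python
import random

def deaminate(template, flip_rate, rng):
    """Convert each cytosine to thymine independently with probability flip_rate.

    This models partial cytidine deamination followed by amplification,
    after which a converted C is read as T.
    """
    return "".join(
        "T" if base == "C" and rng.random() < flip_rate else base
        for base in template
    )

def count_distinct_patterns(reads):
    """Count the distinct mutation patterns among aligned reads of one locus."""
    return len(set(reads))

if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed so the run is repeatable
    template = "ACCGTCCATCGGCTCAACTC"   # hypothetical template sequence
    # Four original molecules with the same sequence, each mutated once.
    mutated = [deaminate(template, 0.5, rng) for _ in range(4)]
    # Amplification: every mutated molecule yields several identical reads.
    reads = [m for m in mutated for _ in range(5)]
    print(count_distinct_patterns(reads), "distinct patterns in", len(reads), "reads")
```

The count equals the number of original molecules only when no two molecules happen to receive the same pattern, which becomes more likely to fail as the number of mutable positions shrinks.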
  • each template has a unique pair of markers, or "end tags", denoted by the colored circle and square (panel A). Markers of the same color occur on the same template strands and are said to be "in phase". Gray marks on the templates show positions that may mutate.
  • panel B a random mutation process
  • each read is mapped to the reference template (panel C, top strand).
  • the phase of the original templates was recovered (panel D).
  • the ability to recover the template count is a function of the window size, template length, flip rate, number of templates, and depth of coverage. Simulation results of template count estimation are shown under a variety of conditions. Each panel has three plots, for window sizes of 10, 20, and 30 bits. The x axis shows the true count from 2 to 1,024 (log2 scale) and the y axis shows the average estimated count divided by the true count, or the proportion of templates recovered. Panel A simulates recovery when the template is one window long for a range of flip rates for infinite coverage. Panel B shows the results from one window template under finite coverage (1x to 5x reads per template) for a fixed flip rate of 0.35. Panel C repeats the results of panel B for long templates composed of 16 read lengths.
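For the simplest case described here (one window, exhaustive coverage), the expected proportion of templates recovered can be computed in closed form rather than simulated. The sketch below is an assumption-laden illustration, not the simulation used for the figure: it assumes every bit flips independently at the same rate, and the function names are hypothetical. A pattern with k flipped bits out of b has probability q = p^k (1 - p)^(b - k), and is seen at least once among N templates with probability 1 - (1 - q)^N.

```python
from math import comb, expm1, log1p

def _seen_at_least_once(q, n_templates):
    """Probability that a pattern of probability q occurs in n_templates draws."""
    if q <= 0.0:
        return 0.0
    if q >= 1.0:
        return 1.0
    # Numerically stable form of 1 - (1 - q) ** n_templates.
    return -expm1(n_templates * log1p(-q))

def expected_distinct(n_templates, n_bits, flip_rate):
    """Expected number of distinct bit patterns among n_templates templates.

    Each template carries n_bits mutable positions, each flipped independently
    with probability flip_rate; coverage is assumed to be exhaustive.
    """
    total = 0.0
    for k in range(n_bits + 1):
        # Probability that one template shows a specific pattern with k flips.
        q = flip_rate ** k * (1.0 - flip_rate) ** (n_bits - k)
        # comb(n_bits, k) different patterns share this probability.
        total += comb(n_bits, k) * _seen_at_least_once(q, n_templates)
    return total

def proportion_recovered(n_templates, n_bits, flip_rate):
    """Expected distinct patterns divided by the true template count."""
    return expected_distinct(n_templates, n_bits, flip_rate) / n_templates

if __name__ == "__main__":
    for bits in (10, 20, 30):
        row = [round(proportion_recovered(2 ** e, bits, 0.35), 3) for e in range(1, 11)]
        print(bits, "bits:", row)
```

Under these assumptions, recovery drops as the template count approaches the number of likely patterns, which is why larger windows tolerate larger counts.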
  • each read is a vertex. Some reads contain their end tag (colored circles) and some do not (white circles). Two reads were connected with an edge if they agree on their overlap. The weight of an edge reflects the strength of that overlap.
  • Panel A depicts the template information assuming exhaustive coverage, drawing all distinct reads and the edges between them.
  • sample reads were finitely sampled from the templates at a depth of coverage of 4x and 8x per template, respectively. From this information the greedy algorithm was applied (panels C1 and C2) to select the best edges, shown in red.
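The read graph and greedy edge selection described above can be sketched as follows. The text states only that reads are vertices, that edges join reads agreeing on their overlap, and that a greedy algorithm picks the best edges; everything else here is an assumption made for illustration: reads are represented as `(start, bits)` pairs, the edge weight is the overlap length in bits, and an edge is accepted only if it joins two groups that are not yet connected (a union-find rule).

```python
from itertools import combinations

def overlap_weight(read_a, read_b):
    """Return the overlap length in bits if the reads agree on it, else 0.

    A read is (start, bits): start is the index of its first mutable
    position and bits is a string of '0'/'1' flip states.
    """
    (start_a, bits_a), (start_b, bits_b) = read_a, read_b
    lo = max(start_a, start_b)
    hi = min(start_a + len(bits_a), start_b + len(bits_b))
    if hi <= lo:
        return 0  # the reads do not overlap
    same = bits_a[lo - start_a:hi - start_a] == bits_b[lo - start_b:hi - start_b]
    return hi - lo if same else 0

def greedy_assemble(reads, min_overlap=1):
    """Greedily join reads along the heaviest agreeing overlaps.

    Returns (chosen_edges, n_groups); n_groups is a rough template estimate.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for i, j in combinations(range(len(reads)), 2):
        weight = overlap_weight(reads[i], reads[j])
        if weight >= min_overlap:
            edges.append((weight, i, j))

    chosen = []
    for weight, i, j in sorted(edges, reverse=True):  # heaviest edges first
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            chosen.append((i, j, weight))

    n_groups = len({find(i) for i in range(len(reads))})
    return chosen, n_groups

if __name__ == "__main__":
    # Hypothetical reads from two templates, two reads each.
    reads = [(0, "1010"), (2, "1011"), (0, "0101"), (2, "0111")]
    print(greedy_assemble(reads))
```

Short overlaps can agree by chance, so raising `min_overlap` trades missed joins for fewer false ones.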
  • Panel A depicts the rate at which each base is observed over all the data for those positions with a coverage of at least 30 reads.
  • Panel B depicts the cumulative distribution of the conversion rate per read.
  • Panel C depicts the correlation in flips for all cytosines in the targeted region.
  • Panel D depicts panel C as a histogram. For the best amplified position, partial conversion was determined with a 60% flip rate randomly distributed throughout each read.
  • Figure 7 Conversion rate as a function of incubation time and temperature.
  • the datasets A3, A6, A9 and A45 are the 3, 6, 9 and 45 minute conversions described herein.
  • Figure 8 Subset of clustered reads showing mutational patterns and two heterozygous positions in the sample. The panel on the left shows all the positions in the fragment while the plot on the right shows only cytosine (bit) positions. The white lines separate reads derived from the same initial template. Each cluster contains between 30 and 50 reads. Black indicates a position where the read matches the reference genome. The frequent gray squares are cytosines that have converted to thymine. The white and light grey streaks are linked heterozygous alleles which split according to mutation pattern. Sparse background errors are typically from the sequencer while the bands of error are typically the result of PCR.
  • Figure 10 Comparison of template count distributions for fragments from the autosome and X chromosome. Since the sample is male, a 2 to 1 ratio was observed in the mean template counts. The histogram is the empirical distribution and the curve shows a negative binomial fit.
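A negative binomial curve of this kind can be fitted to template counts in several ways; the text does not say which was used. The sketch below uses a simple method-of-moments fit and a ratio of means, purely as an illustration. The function names and the example counts are hypothetical and are not data from the study.

```python
from statistics import mean, pvariance

def fit_negative_binomial(counts):
    """Method-of-moments fit of a negative binomial to count data.

    Returns (r, p) in the parameterization with mean r * (1 - p) / p and
    variance r * (1 - p) / p ** 2. Requires variance greater than the mean.
    """
    m = mean(counts)
    v = pvariance(counts)
    if v <= m:
        raise ValueError("counts are not over-dispersed; cannot fit by moments")
    p = m / v
    r = m * m / (v - m)
    return r, p

def mean_ratio(counts_a, counts_b):
    """Ratio of mean template counts between two sets of fragments."""
    return mean(counts_a) / mean(counts_b)

if __name__ == "__main__":
    autosome = [9, 14, 22, 31, 12, 18]   # hypothetical per-fragment template counts
    chrom_x = [5, 7, 10, 16, 6, 9]       # hypothetical per-fragment template counts
    print("autosome fit:", fit_negative_binomial(autosome))
    print("mean ratio:", mean_ratio(autosome, chrom_x))
```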
  • Figure 11 Heterozygous allele counts by template demonstrate perfect fit to the binomial distribution.
  • the plot on the left shows as a histogram the counts for one allele.
  • the curve shows the theoretical expectation of the count distribution assuming a binomial distribution for the allele at each locus assuming the given locus coverage.
  • the plot on the right shows the Q-Q plot over all 6000 heterozygous positions observed.
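The theoretical curve described for this figure can be sketched as a sum of binomial distributions, one per locus, each using that locus's template coverage. This is a minimal illustration assuming an unbiased heterozygous locus (allele probability 0.5); the function names and the example coverages are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of seeing k templates of one allele among n templates."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def expected_allele_count_distribution(coverages, p=0.5):
    """Expected number of loci showing each allele count.

    coverages holds the template coverage of each heterozygous locus; the
    result at index k is the expected number of loci with k templates of
    the chosen allele.
    """
    expected = [0.0] * (max(coverages) + 1)
    for n in coverages:
        for k in range(n + 1):
            expected[k] += binomial_pmf(k, n, p)
    return expected

if __name__ == "__main__":
    coverages = [8, 10, 10, 12, 15]  # hypothetical per-locus template counts
    for k, value in enumerate(expected_allele_count_distribution(coverages)):
        print(k, round(value, 4))
```

Comparing this expectation with the observed histogram, or plotting quantiles against each other, is one way to check for allelic imbalance.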
  • Figure 13 Properties of the consensus sequences derived by clustering reads with the same mutation pattern. For each consensus sequence base, the proportion of reads reporting the homozygous base was determined. These error rates are an order of magnitude below sequencer error.
  • the present invention provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
  • step (iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs
  • the present invention provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising i) obtaining an amplified and mutagenized group of NAMs that was produced by
  • step (iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs
  • obtaining sequences comprises obtaining composite sequences produced by assembling sequence reads of the mNAMs by a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
  • obtaining sequences comprises obtaining composite sequences produced by assembling sequence reads of the mNAMs by
  • the present invention also provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
  • mNAMs: mutagenized NAMs
  • counting the number of different sequences obtained in step (iii) to determine the number of unique mNAMs in the group of mNAMs,
  • the sequencing comprises assembling sequence reads of the mNAMs into composite sequences by
  • wherein the counting comprises counting the number of different composite sequences obtained in step (iii).
  • a sub-group of NAMs in the group of NAMs is determined to have substantially the same nucleotide sequence.
  • the sub-group of NAMs is determined to have nucleotide sequences that are at least 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, or 99.9% identical.
  • the nucleotide sequences of a sub-group of NAMs comprise a stretch of consecutive nucleotides having a sequence which includes at least two mutable positions and is i) identical to the sequence of a stretch of consecutive nucleotides within another NAM within the sub-group of NAMs, or ii) determined to be at least 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, or 99.9% identical to the sequence of a stretch of consecutive nucleotides within another NAM within the sub-group of NAMs.
  • the counting comprises counting the number of different sequences that are determined to have substantially the same sequence except for their mutable positions, thereby determining the number of NAMs in the group of NAMs that had substantially the same sequence. In one or more embodiments, the counting comprises counting the number of different sequences which lack substantially the same sequence in any stretch including at least two mutable positions, thereby determining the number of NAMs without substantially the same sequence in the group of NAMs.
  • the present invention also provides a method for determining the number of different sequences in a group of nucleic acid molecules (NAMs) that have been mutagenized and then amplified comprising
  • counting the number of different sequences obtained in step (ii),
  • the present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
  • a copy of the NAM is a partial copy of the NAM.
  • a copy of the NAM has at least 50 bp of identical or complementary sequence to the NAM.
  • a copy of the NAM is a complete copy of the NAM.
  • the present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
  • subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid bases in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM);
  • the present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
  • subjecting each copy of the NAM in step (i) to a chemical mutagenesis that mutates only select nucleic acid bases in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM);
  • each of the two or more copies of the NAM has a unique primer at its 5' end and another unique primer at its 3' end.
  • the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
  • the present invention also provides a method for determining genomic copy number information from genomic material, comprising,
  • the present invention also provides a method for profiling RNA transcripts, comprising
  • determining the proportionate number of a plurality of RNA transcripts having the same sequence to a second, different plurality of RNA transcripts that have the same sequence, thereby determining the RNA transcript profile.
  • the present invention also provides a method for determining allelic imbalance, comprising
  • the present invention also provides a method for determining genome assembly, comprising
  • step (iii) aligning the sequences obtained in step (ii) according to matching mutation patterns in overlaps of the sequences
  • the present invention also provides a method for determining haplotype assembly, comprising i) obtaining a group of alleles, wherein the alleles in the group of alleles are located in the same chromosome;
  • the rate of mutagenizing each mutable position of the NAMs in the group of NAMs is 25% to 75%.
  • the rate of mutagenizing each mutable position of the NAMs in the group of NAMs is 40% to 60%.
  • the rate of mutagenizing each mutable position of the NAMs in the group of NAMs is 50%.
  • the proportion of all bases mutated in each mNAM is about 3% to 30%.
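The per-position rate and the proportion of all bases mutated are linked through the share of mutable bases in the molecule: when only cytosines are mutable, a 50% rate applied to a molecule with 20% cytosines mutates about 10% of all bases. The sketch below works this arithmetic for a given sequence. It is an illustration only; the function name is hypothetical, and it counts mutable bases on the given strand alone.

```python
def expected_mutated_fraction(sequence, flip_rate, mutable_base="C"):
    """Expected fraction of all bases that are mutated.

    Only occurrences of mutable_base can mutate, each independently with
    probability flip_rate.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    n_mutable = sequence.upper().count(mutable_base)
    return flip_rate * n_mutable / len(sequence)

if __name__ == "__main__":
    seq = "ACGTACGTAC"  # hypothetical fragment with 3 cytosines in 10 bases
    print(expected_mutated_fraction(seq, 0.5))
```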
  • the mutagenesis is by cytosine deamination .
  • the mutagenesis is performed after binding template molecules to a bead or surface.
  • biotinylated primers are attached to templates .
  • templates linked to biotinylated moieties are attached to streptavidin beads .
  • the mutagenesis further comprises beads and/or varietal tags .
  • the cytosine deamination is induced by a bisulfite or a salt thereof. In some embodiments, the cytosine deamination is induced by enzymology .
  • the cytosine deamination is induced by an activation-induced deaminase.
  • the mutagenesis comprises contacting the group of NAMs with a depurination agent, transposase agent, or an alkylating agent.
  • each mutable position of the NAMs comprises a cytosine (C) .
  • the cytosine (C) is mutated to a uracil (U) or a thymine (T) .
  • each NAM in the group of NAMs has a unique primer at its 5' end and another unique primer at its 3' end.
  • the primer comprises one or more methylated cytosines.
  • the primer comprises one or more phosphorothioated nucleotide bases .
  • the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
  • the primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATCT (Seq ID p5) .
  • the primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine .
  • the cytosines (C) are methylated.
  • the primer has the sequence: GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Seq ID p7)
  • the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
  • the primer has the sequence: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
  • the cytosines (C) are methylated.
  • the assembling further comprises aligning the sequences according to unique primers at the 5' and 3' ends .
  • the sequence of each unique primer comprises a segment that is substantially the same sequence, and an amplification primer that is complementary to the shared sequence is used when amplifying the mNAMs or copies thereof.
  • amplification is performed using a sequence-specific "wobble" primer.
  • each unique primer comprises a unique tag sequence.
  • the method further comprises the step of tagging each NAM or copy thereof.
  • the tag lacks a nucleotide that is mutable by the mutagenesis.
  • the NAM is within a mixture of DNA or RNA extracted from a cell. In some embodiments, the DNA or RNA extracted from the cell has been fragmented .
  • the DNA or RNA extracted from the cell has been fragmented by mechanical shearing or one or more restriction enzymes.
  • the one or more restriction enzymes is PstI.
  • fragmentation occurs before amplifying. In one or more embodiments, fragmentation occurs after amplifying.
  • fragmentation occurs after mutagenesis.
  • the method of the claimed invention further comprises subjecting the fragmented DNA or RNA to end-repair.
  • the method of the claimed invention further comprises subjecting the fragmented DNA or RNA to adenylation .
  • the method of the claimed invention further comprises subjecting the fragmented DNA or RNA to ligation with methyl-cytosine adaptors, wherein the methyl-cytosine adaptors are bisulfite resistant sequencing adaptors.
  • the NAM is a DNA molecule.
  • the DNA molecule is a fragment of genomic DNA.
  • the DNA molecule is a cDNA molecule.
  • the NAM is an RNA molecule.
  • the RNA molecule is an mRNA molecule. In one or more embodiments, the RNA molecule is a viral RNA molecule .
  • the NAM is an RNA molecule derived from one or more cell lines.
  • the method of the claimed invention further comprises reverse transcription of the NAM.
  • the reverse transcription is with poly-T and methyl-cytosines , wherein the methyl-cytosines are resistant to bisulfite mutation.
  • chemical mutagenesis occurs prior to reverse transcription.
  • one or more NAMs in the group of NAMs has a length of one sequencing read length.
  • one or more NAMs in the group of NAMs has a length of two or more sequencing read lengths .
  • the length of the NAM is 2, 3, 4, 5, 6, 7, 8, 9, 10, or 10-3000 sequencing read lengths.
  • the number of NAMs in the group of NAMs is about 2, 3, 4, 5, 6, 7, 8, 9, 10, or 10-10000.
  • if the number of NAMs in the group of NAMs is greater than 10000, the group of NAMs is diluted.
  • the amplifying is by short-range or long-range polymerase chain reaction (PCR) .
  • the mutagen of the mutagenesis is diluted .
  • the group of NAMs is incubated with a mutagen at an incubation temperature of about 70 to 78 degrees Celsius .
  • the incubation temperature is about 73 degrees Celsius.
  • the group of NAMs is incubated with a mutagen at an incubation time of about 3 to 45 minutes.
  • the group of NAMs is incubated with a mutagen at an incubation time of about 5 to 20 minutes.
  • the incubation time is about 3, 6, or 9 minutes .
  • the incubation time is about 10 minutes.
  • the present invention also provides a kit for determining the number of NAMs in a group of NAMs comprising:
  • the mutagen is a bisulfite or a salt thereof, or a deamination agent.
  • the bisulfite or salt thereof is NaHSO3.
  • the mutagen induces cytosine deamination .
  • the cytosine deamination is by enzymology .
  • the mutagen is diluted.
  • the kit of the present invention further comprises a plurality of unique primers including:
  • substantially unique primers comprise substantially unique tags .
  • the kit of the present invention further comprises a DNA polymerase having 3' -5' proofreading activity.
  • the plurality of substantially unique primers comprises 10^n primers, wherein n is an integer from 2 to 9.
  • the substantially unique tags are at least 6 nucleotides long.
  • the substantially unique tags are at least 15 nucleotides long.
  • the substantially unique primers comprise sets of substantially unique primers having shared sample tags .
  • the sample tags are at least 2 or 4 nucleotides long.
  • the sequence of the substantially unique tag is not altered by the mutagen.
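The tag lengths and primer counts listed above can be related by counting how many distinct tags a given length allows. The sketch below assumes, as one reading of the requirement that the tag lacks a mutable nucleotide, that tags avoid cytosine and are drawn from A, G and T only; that alphabet and the function names are assumptions made here for illustration.

```python
def tag_space_size(tag_length, alphabet="AGT"):
    """Number of distinct tags of tag_length over the given alphabet."""
    return len(set(alphabet)) ** tag_length

def min_tag_length(n_primers, alphabet="AGT"):
    """Smallest tag length whose tag space holds n_primers distinct tags."""
    size = len(set(alphabet))
    if size < 2:
        raise ValueError("alphabet needs at least two distinct letters")
    length, capacity = 0, 1
    while capacity < n_primers:
        length += 1
        capacity *= size
    return length

if __name__ == "__main__":
    for n in range(2, 10):
        print("10^%d primers need tags of at least %d nt" % (n, min_tag_length(10 ** n)))
```

Because the tags only need to be substantially unique, shorter tags than these minimums may still be usable when occasional collisions are acceptable.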
  • the kit of the present invention further comprises a primer wherein the cytosines (C) are methylated.
  • the methylated primer having the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine.
  • the methylated primer having the sequence: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5' -phosphorylated, deoxyuridine-containing anchor-primer .
  • the present invention also provides a composition of matter comprising a plurality of mutagenized nucleic acid molecules (mNAMs), wherein selected mutable nucleic acid positions in the plurality of mutagenized NAMs (mNAMs) are mutated at a rate of 10% to 90%.
  • mNAMs: mutagenized nucleic acid molecules
  • each mutable position is mutated at a rate of 25% to 75%.
  • each mutable position is mutated at a rate of 40% to 60%.
  • each mutable position is mutated at a rate of 50%.
  • each mutable nucleic acid base is mutated at a rate of 25% to 75%.
  • each mutable nucleic acid base is mutated at a rate of 40% to 60%.
  • each mutable nucleic acid base is mutated at a rate of 50%.
  • the proportion of all nucleic acids mutated in each mNAM is about 3% to 30%.
  • the mutable nucleic acid position is a cytosine position of the mNAMs and the mutagenesis is deamination of the cytosine.
  • the mutable nucleic acid base is a cytosine base of the mNAMs and the mutagenesis is deamination of the cytosine .
  • the deamination of the cytosine is induced by a bisulfite or a salt thereof.
  • the cytosine deamination of the cytosine is induced by enzymology.
  • the cytosine deamination of the cytosine is induced by an activation-induced deaminase.
  • the mutable nucleic acid position is mutagenized by contacting the group of NAMs with a depurination agent, transposase agent, or an alkylating agent.
  • each mutable position of the NAMs comprises a cytosine (C) .
  • the cytosine (C) is mutated to a uracil (U) or a thymine (T) .
  • each NAM in the plurality of NAMs has a unique primer at its 5' end and another unique primer at its 3' end.
  • the primer comprises one or more methylated cytosines.
  • the primer comprises one or more phosphorothioated nucleotide bases .
  • the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
  • the primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATCT (Seq ID p5) .
  • the plurality of mNAMS bearing a primer wherein the sequence of the primer is:
  • cytosines (C) are methylated.
  • all the cytosines (C) are methylated.
  • the primer has the sequence: GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Seq ID p7) .
  • the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
  • GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5' -phosphorylated, deoxyuridine-containing anchor-primer .
  • the cytosines (C) are methylated.
  • the present invention also provides a composition of matter derived from sequencing primers has the sequence:
  • ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine .
  • the present invention also provides a composition of matter derived from sequencing primers has the sequence: 5Phos-
  • the present invention also provides a composition of matter comprising two or more copies of a nucleic acid molecule (NAM) comprising two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, wherein each copy of the NAM has a unique primer at its 5' end and another unique primer at its 3' end, and is subjected to a mutagenesis that mutates each mutable position in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) , wherein the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
  • the present invention also provides sequence information produced by a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the claimed invention, or the group of mcNAMs in the method of the claimed invention.
  • the present invention also provides a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the claimed invention, or the group of mcNAMs in the method of the claimed invention.
  • the present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
  • step (i) subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid positions in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) ;
  • the present invention also provides a method for distinguishing between benign and malignantly transformed cells by detecting one or more single nucleotide polymorphisms (SNPs) in a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
  • the present invention also provides a method for distinguishing between benign and malignantly transformed cells by detecting one or more single nucleotide polymorphisms (SNPs) in a first and second sample from a subject comprising a method of the claimed invention.
  • the present invention also provides a method for determining the presence of tumor cells in a sample by comparing a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
  • the present invention also provides a method for determining the presence of tumor cells in a sample by comparing a first and second sample from a subject comprising a method of the claimed invention.
  • the present invention also provides a method for quantifying tumor cells in a sample by comparing a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
  • the present invention also provides a method for quantifying tumor cells in a sample by comparing a sample from a first and second sample from a subject comprising a method of the claimed invention.
  • the present invention also provides a method for detecting one or more rare mutations by comparing a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
  • the present invention also provides a method for detecting one or more rare mutations by comparing a sample from a first and second sample from a subject comprising a method of the claimed invention.
  • the sample is a blood sample, plasma sample, serum sample, tissue sample, or cell sample.
  • the tissue sample is from a tumor mass, surgically removed tumor mass, or margins of a surgically removed tumor mass.
  • the present invention also provides a method for detecting one or more rare mutations in a cell-free or substantially cell-free sample comprising a method of the claimed invention.
  • the present invention also provides a method for determining whether a fetus has at least one or more rare mutations in a cell-free or substantially cell-free sample comprising a method of the claimed invention
  • the sample is a maternal sample.
  • the maternal sample is obtained from a member selected from: maternal blood, maternal plasma and maternal serum .
  • nucleic acid molecule and “sequence” are not used interchangeably herein.
  • sequence refers to the sequence information of a “nucleic acid molecule”.
  • mutable position refers to the position of a nucleic acid that is susceptible to a given type of chemical mutagenesis.
  • determining the number refers to determining the lower bound number.
  • X% with respect to mutation rate, refers to the probability percentage of mutagenesis per mutable position of the multiple mutable positions that are present in a plurality of nucleic acid molecules. Thus, 25% mutation rate means a 25% probability of mutagenesis.
  • nucleic acid shall mean any nucleic acid, including, without limitation, DNA, RNA and hybrids thereof.
  • the nucleic acid bases that form nucleic acid molecules can be the bases A, C, G, T and U, as well as derivatives thereof .
  • contig and contiguous refer to a set of overlapping sequences or sequence reads.
  • amplifying refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid.
  • Amplifying a nucleic acid molecule typically includes denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. The denaturing, annealing and elongating steps each can be performed once.
  • the denaturing, annealing and elongating steps are performed multiple times (e.g., polymerase chain reaction (PCR) ) such that the amount of amplification product is increasing, often times exponentially, although exponential amplification is not required by the present methods.
  • Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme.
  • the term "amplified nucleic acid molecule” refers to the nucleic acid sequences, which are produced from the amplifying process as defined herein.
  • bisulfite mutagenesis refers to the mutagenesis of nucleic acid with a reagent used for the bisulfite conversion of cytosine to uracil.
  • bisulfite conversion reagents include but are not limited to treatment with a bisulfite, a disulfite or a hydrogensulfite compound.
  • mapping refers to identifying a location on a genome or cDNA library that has a sequence which is substantially identical to or substantially fully complementary.
  • the nucleic acid molecule may be, but is not limited to the following: a segment of genomic material, a cDNA, a mRNA, or a segment of a cDNA.
  • methylation refers to the covalent attachment of a methyl group at the C5-position of the nucleotide base cytosine.
  • methylation state or refers to the presence or absence of 5-methyl-cytosine ("5-Me") at one or a plurality of CpG dinucleotides within a DNA sequence.
  • a methylation site is a sequence of contiguous linked nucleotides that is recognized and methylated by a sequence specific methylase .
  • a methylase is an enzyme that methylates (i.e., covalently attaches a methyl group to) one or more nucleotides at a methylation site.
  • the term "read” or “sequence read” refers to the nucleotide or base sequence information of a nucleic acid that has been generated by any sequencing method.
  • a read therefore corresponds to the sequence information obtained from one strand of a nucleic acid fragment.
  • a DNA fragment where sequence has been generated from one strand in a single reaction will result in a single read.
  • multiple reads for the same DNA strand can be generated where multiple copies of that DNA fragment exist in a sequencing project or where the strand has been sequenced multiple times.
  • a read therefore corresponds to the purine or pyrimidine base calls or sequence determinations of a particular sequencing reaction.
  • sequencing refers to nucleotide sequence information that is sufficient to identify or characterize the nucleic acid molecule, and could be the full length or only partial sequence information for the nucleic acid molecule.
  • reference genome refers to a genome of the same species as that being analyzed for which genome the sequence information is known.
  • region of the genome refers to a continuous genomic sequence comprising multiple discrete locations.
  • sample tag refers to a nucleic acid having a sequence no greater than 1000 nucleotides and no less than two that may be covalently attached to each member of a plurality of tagged nucleic acid molecules or tagged reagent molecules.
  • a “sample tag” may comprise part of a "tag.”
  • segments of genomic material refers to the nucleic acid molecules resulting from fragmentation of genomic DNA.
  • substantially the same sequences have at least about 80% sequence identity or complementarity, respectively, to a nucleotide sequence. Substantially the same sequences may have at least about 95%, 96%, 97%, 98%, 99% or 100% sequence identity or complementarity, respectively.
  • substantially unique primers refers to a plurality of primers, wherein each primer comprises a tag, and wherein at least 50% of the tags of the plurality of primers are unique.
  • the tags are at least 60%, 70%, 80%, 90%, or 100% unique tags.
  • substantially unique tags refers to tags in a plurality of tags, wherein at least 50% of the tags of the plurality are unique to the plurality of tags.
  • substantially unique tags will be at least 60%, 70%, 80%, 90%, or 100% unique tags.
  • tag refers to a nucleic acid having a sequence no greater than 1000 nucleotides and no less than two that may be covalently attached to a nucleic acid molecule or reagent molecule.
  • a tag may comprise a part of an adaptor or a primer.
  • a "tagged nucleic acid molecule” refers to a nucleic acid molecule which is covalently attached to a "tag.”
  • the term "wobble base pairing" with regard to two complementary nucleic acid sequences refers to the base pairing of G to uracil U rather than C, when one or both of the nucleic acid strands contains the ribonucleobase U.
  • the term "substantially fully complementary” with regard to a sequence refers to the reverse complement of the sequence allowing for both Watson-Crick base pairing and wobble base pairing, whereby G pairs with either C or U, and A pairs with either U or T.
  • a sequence may be substantially complementary to the entire length of another sequence, or it may be substantially complementary to a specified portion or length of another sequence.
  • U may be present in RNA
  • T may be present in DNA. Therefore, a U within an RNA sequence may pair with A or G in either an RNA sequence or a DNA sequence, while an A within either of a RNA or DNA sequence may pair with a U in a RNA sequence or T in a DNA sequence .
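  • As an illustration of the pairing rules defined above, the following is a minimal sketch, not part of the disclosed programs, of a check for "substantially fully complementary" sequences. The function name and the restriction to equal-length, gapless comparison are assumptions made for the example.

```python
# Allowed pairings taken from the definitions above:
# Watson-Crick pairs plus the G-U wobble pair.
ALLOWED_PAIRS = {
    ("A", "T"), ("T", "A"), ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
}


def is_substantially_fully_complementary(seq, other):
    """Return True if `other` pairs with `seq` base by base when read antiparallel.

    Hypothetical helper: both sequences are given 5' to 3' and must have
    the same length; gaps and partial overlaps are not handled.
    """
    if len(seq) != len(other):
        return False
    return all((a, b) in ALLOWED_PAIRS for a, b in zip(seq, reversed(other)))
```

  • Under these assumptions, "GTCC" and the wobble-containing "GUCU" are both accepted as partners of "GGAC", while "GTCA" is rejected.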
  • a "wobble" primer is a set of primers where the sequence at mutable positions is equally likely to match the original base or the mutated base .
  • a library of Python programs was developed (Example 11) to simulate mutation, sequencing, counting and assembly of distinct templates under the assumptions of error-free sequencing, perfect mapping, and uniformity of mutation sites, mutation rate, sequence coverage, and DNA amplification.
  • Our ability to recover template count and assembly depends on the depth of read sampling, typically called “coverage”. Coverage usually means the average number of reads overlapping a position in the reference genome; herein, however, coverage means the average read depth over a position per template.
  • the first class of applications focuses on the general problem of determining absolute template count. This is important for determining the copy number of genomic DNA, measuring mRNA expression levels, quantifying allele bias, and detecting somatic mutations.
  • the protocol requires mutagenesis prior to amplification. Amplification could be either short- or long-range PCR but must occur before fragmentation, if fragmentation is needed for library preparation. The number of possible mutation patterns should exceed the template number to obtain the most accurate count. Hence, only cases where the absolute template count is below the low thousands were considered; the other cases are saved for the Discussion.
  • the number of possible patterns depends on the number of bits per read, and the probability of observing a given pattern depends also on the flip rate.
  • the optimal flip rate for generating distinct mutational patterns is 0.5, wherein every pattern is equally likely. However for a window of at least 20 bits, corresponding to a read length of 80 base pairs, a rate of 0.25 is still virtually perfect for template counts in the thousands. Similar efficacy is obtained at a flip rate of 0.15 for a 30-bit window. Templates numbering in the thousands are adequate for genome copy number determination or single-cell transcript profiling.
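  • The dependence of pattern distinctness on flip rate and window size can be estimated with a short calculation. The sketch below is an illustration under the stated assumption of independent bits, not the simulation code of Example 11; the function names are hypothetical. Two templates agree at one bit with probability p^2 + (1 - p)^2, which is smallest at p = 0.5.

```python
from math import comb


def pair_collision_probability(flip_rate, bits):
    """Probability that two independently mutated templates share one pattern."""
    same_bit = flip_rate ** 2 + (1.0 - flip_rate) ** 2
    return same_bit ** bits


def expected_colliding_pairs(flip_rate, bits, template_count):
    """Expected number of template pairs with identical patterns."""
    pairs = comb(template_count, 2)
    return pairs * pair_collision_probability(flip_rate, bits)
```

  • Evaluating `expected_colliding_pairs` over a grid of flip rates and window sizes gives a rough indication of when distinct templates are likely to be confused, before running a full simulation.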
  • example code is provided to simulate performance under a variety of conditions. The recovery of template count was also demonstrated subject to varying depth of coverage for a fixed flip rate of 0.35 ( Figure 3, panel B) .
  • a path in this graph represents a possible partial assembly of an initial template pattern. Consequently, determining the minimum number of templates needed to explain all of the reads is achieved by finding the minimum number of paths such that every vertex in the graph is included in at least one path. This is known as the minimum path cover problem and in general is a nondeterministic polynomial-time hard (NP-hard) problem.
  • our graph is not only directed, but also acyclic.
  • the minimum number of covering paths is equivalent to the maximum number of elements in an antichain (Dilworth, 1950) .
  • the minimum number of templates needed to explain the reads is equal to the maximum number of reads that are pairwise incompatible.
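  • For a directed acyclic graph, the equivalence stated above can be computed in polynomial time. The following is a minimal sketch, assuming reads are numbered 0 to n-1 and compatibility is supplied as directed edges (both are assumptions for the example). Because covering paths may share reads, the count is taken on the transitive closure and obtained as n minus a maximum bipartite matching, which is a standard construction and not necessarily the one used in the simulations.

```python
def transitive_closure(n, edges):
    """Return reach[i] = set of vertices reachable from i in a DAG."""
    reach = [set() for _ in range(n)]
    for u, v in edges:
        reach[u].add(v)
    for k in range(n):          # Warshall-style closure
        for i in range(n):
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach


def minimum_template_count(n, edges):
    """Minimum number of paths covering all reads (equals the largest antichain)."""
    reach = transitive_closure(n, edges)
    match_right = [-1] * n      # match_right[v] = u means u is followed by v

    def try_augment(u, seen):
        for v in reach[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matched = sum(try_augment(u, set()) for u in range(n))
    return n - matched
```

  • For example, with edges 0→1, 1→2 and 0→4 plus an isolated read 3, reads 1, 3 and 4 are pairwise incompatible, so at least three templates are needed.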
  • the simulations of this section provide guidelines for (i) genome wide copy number determination, (ii) transcript profiling, and (iii) determining allelic ratios.
  • for copy number, the ratio of the count for a given locus to the median count over the remainder of the genome was measured.
  • for transcript profiling, the proportionate counts of each gene transcript were measured.
  • for allelic imbalance, the ratio of counts from templates distinguishable by at least one SNV was measured. In the context of RNA, this also enables observation of biased allele expression resulting from chromosome inactivation, imprinting and the like.
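  • The three measurements above reduce to simple arithmetic on template counts. The sketch below is illustrative only; the function names and the dictionary input format are assumptions, not taken from the simulation library.

```python
from statistics import median


def copy_number_ratio(counts, locus):
    """Count at `locus` divided by the median count over all other loci."""
    others = [count for name, count in counts.items() if name != locus]
    return counts[locus] / median(others)


def transcript_proportions(counts):
    """Proportionate count of each gene transcript."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def allelic_ratio(count_a, count_b):
    """Ratio of template counts for two alleles distinguished by an SNV."""
    if count_b == 0:
        return float("inf")  # assumption: report an unbounded ratio
    return count_a / count_b
```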
  • the second class of applications is to correctly assemble reads by their mutation patterns in order to recover the proper end-to-end sequence of nearly identical templates, desired when determining haplotype phasing or enumerating transcript isoforms.
  • Long templates each tagged uniquely at both ends were considered to simulate the more general task of determining how many initial templates can be correctly assembled from end to end from the mutation pattern alone ( Figure 2) .
  • reads were connected with overlapping mutation signatures to assemble a path from one tag to the other. Whereas in the previous application, all compatible edges between reads were allowed, for this problem a subgraph was built with only the "best edges" between overlapping reads.
  • a pair of tags is "exactly matched" if there is a path in the subgraph that connects them and neither tag is connected to another tag. Such a path is called an "exact path.” If two tags originate from the same template, they are a “true match.” A "true path” is an exact path for which every read originates from the same initial template .
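  • The definition of an exact match can be expressed as a connectivity test. The sketch below assumes the "best edge" subgraph is given as a list of undirected edges between read and tag identifiers, which is a simplification made for the example: two tags are reported as exactly matched when they are the only two tags in their connected component.

```python
from collections import defaultdict


def exact_tag_matches(edges, tags):
    """Return tag pairs that are connected to each other and to no other tag."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    tag_set = set(tags)
    visited = set()
    matches = []
    for tag in sorted(tag_set):
        if tag in visited:
            continue
        # Collect the connected component containing this tag.
        component, stack = set(), [tag]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        visited |= component
        tags_here = sorted(component & tag_set)
        if len(tags_here) == 2:
            matches.append(tuple(tags_here))
    return matches
```

  • Deciding whether an exact match is also a true match requires knowledge of the template of origin, which is available in simulation but not in real data.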
  • Determining performance for the general task provides a lower bound on performance for other applications, because if there is an exact path that is also true, then all sequence information for that template was correctly observed. This includes haplotype phasing in the case of genomic data and transcript structure in the case of RNA profiling. In fact, these two applications are less demanding than the general task because there will only be a few template varieties and each template variety provides additional sequence information for distinguishing them.
  • Figure 4 panel B explores the effect of coverage (2x to 14x per template) on recovery of exact matches as a function of template length (2 to 1,024 read lengths), for 32 templates, a 30-bit read length, and a flip rate of 0.35.
  • Partial bisulfite mutagenesis was obtained in single-stranded phiX174 genomic DNA using the MethylEasy™ Xceed Rapid DNA Bisulphite Modification Kit (Human Genetic Signatures).
  • the full conversion protocol was modified by changing the incubation temperature to 73 degrees Celsius (from 80 degrees Celsius) and the incubation time to 10 minutes (instead of 45 minutes) . Four regions were amplified to measure the conversion rate.
  • the sequencing primer (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine to protect the ends from degradation by exonuclease .
  • the sequencing primer (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine to protect the ends from degradation by exonuclease, and wherein 5Phos is a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
  • Figure 6 panel A depicts the rate at which each base is observed over all the data for those positions with a coverage of at least 30 reads. Nearly all the C positions are at 40% C and 60% T. For each of the four regions amplified, conversion patterns were compared between reads. Not all regions are equally well covered in the data, with 40-4500 reads.
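  • The per-position conversion rate described above can be computed from base counts at each reference cytosine. The following sketch is an illustration; the input format (a dictionary of per-position base counts) and the function name are assumptions, and the default threshold of 30 reads follows the coverage cutoff stated above.

```python
def conversion_rates(base_counts, min_coverage=30):
    """Fraction of reads showing T at each reference C position.

    base_counts maps a position to a dict of observed base counts,
    for example {"C": 40, "T": 60}.
    """
    rates = {}
    for position, counts in base_counts.items():
        unconverted = counts.get("C", 0)
        converted = counts.get("T", 0)
        coverage = sum(counts.values())
        if coverage >= min_coverage and (unconverted + converted) > 0:
            rates[position] = converted / (unconverted + converted)
    return rates
```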
  • Figure 6 panel B depicts the cumulative distribution of the conversion rate per read.
  • Figure 6 panel C depicts correlation in flips for all cytosines in the targeted region. It was determined that there was none.
  • Figure 6 panel D depicts the data of Figure 6 panel C as a histogram.
  • Counting varietal tags can be used to mitigate the effects of amplification bias. While the original message is completely recoverable, the tag is confined to one end of the molecule such that identity and count can only be distinguished within one read length of the ends. Only reads that include the tag are useful in determining count and varietal tags provide no solution for assembly and assortment.
  • Described herein are different approaches for counting and assembling templates using template mutagenesis.
  • the non-limiting examples herein demonstrate by simulation that template mutagenesis can solve both the problems of counting and assembly.
  • the order of operation is mutagenesis first, followed by short- or long-range PCR, then fragmentation, if needed, and preparation of sequence libraries.
  • Two classes of applications were explored. The first is counting specific DNA or RNA molecules, for assessing genome copy number or profiling a transcriptome ( Figure 1) .
  • the second is sequence assembly - for example establishing haplotypes or distinguishing transcript isoforms ( Figure 2) .
  • each mutable position (or "bit") converts (or "flips") independently from a wild-type state to an altered state with a fixed probability (or "flip rate”) .
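  • The mutation model just described can be sketched in a few lines. This is a minimal illustration using the Python standard library rather than the simulation library of Example 11; the function names and the fixed seed are assumptions.

```python
import random


def mutate_templates(template_count, bit_length, flip_rate, seed=0):
    """Return one bit pattern per template; each bit flips independently."""
    rng = random.Random(seed)
    return [
        tuple(int(rng.random() < flip_rate) for _ in range(bit_length))
        for _ in range(template_count)
    ]


def count_distinct_patterns(patterns):
    """Number of distinct mutation patterns, a lower bound on template count."""
    return len(set(patterns))
```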
  • Performance was simulated under a variety of reasonable parameters for read length and mutation rate, and over a range of template lengths and counts. The results are first presented under an assumption of complete coverage to obtain a theoretical upper limit of performance, and the consequences of sampling to various levels of coverage are then considered. In the simulations, mutable positions are distributed uniformly throughout the template such that each read contains the same number of bits (or "bit length"). Sequence and mapping errors are not presently incorporated. Variations to these assumptions and procedures are addressed herein.
  • RNA templates can be counted with near perfect accuracy for mRNA species of intermediate to scarce expression ( ⁇ 1,000 copies per cell).
  • varietal tagging can achieve accurate counting of gene transcripts, even long and abundant ones, but it is limited to labeling the end of a molecule and so does not allow counting of isoforms or observing sequence variants, except near the ends of transcripts.
  • the two methods, varietal tagging and mutagenesis can be seamlessly integrated, achieving the benefits of both methods.
  • the ability to establish phase by this method depends on strong concordance between the haplotypes and the reference genome. For those regions where the reference genome is a poor match, due to repeat content, large-scale rearrangements or novel sequence, the mutation pattern assembly algorithm will fail to generate consistent end-to-end assemblages. Although this presents a problem for direct inference by reference-matched phasing, it provides an opportunity for de novo haplotype assembly.
  • the SUTTA algorithm (Narzisi and Mishra, 2011) assembles haplotypes from short-read data by scoring proposed local assemblies based on orthogonal data sources, such as coverage, mate pairs, or physical maps. Template mutagenesis can help. Each local reference genome that SUTTA considers can also be assigned a score based on the number of successful end-to-end mutation pattern assemblies over the region. The result would be a de novo assembly over the human genome for those difficult regions.
  • perfect mapping was assumed. In practice, however, the ability to map reads might be somewhat degraded by template mutation.
  • a standard practice is to map to a reduced alphabet where all cytosines ("C"s) are converted to thymines ("T"s) in both the read and the reference, with two distinct reference genomes, one for each DNA strand (Krueger et al., 2011; Otto et al., 2012).
  • restricting to a smaller alphabet and doubling the reference genomes impacts the ability to unambiguously map reads; however, the effect is surprisingly mild (Krueger et al., 2012).
  • the mapping algorithm can be augmented with a probabilistic model of the flip rate to prioritize the most likely alignments.
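  • One way such a probabilistic model might look is sketched below. This is a toy scoring function under stated assumptions, not a described implementation: alignments are gapless and equal in length, only C-to-T conversion on one strand is modeled, the flip rate is strictly between 0 and 1, and the `error_rate` parameter is a hypothetical per-base sequencing error probability.

```python
from math import log


def alignment_log_likelihood(read, reference, flip_rate, error_rate=0.01):
    """Log-likelihood of a read given a candidate reference segment."""
    if len(read) != len(reference):
        raise ValueError("read and reference segment must have equal length")
    score = 0.0
    for ref_base, read_base in zip(reference, read):
        if ref_base == "C":
            if read_base == "T":
                p = flip_rate            # converted cytosine
            elif read_base == "C":
                p = 1.0 - flip_rate      # unconverted cytosine
            else:
                p = error_rate
        else:
            p = 1.0 - error_rate if read_base == ref_base else error_rate
        score += log(p)
    return score
```

  • Candidate alignments of the same read can then be ranked by this score, so that a location explaining a mismatch as a C-to-T flip is preferred over one requiring a sequencing error.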
  • sequence error may reduce the ability to recover mutation patterns in those cases where the error appears to flip a bit or reverse a flipped bit. Fortunately, sequence error is typically rare. Within a reasonable range for flip rate, window size, and template count, sparse mutational patterns are expected, well separated so that no two patterns are very much alike. Sequence error will produce a pattern "nearby" an established pattern, and less well covered, and this signature can be used to discount those reads.
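  • The signature described above, a poorly covered pattern lying close to a well-covered one, suggests a simple filter. The sketch below is an illustration only; the thresholds `max_distance` and `min_ratio` are hypothetical parameters chosen for the example.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def discount_error_patterns(patterns, max_distance=1, min_ratio=5):
    """Keep patterns that are not low-coverage neighbors of a kept pattern."""
    kept = {}
    # Visit patterns from most to least covered.
    for pattern, count in Counter(patterns).most_common():
        looks_like_error = any(
            hamming(pattern, parent) <= max_distance
            and parent_count >= min_ratio * count
            for parent, parent_count in kept.items()
        )
        if not looks_like_error:
            kept[pattern] = count
    return kept
```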
  • the simulations demonstrate that most applications work best for a low initial template count, less than a few thousand. This is not a problem for many genomic applications and is close to ideal for single-cell RNA analysis. If analysis of greater numbers of template molecules is desired, for example during analysis of bulk mRNA, then after mutagenesis of the first strand cDNA, multiple separate amplifications reactions can be performed, each with low template count. The products of each reaction can be tagged with barcodes, pooled and sequenced.
  • the description of the method is very similar to that established for the phiX samples discussed above.
  • a study was performed on a PstI digest of a human genome.
  • Genomic DNA is digested with a restriction enzyme (PstI) . Fragments are end-repaired, adenylated, and then ligated to bisulfite resistant sequencing adapters. These adapters match the standard Illumina adapters, save that the cytosines are replaced by methyl-cytosines .
  • the sample is then treated with a standard kit for bisulfite treatment (MethylEasy Xceed Rapid DNA Bisulphite Modification Kit Mix; Human Genetic Signatures.) Instead of incubating the sample for the standard of 80°C for 45 minutes, 3, 6, and 9 minutes at 73°C were tested. One library using the standard 80°C and 45 minutes was also generated. The samples were sub-sampled, amplified and sequenced.
  • the resulting reads were mapped to the genome using an informatics pipeline designed for bisulfite sequence data.
  • the converted read-pairs are then mapped twice, once to a genome where every C is converted to a T and once to a genome where every G is converted to A.
  • the best mapping was assigned to the original read-pair and the mapped genome recorded.
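  • The two-genome mapping step can be illustrated with a toy mapper. In the sketch below, exact substring search stands in for a real aligner, reverse-complement handling and read pairing are omitted, and the function names and strand labels are assumptions made for the example.

```python
def convert(sequence, source, target):
    """In-silico conversion of every `source` base to `target`."""
    return sequence.replace(source, target)


def map_converted_read(read, genome):
    """Map a read against the C-to-T and G-to-A converted genomes.

    Returns a dict from strand label to the match position, containing
    only the conversions for which the converted read was found.
    """
    conversions = {"OT": ("C", "T"), "OB": ("G", "A")}
    hits = {}
    for label, (source, target) in conversions.items():
        position = convert(genome, source, target).find(
            convert(read, source, target)
        )
        if position >= 0:
            hits[label] = position
    return hits
```

  • A read carrying a partial C-to-T conversion pattern maps only to the C-to-T converted genome, which identifies its strand of origin.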
  • Reads that map to the AGT-genome are called "original top” or "OT” and are templates derived from the top strand of the initial restriction fragment.
  • the reads that map to the ACT genome are called "original bottom" or "OB." The analysis focused on 135 thousand fragments with high-quality alignments in the range of 150 to 400 base pairs.
  • Each restriction fragment/strand provides an opportunity to observe multiple reads derived from the same initial template.
  • with error-free data, one need only cluster reads that have precisely the same pattern.
  • a robust method was developed for joining reads derived from the same initial template.
  • Information was extracted from all convertible positions, and reads were then clustered using a multi-scale clustering algorithm that works on pairwise Hamming distances [arXiv:1506.03072 (the clustering method devised is available at arxiv.org/abs/1506.03072)].
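  • To show how pairwise Hamming distances can group reads from the same template, the sketch below uses plain single-linkage clustering with a fixed distance threshold. This is a deliberately simpler stand-in and is not the multi-scale algorithm cited above; the function names and the threshold parameter are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_hamming(patterns, max_distance):
    """Single-linkage clusters of read indices within `max_distance`."""
    parent = list(range(len(patterns)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(patterns)):
        for j in range(i + 1, len(patterns)):
            if hamming(patterns[i], patterns[j]) <= max_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(patterns)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```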
  • An example of clustered reads at a single restriction fragment for the original top strand is shown in Figure 8.
  • the mutational protocol can be applied to cDNA as well. While this data is less well-studied, the preliminary results are very promising. Taking whole RNA derived from cell lines, the mRNA was reverse transcribed with poly-T and template switch oligo primers that are resistant to bisulfite mutation (methyl-cytosines substituted for cytosines). The resulting first strand cDNA was mutagenized with the muSeq protocol for 6 minutes at 73°C. The mutated strands were then sub-sampled, amplified, sonicated, repaired, ligated to sequencing adapters, amplified and sequenced.
  • the resulting reads were then converted two ways (read 1 C→T, read 2 G→A; and read 1 G→A, read 2 C→T) and mapped to two versions of the human genome using the STAR mapper (Dobin et al., STAR: ultrafast universal RNA-seq aligner, Bioinformatics, 2012), much as described above. The best of the four mappings was selected and assigned to the original read. Plots showing stacks of reads in the IGV viewer are shown in Figures 13 and 14.
  • RNA sequence can be directly mutagenized before reverse transcription.
  • a sample is obtained from a subject afflicted with cancer.
  • the sample is subjected to a chemical mutagenesis as described herein.
  • the mutagenized sample is sequenced, aligned, mapped, and counted as described herein.
  • the presence of tumor cells in the sample is determined. Also, quantification of tumor cells in the sample is determined. Also, one or more rare mutations in the sample are determined. Also, one or more single nucleotide polymorphisms in the sample are determined. Also, benign and malignantly transformed cells are distinguished.
  • Example 7B Detecting a small load of cancer DNA in the presence of an excess of normal DNA
  • a sample is obtained from a subject afflicted with cancer.
  • the sample is subjected to a chemical mutagenesis as described herein.
  • the sample is further subjected to beads and/or varietal tags.
  • the mutagenized sample is sequenced, aligned, mapped, and counted as described herein.
  • the presence of tumor cells in the sample is determined. Also, quantification of tumor cells in the sample is determined. Also, one or more rare mutations in the sample are determined. Also, one or more single nucleotide polymorphisms in the sample are determined. Also, benign and malignantly transformed cells are distinguished.
  • a sample is obtained from a pregnant female.
  • the sample is subjected to a chemical mutagenesis as described herein.
  • the mutagenized sample is sequenced, aligned, mapped, and counted as described herein .
  • One or more rare mutations in a fetus are determined.
  • one or more single nucleotide polymorphisms in a fetus are determined.
  • one or more chromosomal abnormalities in a fetus are determined.
  • The error reduction described above in Example 6 is used in conjunction with the beads and/or varietal tags to obtain sequence counts for rare variants.
  • a group of RNA transcripts in one cell or a population of cells is obtained.
  • the group of RNA transcripts is subjected to a chemical mutagenesis and sequenced as described herein.
  • An assembly algorithm is applied to the sequences of the group of RNA transcripts.
  • the assembly algorithm may be SOAPdenovo-Trans, Velvet/Oases, Trans-ABySS, or Trinity transcriptome assemblers.
  • a transcriptome is obtained without mapping to a reference genome.
  • a group of NAMs is obtained from genomes from a large variety of organisms, wherein some of the organisms may be highly related.
  • the group of NAMs is subjected to a chemical mutagenesis and fragmentation as described herein.
  • the group of mutagenized fragments of NAMs is sequenced.
  • the sequencing may be metagenome sequencing, shotgun sequencing, or high-throughput sequencing.
  • an assembly algorithm is applied to the sequences of the group of mutagenized fragments of NAMs.
  • the assembly algorithm may be Phrap, Celera, or Velvet/Oases assemblers. Independent genomes are assembled.
  • seed hash ("This is not a random seed.")
  • templates_binary np . array ([ toBinary (x) for x in templates] )
  • mean_unique np .
  • mean (unique_counts [ ind] ) info [ read_length, flip_rate, template_count, coverage, mean_unique]
  • fixed_window_finite_coverage Performs simulations testing the recovery of template count over a fixed window for a range of flip_rates, read_lengths , and template_counts
  • seed hash ("This is not a random seed.")
  • templates_binary np . array ([ toBinary (x) for x in templates] )
  • seed hash ("This is not a random seed.")
  • template_length 2*read_length + span_length ## length of template
  • left_edge read_length ## left edge of span (first read position to not contain left mark)
  • templates getRandomPatterns (template_count,
  • Apos read_length - 1
  • word_space getWordSpace (match)
  • num_reads int (coverage * template_count * ( float (template_length) / float ( read_length) ) )
  • read_template np . random. randint (0, template_count, num reads)
  • read_start np . random. randint (0, max_start, num reads)
  • word space assigns ambiguous reads to their least template
  • read_index [word_space [x] for x in zip (read_template, read_start) ]
  • read_tracker defaultdict (set)
  • edge_in, edge_out, inscore greedyAssembly (read_start, read_index, match, one_count, score_table, read_length,
  • true_exact_match false_exact_match
  • many_windows_finite_coverage_counting Performs simulations testing the recovery of template count over a template of many read lengths for a fixed flip_rate, and a range of read_lengths, and template_counts,
  • This provides a speed advantage for long templates.
  • seed hash (This is not a seed.)
  • template_length read_length*template_factor
  • templates getRandomPatterns (template_count,
  • word_space getWordSpace (match)
  • read_template np . random . randint ( 0 ,
  • read_start np . random . randint ( 0 , max_start, num_reads )
  • read_index [word_space [x] for x in zip (read_template, read_start) ]
  • seed hash ("This is not a random seed.")
  • template_length read_length*template_factor
  • templates getRandomPatterns (template_count, template_length, flip_rate)
  • overlap_count np . array ( [np . sum (np . equal . outer (x, x) , 0) for x in overlap . T ]).
  • def templateToWindows (template, window_size) :
  • def templatesToWindows templates, window_size
  • : ' ' ' convert a set of template patterns into their sequence of window_size subwords ' ' '
  • Tl and T2 match for every position on the window P: (P+W)
  • mismatch_pos [ i ] np .
  • logical_xor templates [ i ] , templates
  • each position converts from template index to word index. if each word is unique at a position, then they are identical, otherwise, each word is assigned its lowest index template.
  • template_count match . shape [ 0 ] ## number of templates
  • read_length match . shape [ 3 ] - 1 ## length of read
  • max_start match . shape [ 2 ] - read_length + 1 ## maximum read start position
  • word_space [t, pos] np.argmax(x)
  • one_count[T, P, W] returns the number of flipped positions in the window P: (P+W) in template T.
  • one_count [ : , :toEnd, k+1] one_count [ : , : toEnd, k] + templates [ : , k : ]
  • This function generates a lookup table that keeps the values for: length of overlap, number of ones for a fixed flip_rate.
  • the primary DAG has vertices for each read
  • the complete DAG has the same vertices
  • overlap_match np . equal . outer (overlap [: , position+1], overlap[:, position+1])
  • source (position, read[x, position]) ##
  • Y is a simple node if
  • overlap_match np . equal . outer (overlap [: , position+1], overlap[:, position+1])
  • source (position, read[x, position]) ##
  • node_list sorted (out_edge . keys ()) [:: -1 ]
  • the primary DAG has vertices for each read
  • max_start read . shape [ 1 ] - 2 ## iterate backwards from the next to last position to the first for position in range (max_start, -1, -1) :
  • overlap_match np . equal . outer (overlap [: , position+1], overlap[:, position+1])
  • source (position, read[x, position]) ##
  • def overlapMatch (read_start, read_index, match, read_length, template_length) :
  • index_to_start np . array ( [bisect . bisect_left (read_start, x) for x in np . arange (template_length) ] + [num_nodes] )
  • index_to_start [ a_pos : (a_pos + 2)]
  • a_template read_index [ low : high]
  • overlap_length read_length - b_pos + a_pos
  • def pathScore edge_in, edge_out, read_start, read_index, left_edge, right_edge, read_tracker
  • start_marks [set () for _ in range (num_nodes ) ]
  • start_marks [target] .update start_marks [source] )
  • Some nodes may be referred by multiple reads from different templates -- if the mutation patterns are degenerate for more than a read length .
  • has_true_path hasTruePath (tindex, pstart, pend)
  • the score for an edge is the log likelihood of an accidental overlap
  • index_to_start np . array ( [bisect . bisect_left (read_start, x) for x in np . arange (template_length) ] + [num_nodes])
  • outedge_heaps [[] for _ in range (num_nodes) ]
  • index_to_start [ a_pos : (a_pos + 2)]
  • a_template read_index [ low : high]
  • overlap_length read_length - b_pos + a_pos
  • score score table [bits, flipped]

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Microbiology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Immunology (AREA)
  • Analytical Chemistry (AREA)
  • Plant Pathology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Ecology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising i) obtaining an amplified and mutagenized group of NAMs that was produced by a. subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90% thus forming a group of mutagenized NAMs (mNAMs), and b. amplifying the group of mNAMs; ii) obtaining sequences of the mNAMs in the group of amplified mNAMs; and iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs, thereby determining the number of NAMs in the group of NAMs.

Description

RANDOM NUCLEOTIDE MUTATION FOR NUCLEOTIDE TEMPLATE COUNTING AND ASSEMBLY
This application claims the priority of U.S. Provisional No. 62/062,571, filed October 10, 2014, the content of which is hereby incorporated by reference.
Throughout this application, various publications are referenced, including in parentheses. Full citations for publications referenced in parentheses may be found listed at the end of the specification immediately preceding the claims. The disclosures of all referenced publications in their entireties are hereby incorporated by reference into this application in order to more fully describe the state of the art to which this invention pertains.
Background of Invention
Despite improvements in DNA sequencing, many problems of interpretation arise when trying to count or assemble molecules (templates) that are largely identical. Some problems in genomic analysis have remained difficult despite the development of high throughput sequencing methods. Many of these problems arise from the inability to distinguish identical and nearly identical template sequences. Counting molecules of identical composition in an RNA sequencing assay, or determining the copy number of identical stretches of DNA, currently depends on quantitative methods that adjust imperfectly for the distortions of data caused by sample processing.
Moreover, when read lengths are short, determining the physical connection of distinguishable elements separated by long identical stretches is difficult to impossible and limits the ability to phase single nucleotide variants (SNVs), identify transcript isoforms, and assemble through repetitive genomic regions.
There are several protocols in which a sequence of random nucleotides is appended to the template molecules before amplification and sequencing. This methodology has been applied under a variety of names to identify PCR duplicates (McCloskey et al., 2007; Miner et al., 2004), improve counting of DNA (Casbon et al., 2011; Fu et al., 2011) and RNA (Islam et al., 2014; Jabara et al., 2011; Kivioja et al., 2012) templates, and reduce sequence error (Hiatt et al., 2013; Kinde et al., 2011; Schmitt et al., 2012). Each implementation has its own name for the random nucleotide sequences, which can be referred to as varietal tags (Schmitt et al., 2012). Each implementation also has its own shortcoming.
Summary of the Invention
The present invention provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) obtaining an amplified and mutagenized group of NAMs that was produced by
a) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90% thus forming a group of mutagenized NAMs (mNAMs), and
b) amplifying the group of mNAMs;
ii) obtaining sequences of the mNAMs in the group of amplified mNAMs; and
iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
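By way of non-limiting illustration, the counting steps above may be sketched in Python as follows. The sketch assumes C-to-T conversion as the chemical mutagenesis, an idealized error-free amplification, and a hypothetical template sequence; the function names `mutagenize` and `count_templates` and the 0.35 rate are illustrative choices and are not part of the method as described.

```python
import random


def mutagenize(template, rate, rng, mutable="C", product="T"):
    """Convert each mutable base independently with probability `rate`."""
    return "".join(
        product if base == mutable and rng.random() < rate else base
        for base in template
    )


def count_templates(sequences):
    """Step (iii): the number of distinct sequences estimates the template count."""
    return len(set(sequences))


rng = random.Random(0)                              # fixed seed for repeatability
template = "ACGTCCATCGGCTACCGTCAGCCTACGCTCCA"       # hypothetical sequence
originals = [template] * 8                          # eight indistinguishable NAMs
mnams = [mutagenize(t, 0.35, rng) for t in originals]
amplified = [m for m in mnams for _ in range(5)]    # idealized, error-free amplification
estimate = count_templates(amplified)
print(estimate)
```

Because two molecules can receive the same mutation pattern by chance, the estimate may undercount; windows containing more mutable positions make such collisions rarer.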
The present invention also provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90%, to produce a group of mutagenized NAMs (mNAMs);
ii) amplifying the group of mNAMs;
iii) sequencing the mNAMs in the group of amplified mNAMs to obtain sequences of the mNAMs; and
iv) counting the number of different sequences obtained in step (iii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
The present invention also provides a method for determining the number of different sequences in a group of nucleic acid molecules (NAMs) that have been mutagenized and then amplified comprising
a) obtaining the group of NAMs that have been mutagenized and then amplified;
b) obtaining sequences of the mutagenized NAMs (mNAMs) in the group of amplified mNAMs; and
c) counting the number of different sequences obtained in step (b),
thereby determining the number of different sequences in the group of amplified mNAMs.
The present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
i) obtaining two or more copies of the NAM;
ii) subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid bases in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM);
iii) amplifying each of the mcNAMs;
iv) obtaining composite sequences of the mcNAMs that are produced by assembling sequence reads of the amplified mcNAMs, such that when taken together, span as much as possible of the entire length of the NAM, by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, and
b) mapping the composite sequences,
thereby sequencing the NAM.
The present invention also provides a method for determining genomic copy number information from genomic material, comprising,
i) obtaining segments of the genomic material; and
ii) determining the number of segments of the genomic material according to the method of the claimed invention, thereby determining genomic copy number information from genomic material.
The present invention also provides a method for profiling RNA transcripts, comprising
i) obtaining a group of RNA transcripts;
ii) determining the number of RNA transcripts in the group of RNA transcripts according to the method of the claimed invention; and
iii) determining the proportionate number of a plurality of RNA transcripts having the same sequence to a second different plurality of RNA transcripts that have the same sequence, thereby determining RNA transcript profile.
The present invention also provides a method for determining allelic imbalance, comprising
i) obtaining copy number of a first allele;
ii) obtaining copy number of a second allele; and
iii) comparing the copy numbers obtained in steps (i) and (ii), thereby determining allelic imbalance, wherein the copy number in steps (i) and (ii) is obtained by the method of the claimed invention.
The present invention also provides a method for determining genome assembly, comprising
i) obtaining segments of a genome, wherein the segments span the entire length of the genome;
ii) sequencing the segments of the genome according to the method of claim 11;
iii) aligning the sequences obtained in step (ii) according to matching mutation patterns in overlaps of the sequences; and
iv) mapping the sequences,
thereby assembling the genome.
The present invention also provides a method for determining haplotype assembly, comprising
i) obtaining a group of alleles, wherein the alleles in the group of alleles are located in the same chromosome;
ii) sequencing each allele in the group of alleles according to the method of claim 11, and
iii) comparing the sequences obtained in step (ii), thereby determining haplotype assembly.
The present invention also provides a kit for determining the number of NAMs in a group of NAMs comprising:
a) a mutagen; and
b) instructions for performing mutagenesis resulting in suboptimal mutagenesis,
wherein the mutagen is a bisulfite or a salt thereof, or a deamination agent.
The present invention also provides a composition of matter comprising a plurality of mutagenized nucleic acid molecules (mNAMs), wherein selected mutable nucleic acid positions in the plurality of mutagenized NAMs (mNAMs) are mutated at a rate of 10% to 90%.
The present invention also provides a composition of matter derived from sequencing primers, which has the sequence:
ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine.
The present invention also provides a composition of matter derived from sequencing primers, which has the sequence:
5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
The present invention also provides a composition of matter comprising two or more copies of a nucleic acid molecule (NAM) comprising two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, wherein each copy of the NAM has a unique primer at its 5' end and another unique primer at its 3' end, and is subjected to a mutagenesis that mutates each mutable position in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM), wherein the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
The present invention also provides sequence information produced by a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the present invention, or the group of mcNAMs in the method of the present invention.
The present invention also provides a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the present invention, or the group of mcNAMs in the method of the present invention.
Brief Description of the Drawings
Figure 1. Enumerating Templates Over A Fixed Window
Illustration of the process of counting indistinguishable template molecules (panel A). In the first step, the molecules were mutated by a random process, creating a unique mutation signature for each molecule (panel B). Cytidine deamination is also illustrated. Cytidine deamination is a mutational method that converts cytidine (underlined) to a uridine (bold). Upon amplification, uridine becomes a thymidine (bold) (panel C). The sequenced nucleotide strings are aligned, aggregated by their mutational patterns (panel D), and the number of the distinct patterns counted.
Figure 2. Assembling Long Templates
Illustration of the process of connecting distinguishable markers separated by a long span of indistinguishable sequence. In this example, each template has a unique pair of markers, or "end tags", denoted by the colored circle and square (panel A). Markers of the same color occur on the same template strands and are said to be "in phase". Gray marks on the templates show positions that may mutate. Each template was subjected to a random mutation process (panel B) that converts some of the positions (converted positions shown in black). After amplification and sequencing, each read is mapped to the reference template (panel C, top strand). Finally, by matching mutation patterns from overlapping reads, the phase of the original templates was recovered (panel D).
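By way of non-limiting illustration, matching mutation patterns across overlapping reads may be sketched in Python as follows. Reads are represented as hypothetical (start, bits) pairs that have already been mapped to the reference, with one bit per mutable position. The rule of always following the agreeing read with the longest overlap is a simplifying assumption made for this sketch; it is not the likelihood-based edge score used in the simulations.

```python
def overlap_agrees(a_start, a_bits, b_start, b_bits):
    """True if read b starts inside read a and both agree on every shared position."""
    offset = b_start - a_start
    if offset < 0 or offset >= len(a_bits):
        return False
    shared = min(len(a_bits) - offset, len(b_bits))
    return a_bits[offset:offset + shared] == b_bits[:shared]


def extend_path(reads, first):
    """Greedily chain reads rightward from `first`, preferring the longest overlap."""
    path = [first]
    current = first
    while True:
        candidates = [
            r for r in reads
            if r[0] > current[0] and overlap_agrees(*current, *r)
        ]
        if not candidates:
            return path
        # The smallest start among agreeing reads gives the longest overlap.
        current = min(candidates, key=lambda r: r[0])
        path.append(current)


# Two hypothetical templates, "0110100111" and "1100101001", each sampled
# with reads of six bits at starts 0, 2 and 4.
reads = [
    (0, "011010"), (2, "101001"), (4, "100111"),
    (0, "110010"), (2, "001010"), (4, "101001"),
]
phase_1 = extend_path(reads, (0, "011010"))
phase_2 = extend_path(reads, (0, "110010"))
print(phase_1)
print(phase_2)
```

In this toy input each path recovers the three reads of a single template, which is the sense in which the left and right end tags are placed in phase. With shorter overlaps or sparser sampling, a greedy choice of this kind can link reads from different templates.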
Figure 3. Recovering Counts
The ability to recover the template count is a function of the window size, template length, flip rate, number of templates, and depth of coverage. Simulation results of template count estimation are shown under a variety of conditions. Each panel has three plots, for window sizes of 10, 20, and 30 bits. The x axis shows the true count from 2 to 1,024 (log2 scale) and the y axis shows the average estimated count divided by the true count, or the proportion of templates recovered. Panel A simulates recovery when the template is one window long for a range of flip rates for infinite coverage. Panel B shows the results from one window template under finite coverage (1x to 5x reads per template) for a fixed flip rate of 0.35. Panel C repeats the results of panel B for long templates comprised of 16 read lengths.
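By way of non-limiting illustration, the proportion of templates recovered for a single window under infinite coverage can be computed directly rather than simulated. The Python sketch below assumes that every bit flips independently with the same flip rate, so that a specific pattern with k flipped bits out of w has probability q = p^k (1 - p)^(w - k) and is observed at least once among N templates with probability 1 - (1 - q)^N. The function name is an illustrative choice.

```python
from math import comb


def expected_fraction_recovered(n_templates, window_bits, flip_rate):
    """Expected number of distinct patterns divided by the true template count."""
    p, w, n = flip_rate, window_bits, n_templates
    distinct = 0.0
    for k in range(w + 1):
        # Probability that one template shows a given pattern with k flips.
        q = p ** k * (1 - p) ** (w - k)
        # comb(w, k) patterns share this probability; each is seen at least
        # once with probability 1 - (1 - q)^n.
        distinct += comb(w, k) * (1 - (1 - q) ** n)
    return distinct / n


for bits in (10, 20, 30):
    print(bits, expected_fraction_recovered(1024, bits, 0.35))
```

Plain floating-point arithmetic is used for brevity, so the result is approximate when q is very small.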
Figure 4. Recovering Phase
Proper phasing of end tags separated by a long span of identical sequence depends on window size, template length, flip rate, number of templates, and depth of coverage. In panels A and B, simulation results are shown for the recovery of phase for 32 templates across a span ranging from 2 to 1,024 read lengths for a 30-bit window. In panel A, infinite coverage was assumed and recovery was examined as a function of flip rate. Because coverage is exhaustive and the assembly graph is well characterized, there are no errors and all phased results are correct. In panels B and C, the case of finite coverage was considered, fixing the flip rate at 0.35 over a range of coverages from 2x to 14x per template. In panel B, the number of template molecules was fixed at 32 to explore the effect of template length on recovery of phase. In panel C, the template length was fixed to 32 read lengths to explore the effect of the number of templates. Because not every read is observed, a greedy algorithm was used that may return exact matches that are correct (black) or incorrect (white).
Figure 5. Greedy Path Assembly
To carry out the pattern assembly shown in Figure 2, a graph was defined in which each read is a vertex. Some reads contain their end tag (colored circles) and some do not (white circles). Two reads were connected with an edge if they agree on their overlap. The weight of an edge reflects the strength of that overlap. Panel A depicts the template information assuming exhaustive coverage, drawing all distinct reads and the edges between them. In panels B1 and B2, reads were finitely sampled from the templates at a depth of coverage of 4x and 8x per template, respectively. From this information the greedy algorithm was applied (panels C1 and C2) to select the best edges, shown in red. When coverage is low (panel C1), some paths do not succeed in spanning the length of the template and, of those that do, three determine the correct phasing and two are in error. Under higher coverage (panel C2), all paths span the template and all six are correctly phased.
Figure 6. Partial Bisulfite Mutagenesis of a PhiX Template Genome
The experimental results from partial bisulfite mutagenesis of a PhiX template genome. Panel A depicts the rate at which each base is observed over all the data for those positions with a coverage of at least 30 reads. Panel B depicts the cumulative distribution of the conversion rate per read. Panel C depicts the correlation in flips for all cytosines in the targeted region. Panel D depicts panel C as a histogram. For the best amplified position, partial conversion was determined with a 60% flip rate randomly distributed throughout each read.
Figure 7. Conversion rate as a function of incubation time and temperature. The datasets A3, A6, A9 and A45 are the 3, 6, 9 and 45 minute conversions described herein.
Figure 8. Subset of clustered reads showing mutational patterns and two heterozygous positions in the sample. The panel on the left shows all the positions in the fragment while the plot on the right shows only cytosine (bit) positions. The white lines separate reads derived from the same initial template. Each cluster contains between 30 and 50 reads. Black indicates a position where the read matches the reference genome. The frequent gray squares are cytosines that have converted to thymine. The white and light grey streaks are linked heterozygous alleles which split according to mutation pattern. Sparse background errors are typically from the sequencer while the bands of error are typically the result of PCR.
Figure 9. Bit positions and bit patterns are uncorrelated. For the plot on the left, pairs of positions (100,000) were sampled and the Fisher exact p-value computed to test the independence of bits by position. For the plot on the right, pairs of templates (100,000) were sampled and the Fisher exact p-value computed to test the independence of bits within a pattern. Both plots conform to the expected distribution.
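By way of non-limiting illustration, a two-sided Fisher exact p-value for a pair of bit vectors may be computed as in the following Python sketch. The two-sided rule used here (summing the probabilities of all tables that are no more likely than the observed one) is a common convention and is an assumption of this sketch; the function names and the example bit vectors are hypothetical.

```python
from math import comb


def contingency(bits_x, bits_y):
    """2x2 table of joint counts for two equal-length sequences of 0/1 flags."""
    a = sum(1 for x, y in zip(bits_x, bits_y) if x and y)
    b = sum(1 for x, y in zip(bits_x, bits_y) if x and not y)
    c = sum(1 for x, y in zip(bits_x, bits_y) if not x and y)
    d = sum(1 for x, y in zip(bits_x, bits_y) if not x and not y)
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Small relative tolerance so that ties with the observed table are included.
    return sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# Bits at two hypothetical positions, observed across six templates.
position_1 = [1, 1, 1, 0, 0, 0]
position_2 = [1, 1, 1, 0, 0, 0]
p_value = fisher_exact_two_sided(*contingency(position_1, position_2))
print(p_value)
```

When bits are independent, p-values computed over many sampled pairs are expected to show no excess of small values.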
Figure 10. Comparison of template count distributions for fragments from the autosome and X chromosome. Since the sample is male, a 2 to 1 ratio was observed in the mean template counts. The histogram is the empirical distribution and the curve shows a negative binomial fit.
Figure 11. Heterozygous allele counts by template demonstrate a perfect fit to the binomial distribution. The plot on the left shows as a histogram the counts for one allele. The curve shows the theoretical expectation of the count distribution assuming a binomial distribution for the allele at each locus assuming the given locus coverage. The plot on the right shows the Q-Q plot over all 6000 heterozygous positions observed.
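By way of non-limiting illustration, a theoretical expectation of this kind may be obtained by summing one binomial distribution per locus, as in the Python sketch below. An allele probability of 0.5 per template at a heterozygous locus is assumed, and the function names and the example coverages are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of k templates carrying the allele out of n covering the locus."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def expected_allele_count_histogram(coverages, p=0.5):
    """Expected number of loci showing each allele count, summed over loci."""
    expected = [0.0] * (max(coverages) + 1)
    for n in coverages:
        for k in range(n + 1):
            expected[k] += binomial_pmf(k, n, p)
    return expected


coverages = [2, 4, 6]          # hypothetical template counts at three loci
curve = expected_allele_count_histogram(coverages)
print(curve)
```

The entries of the resulting curve sum to the number of loci, so the curve can be overlaid directly on a histogram of observed allele counts.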
Figure 12. Heterozygous allele counts by read fail to fit the binomial distribution. As in Figure 11, except that alleles are counted by read and the fit is quite poor. It is far better to count templates than reads.
Figure 13. Properties of the consensus sequences derived by clustering reads with the same mutation pattern. For each consensus sequence base, the proportion of reads reporting the homozygous base was determined. These error rates are an order of magnitude below sequencer error.
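By way of non-limiting illustration, deriving a consensus sequence from reads that share a mutation pattern may be sketched in Python as follows. The sketch assumes C-to-T conversion, reads of the same length as the reference, and clustering by exact pattern match, which is a simplification because a sequencing error at a mutable position would split a cluster. The function names and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_pattern(reads, reference, mutable="C", product="T"):
    """Group reads by their converted/unconverted state at mutable positions."""
    clusters = defaultdict(list)
    for read in reads:
        key = tuple(
            read[i] == product
            for i, ref_base in enumerate(reference)
            if ref_base == mutable
        )
        clusters[key].append(read)
    return clusters


def consensus(cluster):
    """Majority base at each position across the reads of one cluster."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*cluster)
    )


reference = "ACGTCA"
reads = [
    "ATGTCA", "ATGTCA", "ATGTCA",  # template converted at the first cytosine
    "ATATCA",                      # same template, one sequencing error (G read as A)
    "ACGTCA", "ACGTCA",            # a second, unconverted template
]
clusters = cluster_by_pattern(reads, reference)
consensus_sequences = sorted(consensus(c) for c in clusters.values())
print(consensus_sequences)
```

The isolated error in the fourth read is outvoted within its cluster, which is the mechanism by which consensus sequences can have lower error than individual reads.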
Detailed Description of the Invention
The present invention provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) obtaining an amplified and mutagenized group of NAMs that was produced by
a) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90% thus forming a group of mutagenized NAMs (mNAMs), and
b) amplifying the group of mNAMs;
ii) obtaining sequences of the mNAMs in the group of amplified mNAMs; and
iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
The present invention provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) obtaining an amplified and mutagenized group of NAMs that was produced by
a) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid positions in the group of NAMs at a rate of 10% to 90% thus forming a group of mutagenized NAMs (mNAMs), and
b) amplifying the group of mNAMs;
ii) obtaining sequences of the mNAMs in the group of amplified mNAMs; and
iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
In some embodiments, obtaining sequences comprises obtaining composite sequences produced by assembling sequence reads of the mNAMs by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
b) mapping the composite sequences; and
wherein the counting comprises counting the number of jointly overlapping different composite sequences obtained.
In some embodiments, obtaining sequences comprises obtaining composite sequences produced by assembling sequence reads of the mNAMs by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
b) mapping the composite sequences; and
wherein the counting comprises counting the number of different composite sequences obtained.
The present invention also provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90%, to produce a group of mutagenized NAMs (mNAMs);
ii) amplifying the group of mNAMs;
iii) sequencing the mNAMs in the group of amplified mNAMs to obtain sequences of the mNAMs; and
iv) counting the number of different sequences obtained in step (iii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
The present invention also provides a method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid positions in the group of NAMs at a rate of 10% to 90%, to produce a group of mutagenized NAMs (mNAMs);
ii) amplifying the group of mNAMs;
iii) sequencing the mNAMs in the group of amplified mNAMs to obtain sequences of the mNAMs; and
iv) counting the number of different sequences obtained in step (iii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
In some embodiments, the sequencing comprises assembling sequence reads of the mNAMs into composite sequences by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
b) mapping the composite sequences; and
wherein the counting comprises counting the number of jointly overlapping different composite sequences obtained.
In some embodiments, the sequencing comprises assembling sequence reads of the mNAMs into composite sequences by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
b) mapping the composite sequences; and
wherein the counting comprises counting the number of different composite sequences obtained in step (iii).
In one or more embodiments, a sub-group of NAMs in the group of NAMs is determined to have substantially the same nucleotide sequence.
In one or more embodiments, the sub-group of NAMs is determined to have nucleotide sequences that are at least 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, or 99.9% identical.
In one or more embodiments, the nucleotide sequences of a sub-group of NAMs comprise a stretch of consecutive nucleotides having a sequence which includes at least two mutable positions and is i) identical to the sequence of a stretch of consecutive nucleotides within another NAM within the sub-group of NAMs, or ii) determined to be at least 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, or 99.9% identical to the sequence of a stretch of consecutive nucleotides within another NAM within the sub-group of NAMs.
In one or more embodiments, the counting comprises counting the number of different sequences that are determined to have substantially the same sequence except for their mutable positions, thereby determining the number of NAMs in the group of NAMs that had substantially the same sequence.
In one or more embodiments, the counting comprises counting the number of different sequences which lack substantially the same sequence in any stretch including at least two mutable positions, thereby determining the number of NAMs without substantially the same sequence in the group of NAMs.
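By way of non-limiting illustration, counting different sequences that are the same except at their mutable positions may be sketched in Python as follows. The sketch assumes C-to-T conversion and exact identity at the non-mutable positions, and it masks both C and T to a single placeholder so that sequences differing only by conversion fall into one group. The placeholder symbol, the function names, and the sequences are hypothetical.

```python
from collections import defaultdict


def mask_mutable(sequence, mutable="C", product="T", placeholder="Y"):
    """Replace every base that could be a converted or unconverted site."""
    return "".join(
        placeholder if base in (mutable, product) else base
        for base in sequence
    )


def counts_per_group(sequences):
    """Number of distinct sequences within each group sharing a masked form."""
    groups = defaultdict(set)
    for sequence in sequences:
        groups[mask_mutable(sequence)].add(sequence)
    return {masked: len(members) for masked, members in groups.items()}


sequences = [
    "ACGTCA", "ATGTCA", "ATGTTA",  # one locus, three mutation patterns
    "ACGTCA",                      # amplification duplicate of the first
    "GGCAAT",                      # a different locus
]
counts = counts_per_group(sequences)
print(counts)
```

Because thymines present in the original sequence are also masked, this grouping is coarser than the original sequences; a tolerance for mismatches at non-mutable positions would be needed to group sequences that are substantially, rather than exactly, the same.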
The present invention also provides a method for determining the number of different sequences in a group of nucleic acid molecules (NAMs) that have been mutagenized and then amplified comprising
a) obtaining the group of NAMs that have been mutagenized and then amplified;
b) obtaining sequences of the mutagenized NAMs (mNAMs) in the group of amplified mNAMs; and
c) counting the number of different sequences obtained in step (b),
thereby determining the number of different sequences in the group of amplified mNAMs.
The present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
i) obtaining two or more copies of the NAM;
ii) subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid positions in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM);
iii) amplifying each of the mcNAMs;
iv) obtaining composite sequences of the mcNAMs that are produced by assembling sequence reads of the amplified mcNAMs, such that when taken together, span as much as possible of the entire length of the NAM, by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, and
b) mapping the composite sequences,
thereby sequencing the NAM.
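As an illustration of step (iv)(a), the following Python sketch joins reads whose mutation patterns agree across their overlap. It is a simplified sketch under stated assumptions, not the assembly procedure of the specification: each read is represented as a hypothetical `(start, pattern)` pair, where `start` is an index in mutable-position coordinates and `pattern` is a string of `0` (unmutated) and `1` (mutated); sequencing is error-free; and a greedy first-match rule is used. The function names and the `min_overlap` parameter are illustrative.

```python
def overlap_matches(a, b, min_overlap=3):
    """True if reads a and b share at least min_overlap mutable positions
    and carry the same mutation pattern at every shared position."""
    (sa, pa), (sb, pb) = a, b
    lo = max(sa, sb)
    hi = min(sa + len(pa), sb + len(pb))
    if hi - lo < min_overlap:
        return False
    return pa[lo - sa:hi - sa] == pb[lo - sb:hi - sb]


def merge(a, b):
    """Merge two overlapping, pattern-compatible reads into one composite."""
    start = min(a[0], b[0])
    end = max(a[0] + len(a[1]), b[0] + len(b[1]))
    out = [None] * (end - start)
    for s, pattern in (a, b):
        for i, bit in enumerate(pattern):
            out[s - start + i] = bit
    return start, "".join(out)


def assemble(reads, min_overlap=3):
    """Greedily extend composites with reads whose overlaps match."""
    composites = []
    for read in sorted(reads):
        for i, comp in enumerate(composites):
            if overlap_matches(comp, read, min_overlap):
                composites[i] = merge(comp, read)
                break
        else:
            composites.append(read)
    return composites


# Reads from two mutated copies of the same molecule (10 mutable positions).
reads = [(0, "011010"), (3, "0100101"), (0, "101001"), (3, "0011100")]
print(assemble(reads))
```

Because the two copies carry different mutation patterns, reads from one copy are not joined to reads from the other even though the underlying unmutated sequence is the same.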
In some embodiments, a copy of the NAM is a partial copy of the NAM.
In some embodiments, a copy of the NAM has at least 50 bp of identical or complementary sequence to the NAM.
In some embodiments, a copy of the NAM is a complete copy of the NAM.
The present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
i) obtaining two or more copies of the NAM;
ii) subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid bases in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) ;
iii) amplifying each of the mcNAMs;
iv) obtaining composite sequences of the mcNAMs that are produced by assembling sequence reads of the amplified mcNAMs, such that when taken together, span as much as possible of the entire length of the NAM, by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, and
b) mapping the composite sequences,
thereby sequencing the NAM.
The present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
i) obtaining two or more copies of the NAM;
ii) subjecting each copy of the NAM in step (i) to a chemical mutagenesis that mutates only select nucleic acid bases in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) ;
iii) amplifying each of the mcNAMs;
iv) obtaining composite sequences of the mcNAMs that are produced by assembling sequence reads of the amplified mcNAMs, such that when taken together, span as much as possible of the entire length of the NAM, by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, and
b) mapping the composite sequences,
thereby sequencing the NAM.
In some embodiments, each of the two or more copies of the NAM has a unique primer at its 5' end and another unique primer at its 3' end.
In some embodiments, the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
The present invention also provides a method for determining genomic copy number information from genomic material, comprising,
i) obtaining segments of the genomic material; and
ii) determining the number of segments of the genomic material according to the method of the present invention, thereby determining genomic copy number information from genomic material .
The present invention also provides a method for profiling RNA transcripts, comprising
i) obtaining a group of RNA transcripts;
ii) determining the number of RNA transcripts in the group of RNA transcripts according to the method of the claimed invention; and
iii) determining the proportionate number of a plurality of RNA transcripts having the same sequence to a second different plurality of RNA transcripts that have the same sequence, thereby determining RNA transcript profile.
The present invention also provides a method for determining allelic imbalance, comprising
i) obtaining copy number of a first allele;
ii) obtaining copy number of a second allele; and
iii) comparing the copy numbers obtained in steps (i) and (ii) , thereby determining allelic imbalance, wherein the copy number in steps (i) and (ii) is obtained by the method of the present invention .
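The comparison in step (iii) can be sketched in Python as a simple fraction of template counts. This is an illustrative sketch only; the function name `allelic_fraction` and the example counts are hypothetical, and the specification does not prescribe a particular statistic.

```python
def allelic_fraction(count_first, count_second):
    """Fraction of counted templates that carry the first allele.
    A value near 0.5 suggests balance; values far from 0.5 suggest imbalance."""
    total = count_first + count_second
    if total == 0:
        raise ValueError("no templates were counted for either allele")
    return count_first / total


# Hypothetical template counts for two alleles at one locus.
print(allelic_fraction(30, 10))
```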
The present invention also provides a method for determining genome assembly, comprising
i) obtaining segments of a genome, wherein the segments span the entire length of the genome;
ii) sequencing the segments of the genome according to the method of the claimed invention;
iii) aligning the sequences obtained in step (ii) according to matching mutation patterns in overlaps of the sequences; and
iv) mapping the sequences,
thereby assembling the genome.
The present invention also provides a method for determining haplotype assembly, comprising
i) obtaining a group of alleles, wherein the alleles in the group of alleles are located in the same chromosome;
ii) sequencing each allele in the group of alleles according to the method of the claimed invention, and
iii) comparing the sequences obtained in step (ii) ,
thereby determining haplotype assembly.
In one or more embodiments, the rate of mutagenizing each mutable position of the NAMs in the group of NAMs is 25% to 75%.
In one or more embodiments, the rate of mutagenizing each mutable position of the NAMs in the group of NAMs is 40% to 60%.
In one or more embodiments, the rate of mutagenizing each mutable position of the NAMs in the group of NAMs is 50%.
In one or more embodiments, the proportion of all bases mutated in each mNAM is about 3% to 30%.
In one or more embodiments, the mutagenesis is by cytosine deamination .
In one or more embodiments, the mutagenesis is performed after binding template molecules to a bead or surface.
In one or more embodiments, biotinylated primers are attached to templates .
In one or more embodiments, templates linked to biotinylated moieties are attached to streptavidin beads .
In one or more embodiments, the mutagenesis further comprises beads and/or varietal tags .
In some embodiments, the cytosine deamination is induced by a bisulfite or a salt thereof. In some embodiments, the cytosine deamination is induced by enzymology .
In some embodiments, the cytosine deamination is induced by an activation-induced deaminase.
In one or more embodiments, the mutagenesis comprises contacting the group of NAMs with a depurination agent, transposase agent, or an alkylating agent.
In one or more embodiments, each mutable position of the NAMs comprises a cytosine (C) .
In one or more embodiments, the cytosine (C) is mutated to a uracil (U) or a thymine (T) .
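Partial cytosine mutagenesis at a given per-position rate can be simulated with the short Python sketch below. It assumes that each cytosine is converted independently and is read as thymine after amplification; the function name `partially_deaminate` and the example sequence are illustrative. For a sequence in which a fraction c of the bases are cytosines, a per-position rate p gives an expected fraction p × c of all bases mutated.

```python
import random


def partially_deaminate(seq, rate, rng):
    """Convert each C to T independently with probability `rate`.
    All other bases are left unchanged."""
    return "".join(
        "T" if base == "C" and rng.random() < rate else base
        for base in seq
    )


rng = random.Random(0)          # fixed seed so the run is repeatable
template = "ACGTCCGATCGACCTAGC"
# Three independently mutated copies of the same template.
for _ in range(3):
    print(partially_deaminate(template, 0.5, rng))
```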
In one or more embodiments, each NAM in the group of NAMs has a unique primer at its 5' end and another unique primer at its 3' end.
In one or more embodiments, the primer comprises one or more methylated cytosines.
In one or more embodiments, the primer comprises one or more phosphorothioated nucleotide bases .
In one or more embodiments, the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
In one or more embodiments, the primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATCT (Seq ID p5) .
In one or more embodiments, the primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine .
In one or more embodiments, the cytosines (C) are methylated. In one or more embodiments, the primer has the sequence: GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Seq ID p7)
In one or more embodiments, the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
In one or more embodiments, the primer has the sequence: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
In one or more embodiments, the cytosines (C) are methylated.
In one or more embodiments, the assembling further comprises aligning the sequences according to unique primers at the 5' and 3' ends .
In one or more embodiments, the sequence of each unique primer comprises a segment that is substantially the same sequence, and an amplification primer that is complementary to the shared sequence is used when amplifying the mNAMs or copy thereof.
In some embodiments, amplification is performed using a sequence- specific "wobble" primer.
In one or more embodiments, each unique primer comprises a unique tag sequence.
In one or more embodiments, the method further comprises the step of tagging each NAM or copy thereof.
In one or more embodiments, the tag lacks a nucleotide that is mutable by the mutagenesis.
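A tag that lacks the mutable nucleotide can be drawn from a reduced alphabet, as in the following Python sketch. It assumes that cytosine on the tagged strand is the only mutable nucleotide, so tags are drawn from A, G and T; the function names are hypothetical. Restricting the alphabet reduces the number of possible tags of length L from 4^L to 3^L.

```python
import random


def tag_space_size(length, alphabet="AGT"):
    """Number of distinct tags of the given length over the alphabet."""
    return len(alphabet) ** length


def random_tags(n, length, rng, alphabet="AGT"):
    """Draw n random tags that contain no mutable nucleotide.
    Assumption: C is the only base altered by the mutagenesis."""
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(n)]


rng = random.Random(0)
tags = random_tags(5, 6, rng)
print(tags)
print(tag_space_size(6), tag_space_size(15))
```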
In one or more embodiments, the NAM is within a mixture of DNA or RNA extracted from a cell. In some embodiments, the DNA or RNA extracted from the cell has been fragmented .
In some embodiments, the DNA or RNA extracted from the cell has been fragmented by mechanical shearing or one or more restriction enzymes.
In one or more embodiments, the restriction enzyme is PstI.
In one or more embodiments, fragmentation occurs before amplifying. In one or more embodiments, fragmentation occurs after amplifying.
In one or more embodiments, fragmentation occurs after mutagenesis. In one or more embodiments, the method of the claimed invention further comprises subjecting the fragmented DNA or RNA to end-repair.
In one or more embodiments, the method of the claimed invention further comprises subjecting the fragmented DNA or RNA to adenylation .
In one or more embodiments, the method of the claimed invention further comprises subjecting the fragmented DNA or RNA to ligation with methyl-cytosine adaptors, wherein the methyl-cytosine adaptors are bisulfite resistant sequencing adaptors.
In one or more embodiments, the NAM is a DNA molecule.
In some embodiments, the DNA molecule is a fragment of genomic DNA.
In one or more embodiments, the DNA molecule is a cDNA molecule.
In one or more embodiments, the NAM is an RNA molecule.
In one or more embodiments, the RNA molecule is an mRNA molecule. In one or more embodiments, the RNA molecule is a viral RNA molecule .
In one or more embodiments, the NAM is an RNA molecule derived from one or more cell lines.
In one or more embodiments, the method of the claimed invention further comprises reverse transcription of the NAM.
In one or more embodiments, the reverse transcription is with poly-T and methyl-cytosines , wherein the methyl-cytosines are resistant to bisulfite mutation.
In one or more embodiments, chemical mutagenesis occurs prior to reverse transcription.
In one or more embodiments, one or more NAMs in the group of NAMs has a length of one sequencing read length.
In one or more embodiments, one or more NAMs in the group of NAMs has a length of two or more sequencing read lengths .
In one or more embodiments, the length of the one or more NAMs is 2, 3, 4, 5, 6, 7, 8, 9, 10, or 10-3000 sequencing read lengths.
In one or more embodiments, the number of NAMs in the group of NAMs is about 2, 3, 4, 5, 6, 7, 8, 9, 10, or 10-10000.
In one or more embodiments, when the number of NAMs in the group of NAMs is greater than 10000, the method further comprises diluting the group of NAMs.
In one or more embodiments, the group of NAMs is diluted to comprise 1000 or more NAMs in the group of NAMs.
In one or more embodiments, the amplifying is by short-range or long-range polymerase chain reaction (PCR) . In one or more embodiments, the mutagen of the mutagenesis is diluted .
In one or more embodiments, the group of NAMs is incubated with a mutagen at an incubation temperature of about 70 to 78 degrees Celsius .
In one or more embodiments, the incubation temperature is about 73 degrees Celsius.
In one or more embodiments, the group of NAMs is incubated with a mutagen at an incubation time of about 3 to 45 minutes.
In one or more embodiments, the group of NAMs is incubated with a mutagen at an incubation time of about 5 to 20 minutes.
In one or more embodiments, the incubation time is about 3, 6, or 9 minutes .
In one or more embodiments, the incubation time is about 10 minutes.
The present invention also provides a kit for determining the number of NAMs in a group of NAMs comprising:
a) a mutagen; and
b) instructions for performing mutagenesis resulting in suboptimal mutagenesis,
wherein the mutagen is a bisulfite or a salt thereof, or a deamination agent.
In some embodiments, the bisulfite or salt thereof is NaHSO3.
In one or more embodiments, the mutagen induces cytosine deamination .
In one or more embodiments, the cytosine deamination is by enzymology .
In one or more embodiments, the mutagen is diluted. In one or more embodiments, the kit of the present invention further comprises a plurality of unique primers including:
a) a plurality of substantially unique primers suitable for ligation to the 5' of a NAM; and
b) a plurality of substantially unique primers suitable for ligation to the 3' of a NAM,
wherein the substantially unique primers comprise substantially unique tags.
In one or more embodiments, the kit of the present invention further comprises a DNA polymerase having 3' -5' proofreading activity.
In one or more embodiments, the plurality of substantially unique primers comprises 10^n primers, wherein n is an integer from 2 to 9.
In one or more embodiments, the substantially unique tags are at least 6 nucleotides long.
In one or more embodiments, the substantially unique tags are at least 15 nucleotides long.
In one or more embodiments, the substantially unique primers comprise sets of substantially unique primers having shared sample tags .
In one or more embodiments, the sample tags are at least 2 or 4 nucleotides long.
In one or more embodiments, the sequence of the substantially unique tag is not altered by the mutagen.
In one or more embodiments, the kit of the present invention further comprises a primer wherein the cytosines (C) are methylated.
In one or more embodiments, the methylated primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine. In one or more embodiments, the methylated primer has the sequence: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
The present invention also provides a composition of matter comprising a plurality of mutagenized nucleic acid molecules (mNAMs), wherein selected mutable nucleic acid positions in the plurality of mutagenized NAMs (mNAMs) are mutated at a rate of 10% to 90%.
In one or more embodiments, each mutable position is mutated at a rate of 25% to 75%.
In one or more embodiments, each mutable position is mutated at a rate of 40% to 60%.
In one or more embodiments, each mutable position is mutated at a rate of 50%.
In one or more embodiments, each mutable nucleic acid base is mutated at a rate of 25% to 75%.
In one or more embodiments, each mutable nucleic acid base is mutated at a rate of 40% to 60%.
In one or more embodiments, each mutable nucleic acid base is mutated at a rate of 50%.
In one or more embodiments, the proportion of all nucleic acids mutated in each mNAM is about 3% to 30%.
In one or more embodiments, the mutable nucleic acid position is a cytosine position of the mNAMs and the mutagenesis is deamination of the cytosine.
In one or more embodiments, the mutable nucleic acid base is a cytosine base of the mNAMs and the mutagenesis is deamination of the cytosine . In one or more embodiments, the deamination of the cytosine is induced by a bisulfite or a salt thereof.
In one or more embodiments, the deamination of the cytosine is induced by enzymology.
In one or more embodiments, the deamination of the cytosine is induced by an activation-induced deaminase.
In one or more embodiments, the mutable nucleic acid position is mutagenized by contacting the group of NAMs with a depurination agent, transposase agent, or an alkylating agent.
In one or more embodiments, each mutable position of the NAMs comprises a cytosine (C) .
In one or more embodiments, the cytosine (C) is mutated to a uracil (U) or a thymine (T) .
In one or more embodiments, each NAM in the plurality of NAMs has a unique primer at its 5' end and another unique primer at its 3' end.
In one or more embodiments, the primer comprises one or more methylated cytosines.
In one or more embodiments, the primer comprises one or more phosphorothioated nucleotide bases .
In one or more embodiments, the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
In one or more embodiments, the primer has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATCT (Seq ID p5) .
In one or more embodiments, the plurality of mNAMs bears a primer, wherein the sequence of the primer is: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), and wherein *T is a phosphorothioated thymine. In one or more embodiments, the cytosines (C) are methylated.
In one or more embodiments, all the cytosines (C) are methylated.
In one or more embodiments, the primer has the sequence: GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Seq ID p7) .
In one or more embodiments, the primer further comprises a 5'- phosphorylated, deoxyuridine-containing anchor-primer.
In one or more embodiments, the plurality of mNAMs bears a primer, wherein the sequence of the primer is: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
In one or more embodiments, the cytosines (C) are methylated.
The present invention also provides a composition of matter derived from sequencing primers having the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine.
The present invention also provides a composition of matter derived from sequencing primers having the sequence: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
The present invention also provides a composition of matter comprising two or more copies of a nucleic acid molecule (NAM) comprising two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, wherein each copy of the NAM has a unique primer at its 5' end and another unique primer at its 3' end, and is subjected to a mutagenesis that mutates each mutable position in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) , wherein the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
The present invention also provides sequence information produced by a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the claimed invention, or the group of mcNAMs in the method of the claimed invention.
The present invention also provides a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of the claimed invention, or the group of mcNAMs in the method of the claimed invention.
The present invention also provides a method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
i) obtaining two or more copies of the NAM;
ii) subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid positions in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) ;
iii) amplifying each of the mcNAMs;
iv) obtaining composite sequences of the mcNAMs that are produced by assembling sequence reads of the amplified mcNAMs, such that when taken together, span as much as possible of the entire length of the NAM, by de novo assembly,
thereby sequencing the NAM.
The present invention also provides a method for distinguishing between benign and malignantly transformed cells by detecting one or more single nucleotide polymorphisms (SNPs) in a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
The present invention also provides a method for distinguishing between benign and malignantly transformed cells by detecting one or more single nucleotide polymorphisms (SNPs) in a first and second sample from a subject comprising a method of the claimed invention.
The present invention also provides a method for determining the presence of tumor cells in a sample by comparing a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
The present invention also provides a method for determining the presence of tumor cells in a sample by comparing a first and second sample from a subject comprising a method of the claimed invention.
The present invention also provides a method for quantifying tumor cells in a sample by comparing a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
The present invention also provides a method for quantifying tumor cells in a sample by comparing a first and second sample from a subject comprising a method of the claimed invention.
The present invention also provides a method for detecting one or more rare mutations by comparing a sample from a subject and a reference sample from a control subject comprising a method of the claimed invention.
The present invention also provides a method for detecting one or more rare mutations by comparing a first and second sample from a subject comprising a method of the claimed invention.
In one or more embodiments, the sample is a blood sample, plasma sample, serum sample, tissue sample, or cell sample. In one or more embodiments, the tissue sample is from a tumor mass, surgically removed tumor mass, or margins of a surgically removed tumor mass.
The present invention also provides a method for detecting one or more rare mutations in a cell-free or substantially cell-free sample comprising a method of the claimed invention. The present invention also provides a method for determining whether a fetus has at least one or more rare mutations in a cell-free or substantially cell-free sample comprising a method of the claimed invention. In one or more embodiments, the sample is a maternal sample.
In one or more embodiments, the maternal sample is obtained from a member selected from: maternal blood, maternal plasma and maternal serum .
Each embodiment disclosed herein is contemplated as being applicable to each of the other disclosed embodiments. Thus, all combinations of the various elements described herein are within the scope of the invention .
It is understood that where a parameter range is provided, all integers within that range, and tenths thereof, are also provided by the invention. For example, "70 to 78 degrees Celsius" is a disclosure of 70.0 degrees Celsius, 70.1 degrees Celsius, 70.2 degrees Celsius, 70.3 degrees Celsius, 70.4 degrees Celsius, 70.5 degrees Celsius etc. up to 78.0 degrees Celsius.
Terms
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs.
As used herein, and unless stated otherwise or required otherwise by context, each of the following terms shall have the definition set forth below. As used herein, "about" in the context of a numerical value or range means ±10% of the numerical value or range recited or claimed, unless the context requires a more limited range.
The terms "nucleic acid molecule" and "sequence" are not used interchangeably herein. A "sequence" refers to the sequence information of a "nucleic acid molecule".
As used herein "mutable position" refers to the position of a nucleic acid that is susceptible to a given type of chemical mutagenesis.
As used herein "determining the number" refers to determining the lower bound number. As used herein "X%" with respect to mutation rate, refers to the probability percentage of mutagenesis per mutable position of the multiple mutable positions that are present in a plurality of nucleic acid molecules. Thus, 25% mutation rate means a 25% probability of mutagenesis.
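The reason the determined number is a lower bound can be made concrete with a short Python calculation. Under the assumption that each of R mutable positions is mutated independently with probability p, two templates receive the same mutation pattern with probability (p^2 + (1 - p)^2)^R, and the expected number of distinct patterns among T templates follows by summing over patterns. The function names are illustrative, and the independence assumption is a simplification.

```python
from math import comb


def pair_collision_probability(p, r):
    """Probability that two independently mutated templates with r mutable
    positions end up with identical mutation patterns."""
    return (p * p + (1 - p) * (1 - p)) ** r


def expected_distinct_patterns(t, p, r):
    """Expected number of distinct mutation patterns among t templates.
    A pattern with k mutated positions has probability p**k * (1-p)**(r-k),
    and there are comb(r, k) such patterns."""
    total = 0.0
    for k in range(r + 1):
        prob = p ** k * (1 - p) ** (r - k)
        total += comb(r, k) * (1 - (1 - prob) ** t)
    return total


# With a 50% rate and 10 mutable positions, two templates collide 1 time in 1024.
print(pair_collision_probability(0.5, 10))
# The expected distinct count never exceeds the true template count.
print(expected_distinct_patterns(100, 0.25, 20))
```

The collision probability is smallest at p = 0.5, which is consistent with the embodiments that recite a mutation rate of 50%.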
The terms "template", "nucleic acid", and "nucleic acid molecule", are used interchangeably herein, and each refers to a polymer of deoxyribonucleotides and/or ribonucleotides. "Nucleic acid" shall mean any nucleic acid, including, without limitation, DNA, RNA and hybrids thereof. The nucleic acid bases that form nucleic acid molecules can be the bases A, C, G, T and U, as well as derivatives thereof. As used herein "contig" and "contiguous" refer to a set of overlapping sequences or sequence reads.
As used herein, the term "amplifying" refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid. Amplifying a nucleic acid molecule typically includes denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. The denaturing, annealing and elongating steps each can be performed once. Generally, however, the denaturing, annealing and elongating steps are performed multiple times (e.g., polymerase chain reaction (PCR) ) such that the amount of amplification product is increasing, often times exponentially, although exponential amplification is not required by the present methods. Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme. The term "amplified nucleic acid molecule" refers to the nucleic acid sequences, which are produced from the amplifying process as defined herein.
As used herein, the term "bisulfite mutagenesis" refers to the mutagenesis of nucleic acid with a reagent used for the bisulfite conversion of cytosine to uracil. Examples of bisulfite conversion reagents include but are not limited to treatment with a bisulfite, a disulfite or a hydrogensulfite compound.
As used herein, the term "mapping" refers to identifying a location on a genome or cDNA library that has a sequence which is substantially identical to or substantially fully complementary to the sequence of a nucleic acid molecule. The nucleic acid molecule may be, but is not limited to the following: a segment of genomic material, a cDNA, a mRNA, or a segment of a cDNA.
As used herein, the term "methylation" refers to the covalent attachment of a methyl group at the C5-position of the nucleotide base cytosine. The term "methylation state" refers to the presence or absence of 5-methyl-cytosine ("5-Me") at one or a plurality of CpG dinucleotides within a DNA sequence. A methylation site is a sequence of contiguous linked nucleotides that is recognized and methylated by a sequence specific methylase. A methylase is an enzyme that methylates (i.e., covalently attaches a methyl group) one or more nucleotides at a methylation site.
As used herein, the term "read" or "sequence read" refers to the nucleotide or base sequence information of a nucleic acid that has been generated by any sequencing method. A read therefore corresponds to the sequence information obtained from one strand of a nucleic acid fragment. For example, a DNA fragment where sequence has been generated from one strand in a single reaction will result in a single read. However, multiple reads for the same DNA strand can be generated where multiple copies of that DNA fragment exist in a sequencing project or where the strand has been sequenced multiple times. A read therefore corresponds to the purine or pyrimidine base calls or sequence determinations of a particular sequencing reaction.
As used herein, the terms "sequencing", "obtaining a sequence" or "obtaining sequences" refer to nucleotide sequence information that is sufficient to identify or characterize the nucleic acid molecule, and could be the full length or only partial sequence information for the nucleic acid molecule.
As used herein, the term "reference genome" refers to a genome of the same species as that being analyzed for which genome the sequence information is known.
As used herein, the term "region of the genome" refers to a continuous genomic sequence comprising multiple discrete locations.
As used herein, the term "sample tag" refers to a nucleic acid having a sequence no greater than 1000 nucleotides and no less than two that may be covalently attached to each member of a plurality of tagged nucleic acid molecules or tagged reagent molecules. A "sample tag" may comprise part of a "tag." As used herein, the term "segments of genomic material" refers to the nucleic acid molecules resulting from fragmentation of genomic DNA.
As used herein, "substantially the same" sequences have at least about 80% sequence identity or complementarity, respectively, to a nucleotide sequence. Substantially the same sequences may have at least about 95%, 96%, 97%, 98%, 99% or 100% sequence identity or complementarity, respectively.
As used herein, the term "substantially unique primers" refers to a plurality of primers, wherein each primer comprises a tag, and wherein at least 50% of the tags of the plurality of primers are unique. Preferably, the tags are at least 60%, 70%, 80%, 90%, or 100% unique tags.
As used herein, the term "substantially unique tags" refers to tags in a plurality of tags, wherein at least 50% of the tags of the plurality are unique to the plurality of tags. Preferably, substantially unique tags will be at least 60%, 70%, 80%, 90%, or 100% unique tags.
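The 50% threshold in the two definitions above can be checked with a few lines of Python. This sketch assumes that a tag is "unique" when it occurs exactly once in the plurality; the function names are hypothetical.

```python
from collections import Counter


def unique_fraction(tags):
    """Fraction of tags in the plurality that occur exactly once."""
    counts = Counter(tags)
    return sum(1 for tag in tags if counts[tag] == 1) / len(tags)


def is_substantially_unique(tags, threshold=0.5):
    """True if at least `threshold` of the tags are unique (default 50%)."""
    return unique_fraction(tags) >= threshold


tags = ["AAG", "AAG", "AGT", "TTA"]   # two of the four tags are unique
print(unique_fraction(tags), is_substantially_unique(tags))
```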
As used herein, the term "tag" refers to a nucleic acid having a sequence no greater than 1000 nucleotides and no less than two that may be covalently attached to a nucleic acid molecule or reagent molecule. A tag may comprise a part of an adaptor or a primer.
As used herein, a "tagged nucleic acid molecule" refers to a nucleic acid molecule which is covalently attached to a "tag." Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
As used herein, the term "wobble base pairing" with regard to two complementary nucleic acid sequences refers to the base pairing of G to uracil U rather than C, when one or both of the nucleic acid strands contains the ribonucleobase U.
As used herein, the term "substantially fully complementary" with regard to a sequence refers to the reverse complement of the sequence allowing for both Watson-Crick base pairing and wobble base pairing, whereby G pairs with either C or U, and A pairs with either U or T. A sequence may be substantially complementary to the entire length of another sequence, or it may be substantially complementary to a specified portion or length of another sequence. One of skill in the art will recognize that the U may be present in RNA, and that T may be present in DNA. Therefore, a U within an RNA sequence may pair with A or G in either an RNA sequence or a DNA sequence, while an A within either of a RNA or DNA sequence may pair with a U in a RNA sequence or T in a DNA sequence . As used herein, a "wobble" primer is a set of primers where the sequence at mutable positions is equally likely to match the original base or the mutated base .
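The set of sequences covered by a "wobble" primer can be enumerated as in the Python sketch below. It assumes that cytosine is the mutable base and thymine the mutated base, and it works on the target-strand sequence for simplicity; an actual primer would be designed against the appropriate strand. The function names are illustrative. In IUPAC nucleotide notation, a position that may be C or T is written Y.

```python
from itertools import product


def wobble_variants(seq, original="C", mutated="T"):
    """List every sequence obtained by keeping or mutating each mutable base."""
    options = [(base, mutated) if base == original else (base,) for base in seq]
    return ["".join(choice) for choice in product(*options)]


def degenerate_notation(seq, original="C", code="Y"):
    """Write the same set compactly using an IUPAC degenerate code."""
    return seq.replace(original, code)


print(wobble_variants("ACGC"))
print(degenerate_notation("ACGC"))
```

A sequence with n mutable positions yields 2^n variants, so an equimolar mix of the variants matches the original or the mutated base with equal likelihood at each mutable position.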
All publications and other references mentioned herein are incorporated by reference in their entirety, as if each individual publication or reference were specifically and individually indicated to be incorporated by reference. Publications and references cited herein are not admitted to be prior art.
This invention will be better understood by reference to the Experimental Details which follow, but those skilled in the art will readily appreciate that the specific experiments detailed are only illustrative of the invention as defined in the claims which follow thereafter.
Experimental Details
Examples are provided below to facilitate a more complete understanding of the invention. The following examples illustrate the exemplary modes of making and practicing the invention. However, the scope of the invention is not limited to specific embodiments disclosed in these Examples, which are for purposes of illustration only.
Methods - Counting Templates (Single Read Length)
To enable a wide range of simulations, a library of Python programs was developed (Example 11) to simulate mutation, sequencing, counting and assembly of distinct templates under the assumptions of error-free sequencing, perfect mapping, and uniformity of mutation sites, mutation rate, sequence coverage, and DNA amplification. Our ability to recover template count and assembly depends on the depth of read sampling, typically called "coverage". Coverage usually means the average number of reads overlapping a position in the reference genome; herein, however, coverage means the average read depth over a position per template.
To measure template count over a single fixed read length for T templates, mutation was simulated to generate T random template patterns with R bits and a flip rate of p. To measure recovery under infinite coverage, the number of distinguishable mutation patterns observed was counted (Example code in Section Z). To measure recovery under finite coverage, reads were generated by sampling a fixed number of patterns with replacement from the pool before counting distinguishable patterns (Example code in Section Z). Amplification distortion could be modeled by altering the sampling procedure; however, results are restricted to the case of uniform sampling. Simulations explore mutation rates in the range of 0.05 to 0.35, template counts in the range of 2 to 1,024, and read lengths with 10, 20, and 30 mutable bits. For each condition, one hundred simulations were performed and the average recovery under infinite coverage was recorded (Figure 3, panel A). For each of those hundred simulations, one hundred finite samplings were performed for each coverage level of 1x to 5x per template and the mean count observed was recorded. Shown in Figure 3, panel B are the results for the mutation rate of 0.35.
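The single-read-length simulation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of Section Z; the function names (`mutate_templates`, `count_infinite_coverage`, `count_finite_coverage`) and the example parameter values are assumptions chosen for illustration.

```python
import random


def mutate_templates(num_templates, bits, flip_rate, rng):
    """Return one random bit pattern (tuple of 0/1) per template.

    Each of the `bits` mutable positions flips independently with
    probability `flip_rate`.
    """
    return [
        tuple(1 if rng.random() < flip_rate else 0 for _ in range(bits))
        for _ in range(num_templates)
    ]


def count_infinite_coverage(patterns):
    """With exhaustive coverage every pattern is seen, so the estimate of
    template count is the number of distinguishable patterns."""
    return len(set(patterns))


def count_finite_coverage(patterns, coverage, rng):
    """Sample coverage * T patterns with replacement (uniform sampling),
    then count the distinguishable patterns among the sampled reads."""
    num_reads = int(coverage * len(patterns))
    reads = [rng.choice(patterns) for _ in range(num_reads)]
    return len(set(reads))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    pool = mutate_templates(num_templates=256, bits=20, flip_rate=0.35, rng=rng)
    print("infinite coverage:", count_infinite_coverage(pool))
    for c in (1, 2, 3, 4, 5):
        print(f"{c}x coverage:", count_finite_coverage(pool, c, rng))
```

Averaging these counts over repeated runs, as in the text, would give the mean recovery for each condition.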
The first class of applications focuses on the general problem of determining absolute template count. This is important for determining the copy number of genomic DNA, measuring mRNA expression levels, quantifying allele bias, and detecting somatic mutations. To obtain an accurate count, the protocol requires mutagenesis prior to amplification. Amplification could be either short- or long-range PCR but must occur before fragmentation if needed for library preparation. The number of possible mutation patterns should exceed the template number to obtain the most accurate count. Hence, only cases where the absolute template count is below the low thousands were considered; the other cases are addressed in the Discussion.
Example 1. Counting Templates (Single Read Length)
The simplest formulation is the case where the templates span a single fixed read length and coverage is exhaustive (Figure 3, panel A). In this case, the estimate of template count is merely the number of distinguishable mutation patterns observed.
The number of possible patterns depends on the number of bits per read, and the probability of observing a given pattern depends also on the flip rate. The optimal flip rate for generating distinct mutational patterns is 0.5, wherein every pattern is equally likely. However, for a window of at least 20 bits, corresponding to a read length of 80 base pairs, a rate of 0.25 is still virtually perfect for template counts in the thousands. Similar efficacy is obtained at a flip rate of 0.15 for a 30-bit window. Templates numbering in the thousands are adequate for genome copy number determination or single-cell transcript profiling. In Section Z, example code is provided to simulate performance under a variety of conditions. The recovery of template count was also demonstrated subject to varying depth of coverage for a fixed flip rate of 0.35 (Figure 3, panel B). For each simulation, T initial template molecules were mutated, creating a pool of patterns. At a coverage level of c, c × T patterns were selected with replacement from the pool and the number of distinct patterns was recorded. For high coverage, on the order of 4x per original template molecule, recovery of count is nearly perfect. At lower coverage the count is underestimated, the inevitable consequence of undersampling. These simulations assume uniform PCR amplification and provide an upper bound on performance.
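The dependence on flip rate and window size can be made concrete with a short calculation. Under the model of independent bit flips, two templates agree at one bit with probability p² + (1 − p)², so they share an entire R-bit pattern with probability (p² + (1 − p)²)^R, which is smallest at p = 0.5. The following sketch computes this quantity and the expected number of colliding template pairs; the function names are hypothetical and the calculation is offered only as an aid to intuition, not as part of the simulations reported here.

```python
from math import comb


def pair_collision_probability(flip_rate, bits):
    """Probability that two independently mutated templates carry the
    same pattern over a window of `bits` mutable positions."""
    same_bit = flip_rate ** 2 + (1.0 - flip_rate) ** 2
    return same_bit ** bits


def expected_colliding_pairs(num_templates, flip_rate, bits):
    """Expected number of template pairs that are indistinguishable,
    i.e. (T choose 2) times the pairwise collision probability."""
    return comb(num_templates, 2) * pair_collision_probability(flip_rate, bits)


if __name__ == "__main__":
    for p, r in ((0.5, 20), (0.25, 20), (0.15, 30)):
        print(p, r, pair_collision_probability(p, r))
```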
Methods - Counting Templates (Multiple Read Lengths)
When the template length exceeds a single window, a few simple graph formalisms are used. Two reads are said to "conflict" if they map to overlapping positions and their overlaps fail to agree. A "compatible read graph" was defined to be one in which each distinguishable read is treated as a "vertex" and two vertices are "compatible" and connected by an "edge" if the reads could have originated from the same template. In the case of finite coverage, two reads are compatible if they do not conflict. In the case of infinite coverage, there is a stronger constraint: two reads at a distance of d mutable bits apart are compatible if all d - 1 distinct, non-conflicting intermediate reads were observed (Example code in Section Z). All edges inherit a direction from the orientation of the reference template such that each vertex has in-edges extending from one end of the template and out-edges extending toward the other end.
A path in this graph represents a possible partial assembly of an initial template pattern. Consequently, determining the minimum number of templates needed to explain all of the reads is achieved by finding the minimum number of paths such that every vertex in the graph is included in at least one path. This is known as the minimum path cover and in general is a nondeterministic polynomial time hard (NP-hard) problem. However, under the assumption of perfect read mapping to a linear genome, our graph is not only directed, but also acyclic. By Dilworth's theorem the minimum number of covering paths is equivalent to the maximum number of elements in an antichain (Dilworth, 1950). In other words, the minimum number of templates needed to explain the reads is equal to the maximum number of reads that are pairwise incompatible. Using Konig's theorem (Konig, 1931), this problem was solved by finding a maximum matching in a new bipartite graph constructed by splitting each vertex in two (an "in" vertex that receives in-edges and an "out" vertex from which out-edges extend). Finding a maximum matching in a bipartite graph is then solvable in polynomial time by the Hopcroft-Karp algorithm (Hopcroft and Karp, 1973).
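For the finite-coverage case, the conflict test and the construction of the compatible read graph might look as follows. This is a minimal sketch under stated assumptions: a read is represented as a hypothetical `(start, bits)` pair, where `start` is the index of its first mutable bit along the reference and `bits` is a tuple of 0/1 values; edges are directed from the read with the smaller start position to the read with the larger one.

```python
def conflict(read_a, read_b):
    """True if the two reads overlap and disagree somewhere in the overlap."""
    start_a, bits_a = read_a
    start_b, bits_b = read_b
    lo = max(start_a, start_b)
    hi = min(start_a + len(bits_a), start_b + len(bits_b))
    # If lo >= hi the reads do not overlap and therefore cannot conflict.
    return any(bits_a[i - start_a] != bits_b[i - start_b] for i in range(lo, hi))


def compatible_read_graph(reads):
    """Map each distinguishable read to the list of compatible reads that
    start further along the reference (finite-coverage definition)."""
    vertices = sorted(set(reads))  # sorted by start position, then by bits
    edges = {v: [] for v in vertices}
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            if u[0] < v[0] and not conflict(u, v):
                edges[u].append(v)
    return edges
```

Because every edge points toward a strictly larger start position, the resulting graph is acyclic by construction.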
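The reduction from path cover to bipartite matching can be illustrated with a short sketch. For simplicity it uses a plain augmenting-path matching (Kuhn's algorithm) rather than Hopcroft-Karp, which gives the same matching size with a better worst-case running time; the function names and the adjacency-dictionary input format are assumptions. The number of vertex-disjoint paths needed to cover a directed acyclic graph equals the number of vertices minus the size of a maximum matching between "out" copies and "in" copies of the vertices.

```python
def max_bipartite_matching(edges):
    """Size of a maximum matching between "out" copies (keys of `edges`)
    and "in" copies (successors), found by repeated augmenting paths.

    edges: dict mapping every vertex to a list of its successors.
    """
    match_in = {}  # "in" copy of v -> the "out" copy u currently matched to it

    def try_augment(u, seen):
        for v in edges[u]:
            if v in seen:
                continue
            seen.add(v)
            # Use v if it is free, or if its current partner can be re-routed.
            if v not in match_in or try_augment(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    return sum(1 for u in edges if try_augment(u, set()))


def min_path_cover(edges):
    """Minimum number of vertex-disjoint paths covering all vertices of a DAG."""
    return len(edges) - max_bipartite_matching(edges)
```

For example, a chain `{"a": ["b"], "b": ["c"], "c": []}` is covered by one path, while two reads with no edge between them require two. The recursive search is adequate for small graphs; larger inputs would call for an iterative or Hopcroft-Karp implementation.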
As in the case of a fixed read length, mutation was simulated to generate a pool of template patterns. For a finite coverage level of c, read length R, and T initial templates of length L, c × T × (L / R) reads were generated by drawing uniformly with replacement from the set of templates and read start positions. These reads were used to build a compatible read graph, which was converted to a bipartite graph, and Hopcroft-Karp was applied to find a maximum matching.
In the case of finite coverage, all reads that do not overlap are by definition compatible. This implies that the maximum antichain (composed of reads that are all pairwise incompatible) must consist of reads that all overlap one another and therefore all start within a single interval of length R. This fact permits a significant computational simplification: instead of computing the maximum antichain for the entire graph, the computation can be restricted to smaller subgraphs of reads whose start positions are contained in the interval [nR, (n + 2)R] for n = 0, ..., (L / R) - 1; a maximum antichain is identified over each interval, and the maximum size over all intervals is computed (Example code in Section Z). For a fixed mutation rate of 0.35, template counts were simulated in the range of 2 to 1,024, with reads of 10, 20, and 30 bits, templates that are 16 read lengths long, and a coverage of 1x to 5x. For each condition, 100 simulations were performed and the mean number of templates recovered was reported.
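The read-sampling step can be sketched as below. The function name `sample_reads`, the `(start, bits)` read representation, and the choice to draw start positions only where a full-length read fits within the template are assumptions made for illustration; template and read lengths are measured in mutable bits.

```python
import random


def sample_reads(templates, read_bits, coverage, rng):
    """Draw c * T * (L / R) reads uniformly with replacement.

    templates: list of equal-length bit tuples (length L bits each).
    read_bits: read length R in bits.
    Returns a list of (start, bits) reads.
    """
    num_templates = len(templates)
    template_bits = len(templates[0])
    num_reads = int(round(coverage * num_templates * template_bits / read_bits))
    reads = []
    for _ in range(num_reads):
        template = rng.choice(templates)
        # Assumption: only start positions that yield a full-length read.
        start = rng.randrange(template_bits - read_bits + 1)
        reads.append((start, template[start:start + read_bits]))
    return reads
```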
Example 2. Counting Templates (Multiple Read Lengths)
More often, the length of the template will exceed a single read length. In this case, there is no fixed window over which to count distinct mutational patterns. So instead of looking for the number of unique (and hence incompatible) patterns over a fixed window, a maximum set of pairwise incompatible reads was determined. This problem can be precisely stated in the language of graph theory. More specifically, reads were treated as vertices, and connected by directed "compatible" edges whenever two reads can derive from the same template. The direction of the edge is determined by the orientation of the reference template. The result is a directed acyclic graph. By Dilworth's theorem (Dilworth, 1950), the size of the largest set of pairwise incompatible reads is the same as the minimum number of paths covering all vertices of this graph. By Konig's theorem (Konig, 1931) this is a problem of maximum matching in a bipartite graph, which is computationally tractable and solvable in polynomial time with the Hopcroft-Karp algorithm (Hopcroft and Karp, 1973). In Figure 3 panel C, the results are shown for a fixed mutation rate of 0.35, a template that spans 16 read lengths, for depths of coverage ranging from 1x to 5x per template, and for template counts ranging from 2 to 1,024. Under similar conditions of flip rate and coverage, performance in counting over long templates is comparable to performance in counting over a fixed window, as described above.
The simulations of this section provide guidelines for (i) genome wide copy number determination, (ii) transcript profiling, and (iii) determining allelic ratios. To determine copy number, the ratio of the count for a given locus to the median count over the remainder of the genome was measured. For transcript profiling, the proportionate counts of each gene transcript were measured. To determine allelic imbalance, the ratio of counts from templates distinguishable by at least one SNV was measured. In the context of RNA, this also enables observation of biased allele expression resulting from chromosome inactivation, imprinting and the like.
Methods - Assembling Templates
In the counting problem, a conservative estimate was established for the initial number of templates by allowing all compatible edges between reads. For the assembly problem, however, high probability assemblies were to be established. Consequently, the compatible edge graph was restricted to a sub-graph composed of the best edges. When coverage is exhaustive, a very precise definition of best edges can be formulated. Two R-bit reads were joined by an edge if they overlap and agree for R - 1 bits. Consequently, two tags of a template are exactly matched if and only if every such R - 1 bit string across its span has a unique pattern. When the condition is not met (for example, due to too many templates and/or too few flipped bits), connected end tags that were not exactly matched were found (Example code in Section Z).
When coverage is finitely sampled, there are many ways to select the best edges for a subgraph (Figure 5). A simple scoring method was applied that assigns a weight to an edge that reflects how unlikely it is that the reads agree on their mutation patterns by chance. Given two reads that overlap and agree on a window of size M with K bits flipped, the edge was weighted by $-\log\left(p^{K}(1-p)^{M-K}\right)$, where p is the flip rate. All of the edges were then iterated through in order of decreasing weight. An edge e from A to B is selected for inclusion in the subgraph if no edge from A or to B has already been selected that has a weight strictly greater than that of e. This procedure was carried out for each simulation, and all exact matches were extracted from the resultant subgraph. Unlike the case for exhaustive coverage, exact matches may be incorrect. Because this is a simulation, the truth was known, and the numbers of exact matches that are correct and incorrect were recorded. Whether the exact path is also true was also recorded (Example code in Section Z).
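Under exhaustive coverage, the best-edge rule (two R-bit reads that overlap and agree for R - 1 bits) reduces to linking each read to the reads starting one bit later whose first R - 1 bits equal its last R - 1 bits. A minimal sketch follows, assuming the hypothetical `(start, bits)` read representation with equal-length bit tuples; it is not the code of Section Z.

```python
def best_edges_exhaustive(reads):
    """Return sorted (read, next_read) pairs that overlap and agree on R - 1 bits.

    reads: iterable of (start, bits) with all bit tuples of equal length R.
    """
    by_start = {}
    for start, bits in set(reads):
        by_start.setdefault(start, []).append(bits)
    edges = []
    for start, group in by_start.items():
        for a in group:
            for b in by_start.get(start + 1, []):
                # Suffix of the earlier read must equal prefix of the later one.
                if a[1:] == b[:-1]:
                    edges.append(((start, a), (start + 1, b)))
    return sorted(edges)
```

When two templates share the same R - 1 bit string at some position, a read gains more than one outgoing edge, which is the situation in which end tags become connected without being exactly matched.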
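The edge weighting and greedy selection described above can be sketched as follows. The function names and the `(source, target, weight)` edge representation are assumptions for illustration. Because edges are visited in order of decreasing weight, the first edge selected from a given source (or into a given target) carries the largest weight selected there, so a later edge is rejected exactly when that recorded weight is strictly greater than its own; ties are kept.

```python
import math


def edge_weight(overlap_bits, flip_rate):
    """-log(p^K (1 - p)^(M - K)) for an agreed overlap of M bits, K of them flipped."""
    m = len(overlap_bits)
    k = sum(overlap_bits)
    return -(k * math.log(flip_rate) + (m - k) * math.log(1.0 - flip_rate))


def select_best_edges(weighted_edges):
    """Greedy selection of best edges.

    weighted_edges: iterable of (source, target, weight) tuples.
    Returns the selected edges in order of decreasing weight.
    """
    best_out = {}  # source -> largest weight already selected from it
    best_in = {}   # target -> largest weight already selected into it
    selected = []
    for a, b, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        if best_out.get(a, w) > w or best_in.get(b, w) > w:
            continue  # a strictly heavier edge from a or to b was already chosen
        selected.append((a, b, w))
        best_out.setdefault(a, w)
        best_in.setdefault(b, w)
    return selected
```

With a flip rate below 0.5, an overlap containing more flipped bits receives a larger weight, since flipped bits are the rarer and therefore more informative state.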
Example 3. Assembling Templates
The second class of applications is to correctly assemble reads by their mutation patterns in order to recover the proper end-to-end sequence of nearly identical templates, as desired when determining haplotype phasing or enumerating transcript isoforms. Long templates each tagged uniquely at both ends were considered to simulate the more general task of determining how many initial templates can be correctly assembled from end to end from the mutation pattern alone (Figure 2). Following mutagenesis, amplification, fragmentation, and sequencing, reads with overlapping mutation signatures were connected to assemble a path from one tag to the other. Whereas in the previous application all compatible edges between reads were allowed, for this problem a subgraph was built with only the "best edges" between overlapping reads. A pair of tags is "exactly matched" if there is a path in the subgraph that connects them and neither tag is connected to another tag. Such a path is called an "exact path." If two tags originate from the same template, they are a "true match." A "true path" is an exact path for which every read originates from the same initial template.
Determining performance for the general task provides a lower bound on performance for other applications, because if there is an exact path that is also true, then all sequence information for that template was correctly observed. This includes haplotype phasing in the case of genomic data and transcript structure in the case of RNA profiling. In fact, these two applications are less demanding than the general task because there will only be a few template varieties and each template variety provides additional sequence information for distinguishing them.
The case of exhaustive coverage was considered to establish the best performance that can be expected for a given set of conditions. In this case, the best edges are naturally defined as the set of compatible edges that overlap for all but one bit. In Figure 4 panel A, the effect of flip rate on template assembly was explored. At a flip rate of 0.35 and 30 bits per read, performance on 32 template molecules is nearly perfect for even the longest span tested, 2^10 (1,024) read lengths, in excess of 100 kilobases.
With exhaustive coverage, best edges are unambiguously defined and have the property that exactly matched pairs and exact paths are also true. However, for finite coverage, there is no natural definition of best edges, and exactly matched pairs and exact paths may not be true. For finite coverage, weights were assigned to edges by likelihood and a greedy algorithm was used to select the best subgraph (detailed above and illustrated in Figure 5).
Figure 4 panel B explores the effect of coverage (2x to 14x per template) on recovery of exact matches as a function of template length (2 to 1,024 read lengths), for 32 templates, a 30 bit read length, and a flip rate of 0.35.
In Figure 4 panel C, the template length was fixed to 32 read lengths to show recovery as a function of the number of templates, from 2 to 1,024, under the same conditions. Because these are simulations and the ground truth was known, the proportions of exact matches that are true and false can be determined, and these numbers are shown. All exact matches are connected by an exact path, and for the conditions explored here, virtually all true exact matches are connected by a true path. At the maximum level of coverage, nearly all the true paths were determined, even for spans of 2^10 (1,024) read lengths. For haplotype phasing, it is sufficient to have a single true path, and this is attainable with high probability at a lower coverage, 10x per template.
Methods - Partial Bisulfite Mutagenesis
Partial bisulfite mutagenesis was obtained in single-stranded phiX174 genomic DNA using the MethylEase™ Xceed Rapid DNA Bisulphite Modification Kit (Human Genetic Signatures). The full conversion protocol was modified by changing the incubation temperature to 73 degrees Celsius (from 80 degrees Celsius) and the incubation time to 10 minutes (instead of 45 minutes). Four regions were amplified to measure the conversion rate.
Using the MethylEase™ Xceed Rapid DNA Bisulphite Modification Kit (Human Genetic Signatures), the default incubation temperature (default 80 degrees Celsius) and time (default 45 minutes) were modified. A range of temperatures was tested (65, 73 and 80 degrees Celsius) and a range of times was tested (0, 3, 10, 30, 45 minutes).
The remainder of the protocol remains the same. After bisulfite treatment, four specific regions of phiX were amplified with "wobble" primers, which contain a mixture of oligonucleotides such that the mutable positions are guanine (G) or adenine (A) for the forward primers and cytosine (C) or thymine (T) for the reverse primers. After the addition of A-tails and ligation of Illumina sequencing adapters, the products were sequenced, and the resulting reads were mapped back to the phiX genome using a simple dictionary mapping method in which the genome and all reads are fully converted before mapping.
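A dictionary mapping of this kind can be sketched as follows. This is an illustrative sketch under simplifying assumptions that are not taken from the protocol: reads have a single fixed length, only exact matches on the forward strand are considered, and ambiguous reads are discarded. The function names are hypothetical.

```python
def fully_convert(seq):
    """Convert every C to T, as if bisulfite conversion had gone to completion."""
    return seq.upper().replace("C", "T")


def build_index(genome, read_length):
    """Dictionary from each fully converted substring of length `read_length`
    to the list of genome positions at which it occurs."""
    converted = fully_convert(genome)
    index = {}
    for pos in range(len(converted) - read_length + 1):
        index.setdefault(converted[pos:pos + read_length], []).append(pos)
    return index


def map_read(read, index):
    """Return the unique genome position of the read, or None if the fully
    converted read is absent from the index or maps to several positions."""
    hits = index.get(fully_convert(read), [])
    return hits[0] if len(hits) == 1 else None
```

Because both the genome and the read are fully converted before lookup, a partially converted read maps to the same position regardless of which of its cytosines happened to convert.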
The best results (60% conversion) were obtained at an incubation of 10 minutes at 73 degrees Celsius. A conversion rate of 20% was also obtained by incubating for 3 minutes at 73 degrees Celsius.
Methods - Custom sequencing primers
p5.mC:
ACACTCTTTCCCTACACGACGCTCTTCCGATC*T
The sequencing primer (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine to protect the ends from degradation by exonuclease.
p7.mC:
/5Phos/GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G
The sequencing primer (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine to protect the ends from degradation by exonuclease, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
Example 4. Partial Bisulfite Mutagenesis
The experimental results from partial bisulfite mutagenesis of a phiX template genome are shown in Figure 6. The full conversion protocol was modified by changing the incubation temperature to 73 degrees Celsius (from 80 degrees Celsius) and the incubation time to 10 minutes (instead of 45 minutes). Four regions were amplified to measure the conversion rate. Figure 6 panel A depicts the rate at which each base is observed over all the data for those positions with a coverage of at least 30 reads. Nearly all the C positions are at 40% C and 60% T. For each of the four regions amplified, conversion patterns were compared between reads. Not all regions are equally well covered in the data, with 40 to 4,500 reads. Figure 6 panel B depicts the cumulative distribution of the conversion rate per read. Figure 6 panel C depicts correlation in flips for all cytosines in the targeted region. No correlation was found. Figure 6 panel D depicts the data of Figure 6 panel C as a histogram.
For the best amplified position, it is clear that partial conversion was achieved, with a 60% flip rate randomly distributed throughout each read.
Discussion - Examples 1-4
Presently, inferring the long-range structure of DNA templates is limited by short read length, and accurate template counts suffer from distortions occurring during polymerase chain reaction (PCR) amplification. The utility of introducing random mutations in identical or nearly identical templates was explored to create distinguishable patterns that are inherited during subsequent copying. Simulations of the applications of this process were performed under assumptions of error-free sequencing and perfect mapping, employing cytosine deamination as a model for mutation. The simulations demonstrate that within readily achievable conditions of nucleotide conversion and sequence coverage, accurately counting the number of otherwise identical molecules as well as connecting variants separated by long spans of identical sequence can be achieved. Potential applications include transcript profiling, isoform assembly, haplotype phasing, RNA expression analysis, copy number determination and de novo genome assembly.
Counting varietal tags can be used to mitigate the effects of amplification bias. While the original message is completely recoverable, the tag is confined to one end of the molecule such that identity and count can only be distinguished within one read length of the ends. Only reads that include the tag are useful in determining count and varietal tags provide no solution for assembly and assortment.
There are also protocols designed to aid sequence assembly of regions that are resistant due to base composition or repeat structure, which rely on misincorporation of artificial nucleotides during amplification (Keith et al., 2004a; Keith et al., 2004b; Mitchelson, 2011). Implementation of this method requires many steps and is not amenable to high-throughput processing. In addition to such technical hurdles, the method is not suitable for counting, as it loses track of template count.
Described herein are different approaches for counting and assembling templates using template mutagenesis. The non-limiting examples herein demonstrate by simulation that template mutagenesis can solve both the problems of counting and assembly. For any application, the order of operation is mutagenesis first, followed by short- or long-range PCR, then fragmentation, if needed, and preparation of sequence libraries. Two classes of applications were explored. The first is counting specific DNA or RNA molecules, for assessing genome copy number or profiling a transcriptome (Figure 1) . The second is sequence assembly - for example establishing haplotypes or distinguishing transcript isoforms (Figure 2) .
The simulations model partial bisulfite mutagenesis of single stranded DNA (Shortle and Botstein, 1983): each mutable position (or "bit") converts (or "flips") independently from a wild-type state to an altered state with a fixed probability (or "flip rate"). Performance was simulated under a variety of reasonable parameters for read length and mutation rate, and over a range of template lengths and counts. The results are first presented under an assumption of complete coverage to obtain a theoretical upper limit of performance, and the consequences of sampling to various levels of coverage are then considered. In the simulations, mutable positions are distributed uniformly throughout the template such that each read contains the same number of bits (or "bit length"). Sequencing and mapping errors are not presently incorporated. Variations to these assumptions and procedures are addressed herein.
A feasibility study and guide is presented for template mutagenesis as an enhancement to sequence based analysis. Such methods introduce random mutations to create distinguishable patterns in previously identical or nearly identical templates. These patterns are inherited in copies of the template, and portions of patterns remain in fragmented copies. Provided these fragments overlap and there is sufficient diversity in mutational patterns over a read length, the structure of each original template can be inferred, thus overcoming the loss of connectivity resulting from short read lengths. The accuracy of template counting also improves, undistorted by biased amplification. The applications of this process were simulated under assumptions of error-free sequencing, perfect mapping, uniformly distributed reads, and uniform bit distribution. The simulations suggest that within readily achievable conditions the number of otherwise identical molecules can be accurately counted, and variants separated by long spans of identical sequence can be connected. There are many potential applications, ranging from transcription profiling and isoform assembly to haplotype phasing and de novo genome assembly.
To illustrate one application of the results, characterizing single cell gene expression was considered. A mammalian cell has ~350,000 mRNA transcripts with an average length of 1,500 nucleotides (Alberts, 1994). Single strand cDNA would be randomly mutagenized using cytidine deamination before amplification, fragmenting, and sequencing. Assuming uniformity of sequence coverage and uniformity of PCR amplification, 9 million 100 base pair paired-end reads yields an average of 4x coverage per template. Given the typical distribution of cytidine in mammalian genomes, 30 is a reasonable estimate for the number of mutable positions in a read pair. From Figure 3 panel C, it is clear that RNA templates can be counted with near perfect accuracy for mRNA species of intermediate to scarce expression (< 1,000 copies per cell).
Furthermore, many mammalian genes are alternatively spliced, resulting in a diversity of isoforms observable as different patterns of exon inclusion in the mRNA. At a read depth of 10x per template, or 23 million reads, counts can be determined not only for transcripts per gene, but also for expression at the level of individual isoforms: even without the additional information gained by observing alternative splice junctions, a true path can be assembled from one end of the molecule to the other. This can be accomplished for all but the most abundant genes and very long transcripts (greater than 6,000 nucleotides; Figure 4, panel C). In contrast, varietal tagging can achieve accurate counting of gene transcripts, even long and abundant ones, but it is limited to labeling the end of a molecule and so does not allow counting of isoforms or observing sequence variants, except near the ends of transcripts. The two methods, varietal tagging and mutagenesis, can be seamlessly integrated, achieving the benefits of both methods.
Another direct application of template mutation is discriminating haplotypes in an individual. Using existing short-read technology, many of the heterozygous positions can be readily identified; however, because the polymorphisms are typically more than one read length apart, the proper phasing of alleles cannot be determined. The following procedure for phase recovery is proposed: first perform partial deamination on high-molecular-weight DNA with a 0.35 flip rate, then dilute it to 30 initial templates per region for each of the two strands, then amplify with randomly primed PCR, and fragment as needed for preparing sequencing libraries. Under the assumption of uniformly distributed coverage (Figure 4 panel B), it is observed that coverage of 10x per initial template would be sufficient to recover phase information for variants separated by hundreds of kilobases.
Further, the ability to establish phase by this method depends on strong concordance between the haplotypes and the reference genome. For those regions where the reference genome is a poor match, due to repeat content, large-scale rearrangements or novel sequence, the mutation pattern assembly algorithm will fail to generate consistent end-to-end assemblages. Although this presents a problem for direct inference by reference-matched phasing, it provides an opportunity for de novo haplotype assembly. The SUTTA algorithm (Narzisi and Mishra, 2011) assembles haplotypes from short-read data by scoring proposed local assemblies based on orthogonal data sources, such as coverage, mate pairs, or physical maps. Template mutagenesis can help. Each local reference genome that SUTTA considers can also be assigned a score based on the number of successful end-to-end mutation pattern assemblies over the region. The result would be a de novo assembly over the human genome for those difficult regions.
In the simulations, focus is on a specific form of mutation: the conversion of cytidine to uridine by deamination. This conversion can be achieved either chemically through bisulfite treatment (Shortle and Botstein, 1983) or enzymatically through activation-induced deaminase (Bransteitter et al., 2003). The advantage of deamination is that conversion patterns are predictable. Moreover, because bisulfite treatment is widely used in DNA methylation assays, the computational tools for mapping deaminated sequence reads are readily available. Still, other methods of mutagenesis, such as depurination, transposition, alkylating agents, or inducing replication error in the first template copy might be useful in some contexts.
In the simulations perfect mapping was assumed. In practice, however, the ability to map reads might be somewhat degraded by template mutation. For a deamination protocol, a standard practice is to map to a reduced alphabet where all cytosines ("C"s) are converted to thymines ("T"s) in both the read and the reference, with two distinct reference genomes, one for each DNA strand (Krueger et al., 2011; Otto et al., 2012). Clearly, restricting to a smaller alphabet and doubling the reference genomes impacts the ability to unambiguously map reads; however, the effect is surprisingly mild (Krueger et al., 2012). If increased mapping efficiency is needed, the mapping algorithm can be augmented with a probabilistic model of the flip rate to prioritize the most likely alignments.
In the simulations no sequence error was assumed, but methods will be necessary for handling such errors. Aside from its effects on mapping, sequence error may reduce the ability to recover mutation patterns in those cases where the error appears to flip a bit or reverse a flipped bit. Fortunately, sequence error is typically rare. Within a reasonable range for flip rate, window size, and template count, sparse mutational patterns are expected, well separated so that no two patterns are very much alike. Sequence error will produce a pattern that is "nearby" an established pattern and less well covered, and this signature can be used to discount those reads.
The simulations demonstrate that most applications work best for a low initial template count, less than a few thousand. This is not a problem for many genomic applications and is close to ideal for single-cell RNA analysis. If analysis of greater numbers of template molecules is desired, for example during analysis of bulk mRNA, then after mutagenesis of the first strand cDNA, multiple separate amplification reactions can be performed, each with low template count. The products of each reaction can be tagged with barcodes, pooled and sequenced.
Example 5. Assembling Templates
A mutational protocol, muSeq, was established for partial bisulfite conversion. The description of the method is very similar to that established for the phiX samples discussed above. To establish the operating characteristics of the protocol, a study was performed on a PstI digest of a human genome.
The experiment is described here.. Genomic DNA is digested with a restriction enzyme (PstI) . Fragments are end-repaired, adenylated, and then ligated to bisulfite resistant sequencing adapters. These adapters match the standard Illumina adapters, save that the cytosines are replaced by methyl-cytosines . The sample is then treated with a standard kit for bisulfite treatment (MethylEasy Xceed Rapid DNA Bisulphite Modification Kit Mix; Human Genetic Signatures.) Instead of incubating the sample for the standard of 80°C for 45 minutes, 3, 6, and 9 minutes at 73°C were tested. One library using the standard 80°C and 45 minutes was also generated. The samples were sub-sampled, amplified and sequenced.
The resulting reads were mapped to the genome using an informatics pipeline designed for bisulfite sequence data. First, the read-pairs are fully converted: for read 1, every C is converted to a T, and for read 2, every G to an A. The converted read-pairs are then mapped twice, once to a genome where every C is converted to a T and once to a genome where every G is converted to an A. The best mapping is assigned to the original read-pair and the mapped genome recorded. Reads that map to the AGT genome are called "original top" or "OT" and are templates derived from the top strand of the initial restriction fragment. Similarly, the reads that map to the ACT genome are called "original bottom" or "OB." The analysis focused on 135 thousand fragments with high quality alignments in the range of 150 to 400 base pairs.
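The in-silico conversion step can be sketched in a few lines. This is only an illustration of the letter substitutions described above; the function names are hypothetical, the aligner itself is not shown, and the rule that a higher score means a better mapping is an assumption.

```python
def convert_read_pair(read1, read2):
    """Full in-silico conversion: read 1 C->T, read 2 G->A."""
    return read1.upper().replace("C", "T"), read2.upper().replace("G", "A")

def convert_genome(seq):
    """Return the two fully converted versions of a reference sequence."""
    seq = seq.upper()
    return {"C_to_T": seq.replace("C", "T"),   # the three-letter A/G/T genome
            "G_to_A": seq.replace("G", "A")}   # the three-letter A/C/T genome

def assign_strand(score_ct, score_ga):
    """Pick the better of the two mappings (higher score assumed better)."""
    return "OT" if score_ct >= score_ga else "OB"

r1, r2 = convert_read_pair("ACGT", "ACGT")
refs = convert_genome("acgt")
```

The original, unconverted read-pair is kept alongside the converted one so that the mutation pattern can be read off after the mapping is chosen.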
The first observation, demonstrated in Figure 7, is that the same mutational rate can be reliably established by setting the incubation time. Also, the mutational rates fall within the desired range.
Each restriction fragment/strand provides an opportunity to observe multiple reads derived from the same initial template. In error-free data, one need only cluster reads that have precisely the same pattern. However, because of errors in sequencing and PCR amplification, a robust method was developed for joining reads derived from the same initial template. Information was extracted from all convertible positions, and reads were then clustered using a multi-scale clustering algorithm that works on pair-wise Hamming distances [arXiv:1506.03072 (the clustering method devised is available at arvix.org/abs/1506.03072)]. An example of clustered reads at a single restriction fragment for the original top strand is shown in Figure 8.
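To make the idea of joining reads by pair-wise Hamming distance concrete, the sketch below uses plain single-linkage clustering with a fixed distance cutoff. It is a deliberately simplified stand-in and is not the multi-scale algorithm of the cited preprint; the cutoff of one mismatch and the function name are assumptions.

```python
import numpy as np

def cluster_patterns(bits, max_dist=1):
    """Single-linkage clustering of bit patterns by Hamming distance.

    bits: 2-D array-like (reads x convertible positions) of 0/1 values.
    Returns one cluster label per read, numbered in order of appearance.
    """
    bits = np.asarray(bits, dtype=bool)
    n = len(bits)
    # pair-wise Hamming distances between all reads
    dist = (bits[:, None, :] != bits[None, :, :]).sum(axis=2)
    parent = list(range(n))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= max_dist:
                parent[find(j)] = find(i)
    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

reads = [[0, 1, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 1],   # one mismatch: plausibly an error copy
         [1, 0, 0, 1, 0, 0],
         [1, 0, 0, 1, 0, 0]]
labels = cluster_patterns(reads)
```

Single linkage can chain distinct templates together when patterns are dense, which is one reason a fixed cutoff is only a rough approximation of a multi-scale method.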
To study the properties of the mutational sequencing more carefully, the 6 minute conversion sample was sequenced at great depth, on two lanes of a NextSeq. When cytosine (or bit) positions were examined, the mutations were found to be uncorrelated. For 100,000 randomly selected pairs of bits, the Fisher exact test was computed and compared to the theoretical expectation that the two sites are independent. The resulting Q-Q plot is linear, suggesting that the observed distribution does not diverge from the null expectation and that deamination events are random and independent (Figure 9).
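For a single pair of bit positions, the test reduces to a 2x2 table of reads: both bits flipped, only the first, only the second, or neither. A minimal standard-library sketch of a two-sided Fisher exact test follows; the function name and the table layout are assumptions made for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    For a pair of bit positions: a = reads with both bits flipped,
    b and c = reads with exactly one flipped, d = reads with neither.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # sum over all tables that are at least as unlikely as the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

Under independence the p-values from many such pairs should be close to uniform, which is what a linear Q-Q plot against the null expectation indicates.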
The conditions of digestion, amplification, and sequencing mean that coverage is not uniform across loci. However, on the whole, the counts of templates from the autosomes and the X chromosome differ at a ratio of two to one, as expected in our male sample (Figure 10). To study copy number in the context of identical sequence and coverage bias, positions in the sample that are heterozygous were examined. Heterozygous loci are nearly identical in sequence and so suffer the same distortions. When the observed distribution was computed using template counts, alleles were found to conform to the expected binomial distribution (Figure 11). The fit is far less good when the reads were used instead of the template counts (Figure 12). This suggests that template count is superior for comparing copy number against a known difference, for example, counting the relative proportion of a rare SNP or determining copy number in the presence of a reference.
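The binomial expectation at a heterozygous site can be written down directly. The sketch below computes a two-sided binomial p-value for an allele split, assuming each template is drawn from either allele with probability 0.5; the function names are hypothetical and the example counts are made up.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def allele_two_sided_p(ref_count, alt_count, p=0.5):
    """Two-sided binomial p-value for an allele split at a heterozygous site."""
    n = ref_count + alt_count
    p_obs = binomial_pmf(ref_count, n, p)
    # sum over all splits that are at least as unlikely as the observed one
    total = sum(q for q in (binomial_pmf(k, n, p) for k in range(n + 1))
                if q <= p_obs * (1 + 1e-9))
    return min(1.0, total)

balanced = allele_two_sided_p(5, 5)    # an even split of 10 templates
skewed = allele_two_sided_p(10, 0)     # all 10 templates from one allele
```

Applying the same calculation to raw read counts treats PCR copies of one template as independent draws, which is a plausible reason for the poorer fit seen with reads.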
Finally, collapsing reads with the same pattern was found to reduce the error in each template. By examining reads from homozygous positions, 99.92% of consensus read positions were found to have greater than 80% of base pairs showing the homozygous base. This typically reduces sequencing error rates by 100-fold.
Conversion of cDNA
The mutational protocol can be applied to cDNA as well. While these data are less well studied, the preliminary results are very promising. Taking whole RNA derived from cell lines, the mRNA was reverse transcribed with poly-T and template switch oligo primers that are resistant to bisulfite mutation (methyl-cytosines substituted for cytosines). The resulting first strand cDNA was mutagenized with the muSeq protocol for 6 minutes at 73°C. The mutated strands were then sub-sampled, amplified, sonicated, repaired, ligated to sequencing adapters, amplified, and sequenced.
The resulting reads were then converted two ways (read 1 C → T, read 2 G → A; and read 1 G → A, read 2 C → T) and mapped to two versions of the human genome using the STAR mapper (Dobin, et al., STAR: ultrafast universal RNA-seq aligner, Bioinformatics, 2012), much as described above. The best of the four mappings was selected and assigned to the original read. Plots showing stacks of reads in the IGV viewer are shown in Figures 13 and 14.
Preliminary results were obtained suggesting that long transcript structures can be recovered without mapping the reads at all but relying instead on methods of de novo assembly.
Example 6. Error Reduction
Using the high-coverage data with the 6 minute incubation and 5 rounds of post-bisulfite linear amplification, the properties of the consensus sequences derived by clustering reads with the same mutation pattern were examined. Only genomic positions that are not bits and are confidently homozygous in our sample were considered. For each consensus sequence base, the proportion of reads reporting the homozygous base (homozygous base fraction) was determined. Greater than 99.99% of consensus bases have a majority of reads in agreement with the homozygous base. Under more stringent requirements, it was found that 99.92% of consensus bases match the homozygous base for greater than 80% of reads in the template (Figure 13). These values bound the error rate, as some of the disagreement is likely due to true somatic mutation. Nevertheless, these error rates are orders of magnitude below sequencer error, which is commonly observed at about one base error per position per read.
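The homozygous base fraction and the consensus call for one template can be sketched as follows. The example assumes the reads of a template are already clustered, aligned, and of equal length; the function names and the toy reads are hypothetical.

```python
from collections import Counter

def consensus_with_support(reads):
    """Column-wise consensus of aligned, equal-length reads from one template.

    Returns the consensus string and, for each position, the fraction of
    reads that agree with the consensus base.
    """
    consensus, support = [], []
    for column in zip(*reads):
        base, n = Counter(column).most_common(1)[0]
        consensus.append(base)
        support.append(n / len(column))
    return "".join(consensus), support

def homozygous_base_fraction(reads, position, hom_base):
    """Fraction of a template's reads reporting the known homozygous base."""
    return sum(r[position] == hom_base for r in reads) / len(reads)

template_reads = ["ACGTA", "ACGTA", "ACCTA", "ACGTA", "ACGTA"]
cons, support = consensus_with_support(template_reads)
fraction = homozygous_base_fraction(template_reads, 2, "G")
```

In this toy template one read in five carries an error at the third position, yet the consensus still reports the homozygous base.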
Methodological improvements
1. Direct mutagenesis of RNA. The RNA sequence can be directly mutagenized before reverse transcription.
2. Mutagenesis on beads or surfaces. By binding template molecules to a bead or surface, for instance by attaching biotinylated primers to the templates and binding with streptavidin beads, the yield of post-mutation templates may be improved. Multiple independent amplifications from the bound targets may also be performed to circumvent PCR error.
3. Binding linear products of mutagenized templates to beads. Similar to 2.
Example 7. Detecting one or more variants
Example 7A - Detecting tumor cells
A sample is obtained from a subject afflicted with cancer. The sample is subjected to a chemical mutagenesis as described herein. The mutagenized sample is sequenced, aligned, mapped, and counted as described herein.
The presence of tumor cells in the sample is determined. Also, the tumor cells in the sample are quantified. Also, one or more rare mutations in the sample are determined. Also, one or more single nucleotide polymorphisms in the sample are determined. Also, benign and malignantly transformed cells are distinguished.
Example 7B - Detecting a small load of cancer DNA in the presence of an excess of normal DNA
A sample is obtained from a subject afflicted with cancer. The sample is subjected to a chemical mutagenesis as described herein. The sample is further subjected to beads and/or varietal tags. The mutagenized sample is sequenced, aligned, mapped, and counted as described herein.
The presence of tumor cells in the sample is determined. Also, the tumor cells in the sample are quantified. Also, one or more rare mutations in the sample are determined. Also, one or more single nucleotide polymorphisms in the sample are determined. Also, benign and malignantly transformed cells are distinguished.
Example 7C - Detecting Prenatal Abnormalities
A sample is obtained from a pregnant female. The sample is subjected to a chemical mutagenesis as described herein. The mutagenized sample is sequenced, aligned, mapped, and counted as described herein. One or more rare mutations in a fetus are determined. Also, one or more single nucleotide polymorphisms in a fetus are determined. Also, one or more chromosomal abnormalities in a fetus are determined.
Example 8. Sensitive detection of mutations
The error reduction described above in Example 6 is used in conjunction with the beads and/or varietal tags to obtain sequence counts for rare variants.
Example 9. De novo transcriptome assembly
A group of RNA transcripts in one cell or a population of cells is obtained. The group of RNA transcripts is subjected to a chemical mutagenesis and sequenced as described herein.
An assembly algorithm is applied to the sequences of the group of RNA transcripts. In some embodiments, the assembly algorithm may be SOAPdenovo-Trans, Velvet/Oases, Trans-ABySS, or Trinity transcriptome assemblers. A transcriptome is obtained without mapping to a reference genome.
Example 10. Metagenomics
A group of NAMs is obtained from genomes from a large variety of organisms, wherein some of the organisms may be highly related. The group of NAMs is subjected to a chemical mutagenesis and fragmentation as described herein.
The group of mutagenized fragments of NAMs is sequenced. In some embodiments, the sequencing may be metagenome sequencing, shotgun sequencing, or high-throughput sequencing.
An assembly algorithm is applied to the sequences of the group of mutagenized fragments of NAMs. In some embodiments, the assembly algorithm may be Phrap, Celera, or Velvet/Oases assemblers. Independent genomes are assembled.
Example 11. Python programs developed for a wide range of simulations
fixed_window_finite_coverage
Performs simulations testing the recovery of template count
over a fixed window for a range of flip_rates, read_lengths, and template_counts
assuming finite coverage

import sys
import numpy as np
from support_functions import getRandomPatterns, toBinary, tabprint

## fixes a random seed
seed = hash("This is not a random seed.")
np.random.seed(seed)

## input parameters
numsim = 100  ## template simulations
covsim = 100  ## sampling reads
read_lengths = [10, 20, 30]
flip_rates = np.linspace(0, 0.35, 8)[1:]
template_counts = 2**np.arange(1, 11)
coverages = np.arange(1, 17)

headings = ['R', 'ρ', 'T', 'coverage', 'mean_distinct']
print("\t".join(headings))

## for each read length, flip rate and number of templates
for read_length in read_lengths:
    for flip_rate in flip_rates:
        for template_count in template_counts:
            ## perform numsim sims
            unique_counts = [list() for _ in range(len(coverages))]  ## for storing results
            for sim in range(numsim):
                ## generate template patterns and convert to binary representation as int
                templates = getRandomPatterns(template_count, read_length, flip_rate)
                templates_binary = np.array([toBinary(x) for x in templates])
                ## for each level of coverage
                for ind, coverage in enumerate(coverages):
                    num_reads = coverage * template_count
                    ## covsim times
                    for csim in range(covsim):
                        ## sample uniformly from the templates
                        read_template = np.random.randint(0, template_count, num_reads)
                        ## consider just the observed templates
                        templates_observed = np.unique(read_template)
                        ## and ask how many are unique, saving the result
                        unique_count = len(np.unique(templates_binary[templates_observed]))
                        unique_counts[ind].append(unique_count)
            ## after the sims are done
            for ind, coverage in enumerate(coverages):
                ## compute mean for each coverage level and print
                mean_unique = np.mean(unique_counts[ind])
                info = [read_length, flip_rate, template_count, coverage, mean_unique]
                print(tabprint(info))
                sys.stdout.flush()
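For the limiting case in which every template is observed, the expected number of distinct patterns has a closed form that can serve as a cross-check on simulations of this kind. The sketch below assumes each bit of a window is flipped independently with the same probability; the function name is hypothetical and the formula is a standard occupancy argument rather than something stated in the programs.

```python
from math import comb

def expected_distinct_patterns(template_count, read_length, flip_rate):
    """Expected number of distinct bit patterns among template_count templates.

    Each of the read_length bits is flipped independently with probability
    flip_rate. All patterns with k flipped bits share one probability, so
    the sum runs over k and weights each term by the number of such patterns.
    """
    total = 0.0
    for k in range(read_length + 1):
        p = flip_rate**k * (1 - flip_rate)**(read_length - k)
        # probability that a given k-flip pattern appears at least once
        total += comb(read_length, k) * (1 - (1 - p)**template_count)
    return total

example = expected_distinct_patterns(2, 1, 0.5)
```

With a one-bit window and a flip rate of 0.5, two templates collide half the time, so the expected number of distinct patterns is 1.5.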
fixed_window_infinite_coverage
Performs simulations testing the recovery of template count
over a fixed window for a range of flip_rates, read_lengths, and template_counts
assuming exhaustive coverage

import sys
import numpy as np
from support_functions import getRandomPatterns, toBinary, tabprint

## fixes a random seed
seed = hash("This is not a random seed.")
np.random.seed(seed)

## input parameters
numsim = 100
read_lengths = [10, 20, 30]
flip_rates = np.linspace(0, 0.5, 11)[1:]
template_counts = 2**np.arange(1, 11)

headings = ['R', 'ρ', 'T', 'mean_distinct']
print("\t".join(headings))

all_data = []
all_info = []
## for each read length, flip rate and number of templates
for read_length in read_lengths:
    for flip_rate in flip_rates:
        for template_count in template_counts:
            unique_counts = []
            ## perform numsim simulations
            for sim in range(numsim):
                ## generate template patterns and convert to binary representation as int
                templates = getRandomPatterns(template_count, read_length, flip_rate)
                templates_binary = np.array([toBinary(x) for x in templates])
                ## count how many are unique and save answer
                unique_count = len(np.unique(templates_binary))
                unique_counts.append(unique_count)
            ## compute mean of sims and output results
            mean_unique = np.mean(unique_counts)
            info = [read_length, flip_rate, template_count, mean_unique]
            all_info.append((read_length, int(100*flip_rate), template_count))
            all_data.append(unique_counts)
            print(tabprint(info))
            sys.stdout.flush()

many_windows_finite_coverage_assembly
Performs simulations testing the recovery of template assembly over a template of many read lengths for a fixed flip_rate, and a range of read_lengths, and template_counts,
and the number of read lengths in the template
and a range of template coverage.
Scores edges based on likelihood of accidental agreement
and starting with the most unlikely edges,
adds a new edge if neither vertex already has a partner
in that direction with a better score.
Allows multiple edges when scores exactly agree
and so acts like the infinite case in the limit of total coverage.

import numpy as np
import sys
import time
from support_functions import getRandomPatterns, getMatch, \
    getWordSpace, getOneCount, getScoreTable, \
    greedyAssembly, pathScore, tabprint
from collections import defaultdict

## input parameters
read_length = 30
seed = hash("This is not a random seed.")
flip_rate = 0.35
span_lengths = (2**np.arange(5, 6)) * read_length
template_counts = 2**np.arange(1, 11)
coverages = 2*np.arange(1, 9)

## fixes a random seed
np.random.seed(seed)

headings = ["template_length", "span_length", "read_length",
            "flip_rate", "template_count", "coverage", "true_exact_match",
            "false_exact_match", "true_path", "false_path", "time"]
print("\t".join(headings))

## for each read length, flip rate and number of templates
for span_length in span_lengths:
    for template_count in template_counts:
        ## we generate a template with a read length on either side of the span
        template_length = 2*read_length + span_length  ## length of template
        left_edge = read_length  ## left edge of span (first read position to not contain left mark)
        right_edge = span_length  ## right edge of span (last read position to not contain right mark)
        ## generate template patterns
        templates = getRandomPatterns(template_count, template_length, flip_rate)
        ## make unique matched markers at the ends of the span
        Apos_marks = np.arange(template_count)
        Bpos_marks = np.arange(template_count)
        Apos = read_length - 1
        Bpos = read_length + span_length
        marks = {Apos: Apos_marks, Bpos: Bpos_marks}
        ## compute matches between templates over all possible windows
        match = getMatch(templates, read_length, marks=marks)
        ## find space of unique reads
        word_space = getWordSpace(match)
        ## get count of ones, for every template, position, and window size
        one_count = getOneCount(templates, read_length)
        ## get score lookup table
        score_table = getScoreTable(read_length, flip_rate)
        ## maximum read start position contained in the template
        max_start = template_length - read_length + 1
        for ind, coverage in enumerate(coverages):
            t0 = time.time()
            num_reads = int(coverage * template_count * (float(template_length) / float(read_length)))
            ## generate reads
            read_template = np.random.randint(0, template_count, num_reads)
            read_start = np.random.randint(0, max_start, num_reads)
            ## re-index reads from template space to word space
            ## word space assigns ambiguous reads to their least template
            read_index = [word_space[x] for x in zip(read_template, read_start)]
            ## reverse lookup to track which (index, start) derives from which templates
            ## needed to check for true paths
            read_tracker = defaultdict(set)
            for rstart, rtemplate, rindex in zip(read_start, read_template, read_index):
                read_tracker[(rstart, rindex)].add(rtemplate)
            ## keep only unique read_index, read_start pairs
            readlocs = np.zeros(shape=(template_count, max_start), dtype=bool)
            readlocs[read_index, read_start] = True
            read_start, read_index = np.where(readlocs.T)
            ## make a greedy assembly
            edge_in, edge_out, inscore = greedyAssembly(read_start, read_index, match, one_count,
                                                        score_table, read_length, template_length)
            ## score the result
            true_exact_match, false_exact_match, true_path, false_path = pathScore(
                edge_in, edge_out, read_start, read_index, left_edge, right_edge, read_tracker)
            ## and output
            timediff = time.time() - t0
            info = [template_length, span_length, read_length, flip_rate, template_count, coverage,
                    true_exact_match, false_exact_match, true_path, false_path, timediff]
            print(tabprint(info))
            sys.stdout.flush()
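The edge-scoring idea in the description above can be illustrated in isolation. The score of an overlap is taken here as the log-probability that an unrelated template shows the same bits by chance, and edges are accepted from least to most likely accidental. This is a simplified sketch under assumptions: the listing of greedyAssembly is not shown in this section, and both function names below are hypothetical.

```python
import math

def overlap_log_score(overlap_length, ones, flip_rate):
    """Log-probability that an unrelated template matches this overlap by chance."""
    return (ones * math.log(flip_rate)
            + (overlap_length - ones) * math.log(1 - flip_rate))

def greedy_edges(candidates):
    """candidates: list of (score, source, target); lower score is better.

    An edge is kept unless its source already has a strictly better
    out-edge or its target a strictly better in-edge. Exact ties are kept.
    """
    best_out, best_in, kept = {}, {}, []
    for score, src, dst in sorted(candidates):
        if best_out.get(src, score) < score or best_in.get(dst, score) < score:
            continue
        best_out[src] = score
        best_in[dst] = score
        kept.append((src, dst))
    return kept

long_overlap = overlap_log_score(20, 7, 0.35)
short_overlap = overlap_log_score(10, 3, 0.35)
edges = greedy_edges([(-12.9, "a", "b"), (-6.2, "a", "c"),
                      (-12.9, "a", "d"), (-8.0, "e", "b")])
```

Longer overlaps, and overlaps with more flipped bits when the flip rate is below 0.5, receive lower scores and are therefore trusted first.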
many_windows_finite_coverage_counting
Performs simulations testing the recovery of template count over a template of many read lengths for a fixed flip_rate, and a range of read_lengths, and template_counts,
and the number of read lengths in the template
and a range of template coverage.
For speed, finds minimum cover in overlapping intervals of length 2R and takes the maximum over all such windows.
We can do this because the maximum antichain
must be contained in an interval of length R
and must therefore be included in at least one interval.
This provides a speed advantage for long templates.

import numpy as np
from support_functions import getRandomPatterns, getMatch, \
    getWordSpace, overlapMatch, tabprint
from hopcroft_karp import bipartiteMatch
import bisect
import sys

## input parameters
seed = hash("This is not a seed.")
flip_rate = 0.35
read_lengths = [10, 20, 30]
template_factors = (2**np.arange(1, 5))
template_counts = 2**np.arange(1, 11)
coverages = np.arange(1, 11)

## fixes a random seed
np.random.seed(seed)

headings = ["template_length", "read_length", "flip_rate",
            "template_count", "coverage", "min_cover"]
print("\t".join(headings))

## for each read length
for read_length in read_lengths:
    ## for each template factor
    for template_factor in template_factors:
        template_length = read_length*template_factor
        ## for each number of templates
        for template_count in template_counts:
            ## generate template patterns
            templates = getRandomPatterns(template_count, template_length, flip_rate)
            ## compute matches between templates over all possible windows
            match = getMatch(templates, read_length)
            ## find space of unique reads
            word_space = getWordSpace(match)
            ## maximum read start position contained in the template
            max_start = template_length - read_length + 1
            for ind, coverage in enumerate(coverages):
                num_reads = int(coverage * template_count *
                                (float(template_length) / float(read_length)))
                ## generate reads
                read_template = np.random.randint(0, template_count, num_reads)
                read_start = np.random.randint(0, max_start, num_reads)
                ## re-index reads from template space to word space
                ## word space assigns ambiguous reads to their least template
                read_index = [word_space[x] for x in zip(read_template, read_start)]
                ## keep only unique read_index, read_start pairs
                readlocs = np.zeros(shape=(template_count, max_start), dtype=bool)
                readlocs[read_index, read_start] = True
                read_start, read_index = np.where(readlocs.T)
                ## we will look at overlapping intervals of length 2R
                s = np.arange(0, template_length-read_length, step=read_length)
                e = s + 2*read_length+1
                ## appropriate indices in read_starts
                A = [bisect.bisect_left(read_start, x) for x in s]
                B = [bisect.bisect_left(read_start, x) for x in e]
                ## set minimum cover to 0
                min_cover = 0
                ## for each interval
                for (a, b) in zip(A, B):
                    ## get read starts and indices
                    rs = read_start[a:b]
                    ri = read_index[a:b]
                    ## get graph of all overlapping matching edges
                    overmatch_graph = overlapMatch(rs, ri, match, read_length, template_length)
                    ## and all edges that fail to overlap
                    for i, j in zip(*np.where(np.less_equal.outer(rs + read_length, rs))):
                        overmatch_graph[i].add(j)
                    ## compute bipartite matching
                    M, A, B = bipartiteMatch(overmatch_graph)
                    ## and take difference with vertex set for minimum cover
                    mc = len(rs) - len(M)
                    ## update mincover to reflect maximum value
                    min_cover = max(mc, min_cover)
                ## and output the results
                info = [template_length, read_length, flip_rate, template_count, coverage, min_cover]
                print(tabprint(info))
                sys.stdout.flush()

many_windows_infinite_coverage
Performs simulations testing the recovery of template count and assembly
over a template of many read lengths
for a range of flip_rates, read_lengths, and template_counts
assuming exhaustive coverage

#!/data/software/local/bin/python
import sys
import numpy as np
from support_functions import getRandomPatterns, templatesToWindows, \
    generateCompleteDAG_trimmed, tabprint
from hopcroft_karp import bipartiteMatch

## input parameters
read_lengths = [10, 20, 30]
template_factors = (2**np.arange(1, 5))
flip_rates = np.linspace(0, 0.35, 8)[1:]
template_counts = 2**np.arange(1, 11)
seed = hash("This is not a random seed.")

## fixes a random seed
np.random.seed(seed)

headings = ['L', 'R', 'ρ', 'T', 'perfect_paths', 'min_cover']
print("\t".join(headings))

for read_length in read_lengths:
    ## for each read length, flip rate and number of templates
    for template_factor in template_factors:
        template_length = read_length*template_factor
        for flip_rate in flip_rates:
            for template_count in template_counts:
                ## generate template patterns
                templates = getRandomPatterns(template_count, template_length, flip_rate)
                ## generate reads as binary representations
                read = np.array(templatesToWindows(templates, read_length))
                ## generate overlaps (read_length-1 windows)
                overlap = np.array(templatesToWindows(templates, read_length-1))
                ## counts the number of overlap collisions for each template at each window
                overlap_count = np.array([np.sum(np.equal.outer(x, x), 0) for x in overlap.T]).T
                ## number of overlap collisions in each template
                collapse_count = np.sum(overlap_count > 1, axis=1)
                ## the number of unambiguous paths
                ## (those with no overlap collisions at any position)
                perfect_paths = np.sum(collapse_count == 0)
                ## generate bipartite graph from reads and overlaps (trimmed of simple nodes)
                graph = generateCompleteDAG_trimmed(read, overlap)
                ## and determine maximal bipartite match to get minimum vertex cover
                M, A, B = bipartiteMatch(graph)
                ## the size of the minimum cover
                ## equals the number of items lacking partners in the match
                min_cover = len(graph) - len(M)
                ## and report
                info = [template_length, read_length, flip_rate, template_count, perfect_paths, min_cover]
                print(tabprint(info))
                sys.stdout.flush()
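The programs above obtain the minimum cover as the number of vertices minus the size of a maximum bipartite matching, using a bipartiteMatch routine from a hopcroft_karp module that is not listed here. The sketch below shows the same relationship on a toy graph with a plain augmenting-path matching standing in for that routine; the function names and the toy graph are assumptions.

```python
def max_bipartite_matching(graph):
    """graph: dict mapping each source vertex to an iterable of targets.

    Returns a dict target -> source describing a maximum matching, found
    with simple augmenting paths (a stand-in for Hopcroft-Karp).
    """
    match_right = {}

    def try_augment(u, seen):
        for v in graph.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # take v if it is free, or if its current partner can move
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in list(graph):
        try_augment(u, set())
    return match_right

def min_cover_size(vertices, dag):
    """Number of paths needed to cover all vertices: |V| minus matching size."""
    return len(vertices) - len(max_bipartite_matching(dag))

# two templates of three reads each that share the same middle read m
vertices = ["a1", "b1", "m", "a3", "b3"]
primary = {"a1": ["m"], "b1": ["m"], "m": ["a3", "b3"]}
closed = {"a1": ["m", "a3", "b3"], "b1": ["m", "a3", "b3"], "m": ["a3", "b3"]}
disjoint_cover = min_cover_size(vertices, primary)
shared_cover = min_cover_size(vertices, closed)
```

On the primary graph three vertex-disjoint paths are needed, whereas on the transitively closed graph, where paths may pass through the shared read, two suffice, matching the two underlying templates.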
support_functions
A collection of functions common to many of the simulation tests.
Includes methods for generating random template patterns
and random reads from template patterns
as well as some of the directed graph functions

import numpy as np
import heapq
import bisect
from collections import defaultdict, Counter

def tabprint(A):
    ''' returns all elements of A, converted into strings and joined by tabs. '''
    return "\t".join(map(str, A))

def toBinary(X):
    '''converts a bool array to its binary integer (or long) representation'''
    return int((len(X) * '%d') % tuple(X[::-1]), 2)

def bitCount(int_type):
    '''returns the number of bits that are on in the word'''
    return bin(int_type).count("1")

def toBitCount(inputArray):
    '''
    takes an array of binary ints and
    returns an array of the same size containing the bitcount
    '''
    ans = np.zeros_like(inputArray)
    for index, row in enumerate(inputArray):
        ans[index] = [bitCount(x) for x in row]
    return ans

def getRandomPatterns(number_of_templates, length_of_template, probability_of_flip):
    '''generate random bit patterns'''
    return np.random.random(size=(number_of_templates, length_of_template)) < probability_of_flip

def templateToWindows(template, window_size):
    ''' convert a template pattern into its sequence of window_size subwords '''
    max_start = len(template) - window_size + 1
    return [toBinary(template[i:(i+window_size)]) for i in range(max_start)]

def templatesToWindows(templates, window_size):
    ''' convert a set of template patterns into their sequence of window_size subwords '''
    return [templateToWindows(template, window_size) for template in templates]

def getMatch(templates, R, marks=None):
    '''
    function to determine exact agreement between all pairs of templates
    starting at a position and extending up to some length,
    in particular:
    shape: template, template, position, window_size
    and match[T1, T2, P, W] returns the truth value for the statement
    T1 and T2 match for every position on the window P:(P+W)
    also set to accept a set of special marks,
    marks is None or a dictionary
    key is a position and marks[pos] are the mark values for each template in order.
    for example, if each of four templates has a unique mark at pos 10:
    marks[10] = [0,1,2,3]
    if there are two biallelic loci Aa and Bb with true phase AB, ab, might have:
    marks[10] = [A, a, a, A]
    marks[70] = [B, b, b, B]
    '''
    T, L = templates.shape
    ## does a pair t1, t2 mismatch at a position p?
    mismatch_pos = np.zeros(shape=(T, T, L), dtype='bool')
    for i in range(T):
        mismatch_pos[i] = np.logical_xor(templates[i], templates)
    ## if there are special marks:
    if marks is not None:
        ## then at each position:
        for pos in marks:
            markers = marks[pos]
            ## put a mismatch wherever markers disagree
            np.logical_or(mismatch_pos[:, :, pos],
                          np.not_equal.outer(markers, markers),
                          out=mismatch_pos[:, :, pos])
    ## does a pair t1, t2 mismatch on a window p:(p+w)?
    mismatch = np.zeros(shape=(T, T, L, R+1), dtype='bool')
    for k in range(0, R):
        toEnd = L-k
        np.logical_or(mismatch[:, :, :toEnd, k], mismatch_pos[:, :, k:],
                      out=mismatch[:, :, :toEnd, k+1])
    return np.logical_not(mismatch)

def getWordSpace(match):
    '''
    At each position, converts from template index to word index.
    if each word is unique at a position, then they are identical,
    otherwise, each word is assigned its lowest index template.
    This is necessary for handling collisions between reads
    for indexing graph vertices.
    '''
    template_count = match.shape[0]  ## number of templates
    read_length = match.shape[3] - 1  ## length of read
    max_start = match.shape[2] - read_length + 1  ## maximum read start position
    word_space = np.zeros(shape=(template_count, max_start), dtype=int)
    ## for each possible start position
    for pos in range(max_start):
        ## for each template
        for t in range(template_count):
            ## set word space to the least index with a match
            x = match[t, :, pos, -1]
            word_space[t, pos] = np.argmax(x)
    return word_space

def getOneCount(templates, read_length, dtype='int8'):
    '''
    Number of bits flipped to '1' for every template, every position, and every window up to the read length.
    Has shape: template, position, window_size
    and one_count[T, P, W] returns the number of flipped positions in the window P:(P+W) in template T.
    '''
    T, L = templates.shape
    one_count = np.zeros(shape=(T, L, read_length+1), dtype=dtype)
    ## computed iteratively
    for k in range(0, read_length):
        toEnd = L-k
        one_count[:, :toEnd, k+1] = one_count[:, :toEnd, k] + templates[:, k:]
    return one_count

def getScoreTable(read_length, flip_rate):
    '''
    Assigns a value to an edge based on the probability that it is accidental.
    These values depend on the number of ones, the length of the word,
    and the probability of a one.
    This function generates a lookup table that keeps the values for:
    length of overlap, number of ones for a fixed flip_rate.
    '''
    logp = np.log(flip_rate)
    logq = np.log(1-flip_rate)
    M = np.arange(read_length+1)[:, np.newaxis]
    K = np.arange(read_length+1)[np.newaxis, :]
    score_table = K*logp + (M-K)*logq
    return score_table

def generateCompleteDAG(read, overlap):
    '''
    function generates the enhanced directed acyclic graph
    assuming perfect data -- all reads and all overlaps are known.
    The primary DAG has vertices for each read
    and an edge from A to B if pos(B) = pos(A) + 1
    and the reads agree on the overlap.
    The complete DAG has the same vertices,
    and an edge from A to B if there exists a path from A to B in the primary DAG
    The reason for enhanced DAG is this:
    Applying bipartite matching to the primary DAG gives a minimum vertex-disjoint cover.
    Applying bipartite matching to the enhanced DAG gives a minimum vertex cover.
    '''
    graph = defaultdict(set)
    max_start = read.shape[1] - 2
    ## iterate backwards from the next to last position to the first
    for position in range(max_start, -1, -1):
        ## compute which overlaps agree
        overlap_match = np.equal.outer(overlap[:, position+1], overlap[:, position+1])
        ## for all pairs that share an overlap
        for x, y in zip(*np.where(overlap_match)):
            source = (position, read[x, position])  ## define source
            target = (position+1, read[y, position+1])  ## and target
            ## to the source, add the target (and the target's targets)
            graph[source].update(graph[target])
            graph[source].add(target)
    return graph

def generateCompleteDAG_trimmed(read, overlap):
    '''
    Similar to the complete DAG
    but built from a primary DAG that is free of simple nodes.
    Y is a simple node if
    for X -> Y -> Z
    there is only one such X and Z and X has no other out-edges.
    Then any path through Y must be from X and to Z
    and any path through X must continue through Z.
    So Y can be removed without changing the vertex cover.
    Repeat until there is no such Y in the graph.
    '''
    graph = defaultdict(set)
    max_start = read.shape[1] - 2
    in_edge = defaultdict(set)
    out_edge = defaultdict(set)
    ## iterate backwards from the next to last position to the first
    for position in range(max_start, -1, -1):
        ## compute which overlaps agree
        overlap_match = np.equal.outer(overlap[:, position+1], overlap[:, position+1])
        ## for all pairs that share an overlap
        for x, y in zip(*np.where(overlap_match)):
            source = (position, read[x, position])  ## define source
            target = (position+1, read[y, position+1])  ## and target
            out_edge[source].add(target)
            in_edge[target].add(source)
    ## remove simple edges
    ## (iterate over a copy of the keys because entries are deleted in the loop)
    for x in list(in_edge.keys()):
        if len(in_edge[x]) == 1 and len(out_edge[x]) == 1:
            source, target = list(in_edge[x])[0], list(out_edge[x])[0]
            if len(out_edge[source]) == 1:
                out_edge[source].remove(x)
                in_edge[target].remove(x)
                out_edge[source].add(target)
                in_edge[target].add(source)
                del in_edge[x]
                del out_edge[x]
    for x in list(in_edge.keys()):
        if len(in_edge[x]) == 0:
            del in_edge[x]
    for x in list(out_edge.keys()):
        if len(out_edge[x]) == 0:
            del out_edge[x]
    node_list = sorted(out_edge.keys())[::-1]
    ## build graph
    for x in node_list:
        graph[x].update(out_edge[x])
        for y in out_edge[x]:
            graph[x].update(graph[y])
    return graph



def generatePrimaryDAG(read, overlap):
    """
    Generates the primary directed acyclic graph (DAG),
    assuming perfect data (all reads and all overlaps).
    The primary DAG has a vertex for each read
    and an edge from A to B if pos(B) = pos(A) + 1
    and the reads agree on the overlap.
    """
    graph = defaultdict(set)
    max_start = read.shape[1] - 2
    ## iterate backwards from the next to last position to the first
    for position in range(max_start, -1, -1):
        ## compute which overlaps agree
        overlap_match = np.equal.outer(overlap[:, position + 1],
                                       overlap[:, position + 1])
        ## for all pairs that share an overlap
        for x, y in zip(*np.where(overlap_match)):
            source = (position, read[x, position])  ## define source
            target = (position + 1, read[y, position + 1])  ## and target
            graph[source].add(target)
    return graph
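A small worked input may help show how edges arise from agreeing overlaps. This is a sketch under stated assumptions, not the patent's data: it assumes read[t, p] is an integer label for the pattern of the read from template t at position p, and overlap[t, p] is an integer label for the pattern shared by the reads at positions p - 1 and p. The function name primary_dag and the toy arrays are hypothetical.

```python
from collections import defaultdict

import numpy as np


def primary_dag(read, overlap):
    """Edge from (p, read A) to (p + 1, read B) when the overlap labels agree."""
    graph = defaultdict(set)
    for position in range(read.shape[1] - 2, -1, -1):
        # Pairwise comparison of overlap labels across all templates.
        match = np.equal.outer(overlap[:, position + 1],
                               overlap[:, position + 1])
        for x, y in zip(*np.where(match)):
            source = (position, int(read[x, position]))
            target = (position + 1, int(read[y, position + 1]))
            graph[source].add(target)
    return graph


# Two templates, three read positions (toy labels, assumed for illustration).
read = np.array([[10, 11, 12],
                 [20, 21, 12]])
overlap = np.array([[0, 5, 7],
                    [0, 6, 7]])
dag = primary_dag(read, overlap)
```

Here the two templates differ at positions 0 and 1 but share the read labelled 12 at position 2, so both (1, 11) and (1, 21) gain an edge to the single vertex (2, 12).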


def overlapMatch(read_start, read_index, match, read_length, template_length):
    """
    Assumes that read_start is sorted (safe in this code).
    Uses bisect to make an index for the read_start positions.
    Then, iterating over each position, checks all overlaps in blocks
    to take advantage of vector operations.
    Returns, for each read, the set of reads that overlap it
    and agree on that overlap.
    """
    max_start = template_length - read_length + 1
    ## number of reads
    num_nodes = len(read_start)
    ## index into the start positions (assumed sorted)
    index_to_start = np.array(
        [bisect.bisect_left(read_start, x) for x in np.arange(template_length)]
        + [num_nodes])
    overmatch_graph = defaultdict(set)
    ## for each read start position in the template
    for a_pos in np.arange(max_start):
        ## get the upper and lower index bounds for a_pos
        low, high = index_to_start[a_pos:(a_pos + 2)]
        ## get the upper index bound for possible overlaps
        end = index_to_start[a_pos + read_length]
        ## a is everything at this position (a_pos)
        ## b is everything from the next position on,
        ## that would overlap reads at this position.
        a_template = read_index[low:high]
        b_template = read_index[high:end]
        b_pos = read_start[high:end]
        overlap_length = read_length - b_pos + a_pos
        hasmatch = match[:, b_template, b_pos, overlap_length][a_template]
        ## anything that has a match goes into the graph
        for i, j in zip(*np.nonzero(hasmatch)):
            overmatch_graph[low + i].add(high + j)
    return overmatch_graph
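The bisect-based index is the part of this function that is easiest to check by hand. The sketch below isolates it on toy input; the start positions, read length, template length, and the helper name reads_at_and_overlapping are assumptions made for illustration. Entry p of the index is the position in the sorted list of the first read starting at or after p, so consecutive entries bracket the reads at one position.

```python
import bisect

import numpy as np

# Hypothetical toy input: sorted start positions of six reads.
read_start = [0, 0, 1, 3, 3, 4]
read_length = 3
template_length = 7
num_nodes = len(read_start)

# index_to_start[p] = index of the first read with start >= p;
# the extra final entry closes the last interval.
index_to_start = np.array(
    [bisect.bisect_left(read_start, x) for x in range(template_length)]
    + [num_nodes]
)


def reads_at_and_overlapping(a_pos):
    """Return (reads starting at a_pos, later reads that overlap them)."""
    low, high = index_to_start[a_pos:(a_pos + 2)]
    end = index_to_start[a_pos + read_length]
    return list(range(low, high)), list(range(high, end))


at_zero = reads_at_and_overlapping(0)
at_one = reads_at_and_overlapping(1)
```

With a read length of 3, the reads starting at position 0 can only overlap the read starting at position 1, while the read starting at position 1 overlaps the two reads starting at position 3.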


def pathScore(edge_in, edge_out, read_start, read_index, left_edge, right_edge,
              read_tracker):
    """
    Determine the number of exactly matched tag pairs.
    Of those, how many are tags from the same original template ("true matches")?
    Of the true matches, how many admit a "true path"
    in which all the reads on the path could have derived from the same initial template?
    True matched pairs that admit true paths
    are well-assembled templates in which all template information,
    such as SNP phasing or intron exclusions,
    is recoverable from mutation pattern assembly alone.
    A true match with no true path correctly pairs the ends,
    but may have some errors in between.
    The function expects that the read_starts are sorted,
    so that out_edges always point to elements lower in the list.
    """
    num_nodes = len(edge_in)
    in_count = np.array([len(x) for x in edge_in])  ## number in
    out_count = np.array([len(x) for x in edge_out])  ## number out
    path_start = np.where(in_count == 0)[0]  ## starts have no in
    path_end = np.where(out_count == 0)[0]  ## ends have no out
    ## for each path start, add marks that are to the left of the edge
    ## i.e. contain the left marker
    start_marks = [set() for _ in range(num_nodes)]
    for pstart in path_start:
        if read_start[pstart] < left_edge:
            start_marks[pstart].add(pstart)
    ## propagate through the graph
    for source in range(num_nodes):
        for target in edge_out[source]:
            start_marks[target].update(start_marks[source])
    ## and out the end,
    ## track how many times a start node is seen at an end node
    ## (beyond the right_edge, i.e. containing a right mark)
    start_counter = Counter()
    for pend in path_end:
        if read_start[pend] > right_edge:
            for pstart in start_marks[pend]:
                start_counter[pstart] += 1

    def hasTruePath(tindex, node, pend):
        """
        Determines if there is a true path from node to pend,
        i.e. a path that traverses nodes supported by reads that are in the same template.
        Some nodes may be referred to by multiple reads from different templates --
        if the mutation patterns are degenerate for more than a read length.
        """
        if node == pend:
            return True
        elif tindex not in read_tracker[read_start[node], read_index[node]]:
            return False
        else:
            for next_node in edge_out[node]:
                if hasTruePath(tindex, next_node, pend):
                    return True
            return False

    ## now count up true and false exact matches and true paths
    true_exact_match = 0
    false_exact_match = 0
    true_path = 0
    false_path = 0
    ## for each end
    for pend in path_end:
        ## if it is beyond the right edge (contains a right mark)
        if read_start[pend] > right_edge:
            ## does it have exactly one start?
            if len(start_marks[pend]) == 1:
                pstart = list(start_marks[pend])[0]
                ## does that start have just one end?
                if start_counter[pstart] == 1:
                    ## do they agree on template index?
                    if read_index[pstart] == read_index[pend]:
                        ## test for true path
                        tindex = read_index[pstart]
                        has_true_path = hasTruePath(tindex, pstart, pend)
                        if has_true_path:
                            true_path += 1
                        else:
                            false_path += 1
                        true_exact_match += 1
                    else:
                        false_exact_match += 1
    return true_exact_match, false_exact_match, true_path, false_path


def scoredOutEdgeHeaps(read_start, read_index, match, one_count, score_table,
                       read_length, template_length):
    """
    Returns a list of ordered heaps of out edges for each vertex.
    Two reads have an edge if they share an overlap, agree on that overlap
    and are not at the exact same position.
    The score for an edge is the log likelihood of an accidental overlap
    of that length and with that number of flipped bits.
    Used by greedy assembly.
    """
    max_start = template_length - read_length + 1
    ## number of reads
    num_nodes = len(read_start)
    ## index into the start positions (assumed sorted)
    index_to_start = np.array(
        [bisect.bisect_left(read_start, x) for x in np.arange(template_length)]
        + [num_nodes])
    ## for each node, make a heap to store out edges
    outedge_heaps = [[] for _ in range(num_nodes)]
    ## for each read start position in the template
    for a_pos in np.arange(max_start):
        ## get the upper and lower index bounds for a_pos
        low, high = index_to_start[a_pos:(a_pos + 2)]
        ## get the upper index bound for possible overlaps
        end = index_to_start[a_pos + read_length]
        ## a is everything at this position (a_pos)
        ## b is everything from the next position on,
        ## that would overlap reads at this position.
        a_template = read_index[low:high]
        b_template = read_index[high:end]
        b_pos = read_start[high:end]
        overlap_length = read_length - b_pos + a_pos
        hasmatch = match[:, b_template, b_pos, overlap_length][a_template]
        for i, j in zip(*np.nonzero(hasmatch)):
            ## bits of overlap
            bits = overlap_length[j]
            ## how many are on
            flipped = one_count[b_template[j], b_pos[j], bits]
            ## lookup score
            score = score_table[bits, flipped]
            ## push into heap
            heapq.heappush(outedge_heaps[low + i], (score, high + j))
    return outedge_heaps


def greedyAssembly(read_start, read_index, match, one_count,
                   score_table, read_length, template_length):
    """
    Returns the edges of a greedy assembly.
    Iterate through the edges in order of least likely read coincidence.
    Join an edge A to B if neither A nor B already have an out (in) partner
    with a better score.
    """
    num_nodes = len(read_start)
    ## getting edge out heaps
    outedge_heaps = scoredOutEdgeHeaps(read_start, read_index, match, one_count,
                                       score_table, read_length, template_length)
    ## keep the best in edge score for each vertex
    inscore = np.zeros(num_nodes, dtype=float)
    ## list of lists for storing edges
    edge_in = [list() for _ in range(num_nodes)]
    edge_out = [list() for _ in range(num_nodes)]
    ## build a heap for the vertices
    vertex_heap = []
    for i in range(num_nodes):
        try:
            ## ordered by the score of their top element
            item = (outedge_heaps[i][0][0], i)
            heapq.heappush(vertex_heap, item)
        except IndexError:
            ## should it exist (a vertex with no out edges has an empty heap)
            continue
    ## then start working through the vertex heap
    while len(vertex_heap) > 0:
        ## find the out vertex with the best score
        score, out_vertex = heapq.heappop(vertex_heap)
        outheap = outedge_heaps[out_vertex]
        ## and then pop its heap to get all its partners with that best score
        foundPartner = False
        while len(outedge_heaps[out_vertex]) > 0:
            if outheap[0][0] == score:  ## if best element has score
                ## pop it off the heap
                in_score, in_vertex = heapq.heappop(outheap)
                ## check if it is a good score for the in vertex
                if score <= inscore[in_vertex]:
                    foundPartner = True  ## found a partner
                    edge_in[in_vertex].append(out_vertex)  ## add the edges
                    edge_out[out_vertex].append(in_vertex)
                    inscore[in_vertex] = score  ## and record the score
            else:
                break
        ## if the vertex did not find a partner, it returns to the vertex heap
        ## with its best score.
        if not foundPartner:
            try:
                item = (outedge_heaps[out_vertex][0][0], out_vertex)
                heapq.heappush(vertex_heap, item)
            except IndexError:
                continue
    return edge_in, edge_out, inscore

1. McCloskey ML, Stoger R, Hansen RS, & Laird CD (2007) Encoding PCR products with batch-stamps and barcodes. Biochemical genetics 45(11-12):761-767.
2. Miner BE, Stoger RJ, Burden AF, Laird CD, & Hansen RS (2004) Molecular barcodes detect redundancy and contamination in hairpin-bisulfite PCR. Nucleic acids research 32(17):e135.
3. Casbon JA, Osborne RJ, Brenner S, & Lichtenstein CP (2011) A method for counting PCR template molecules with application to next-generation sequencing. Nucleic acids research 39(12):e81.
4. Fu GK, Hu J, Wang PH, & Fodor SP (2011) Counting individual DNA molecules by the stochastic attachment of diverse labels. Proceedings of the National Academy of Sciences of the United States of America 108(22):9026-9031.
5. Islam S, et al. (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nature methods 11(2):163-166.
6. Jabara CB, Jones CD, Roach J, Anderson JA, & Swanstrom R (2011) Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID. Proceedings of the National Academy of Sciences of the United States of America 108(50):20166-20171.
7. Kivioja T, et al. (2012) Counting absolute numbers of molecules using unique molecular identifiers. Nature methods 9(1):72-74.
8. Hiatt JB, Pritchard CC, Salipante SJ, O'Roak BJ, & Shendure J (2013) Single molecule molecular inversion probes for targeted, high-accuracy detection of low-frequency variation. Genome research 23(5):843-854.
9. Kinde I, Wu J, Papadopoulos N, Kinzler KW, & Vogelstein B (2011) Detection and quantification of rare mutations with massively parallel sequencing. Proceedings of the National Academy of Sciences of the United States of America 108(23):9530-9535.
10. Schmitt MW, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proceedings of the National Academy of Sciences of the United States of America 109(36):14508-14513.
11. Hicks J, Navin N, Troge J, Wang Z, & Wigler M (2012) Varietal counting of nucleic acids for obtaining genomic copy number information. (Patent)
12. Keith JM, et al. (2004a) Algorithms for sequence analysis via mutagenesis. Bioinformatics 20(15):2401-2410.
13. Keith JM, et al. (2004b) Unlocking hidden genomic sequence. Nucleic acids research 32(3):e35.
14. Mitchelson KR (2011) Sequencing of difficult DNA Regions by SAM sequencing. Methods in molecular biology 687:75-88.
15. Sipos B, Massingham T, Stutz AM, & Goldman N (2012) An improved protocol for sequencing of repetitive genomic regions and structural variations using mutagenesis and next generation sequencing. PloS one 7(8):e43359.
16. Shortle D & Botstein D (1983) Directed mutagenesis with sodium bisulfite. Methods in enzymology 100:457-468.
17. Dilworth RP (1950) A Decomposition Theorem for Partially Ordered Sets. Ann Math 51(1):161-166.
18. Konig D (1931) Graphen und matrizen. Mat. Fiz. Lapok 38:116-119.
19. Hopcroft J & Karp R (1973) An n^(5/2) Algorithm for Maximum Matchings in Bipartite Graphs. SIAM Journal on Computing 2(4):225-231.
20. Alberts B (1994) Molecular biology of the cell (Garland Pub., New York) 3rd Ed pp xliii, 1294, 1267 p.
21. Narzisi G & Mishra B (2011) Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons. Bioinformatics 27(2):153-160.
22. Bransteitter R, Pham P, Scharff MD, & Goodman MF (2003) Activation-induced cytidine deaminase deaminates deoxycytidine on single-stranded DNA but requires the action of RNase. Proceedings of the National Academy of Sciences of the United States of America 100(7):4102-4107.
23. Krueger F & Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27(11):1571-1572.
24. Otto C, Stadler PF, & Hoffmann S (2012) Fast and sensitive mapping of bisulfite-treated sequencing data. Bioinformatics 28(13):1698-1704.
25. Krueger F, Kreck B, Franke A, & Andrews SR (2012) DNA methylome analysis using short bisulfite sequencing data. Nature methods 9(2):145-151.

Claims

What is claimed:
1. A method for determining the number of nucleic acid molecules
(NAMs) in a group of NAMs, comprising
i) obtaining an amplified and mutagenized group of NAMs that was produced by
a. subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90% thus forming a group of mutagenized NAMs (mNAMs), and
b. amplifying the group of mNAMs;
ii) obtaining sequences of the mNAMs in the group of amplified mNAMs; and
iii) counting the number of different sequences obtained in step (ii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
2. The method of claim 1, wherein obtaining sequences in step (ii) comprises obtaining composite sequences produced by assembling sequence reads of the mNAMs by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
b) mapping the composite sequences; and
wherein step (iii) comprises counting the number of jointly overlapping different composite sequences obtained in step (ii).
3. A method for determining the number of nucleic acid molecules (NAMs) in a group of NAMs, comprising
i) subjecting the group of NAMs to a chemical mutagenesis which mutates only select nucleic acid bases in the group of NAMs at a rate of 10% to 90%, to produce a group of mutagenized NAMs (mNAMs);
ii) amplifying the group of mNAMs;
iii) sequencing the mNAMs in the group of amplified mNAMs to obtain sequences of the mNAMs; and
iv) counting the number of different sequences obtained in step (iii) to determine the number of unique mNAMs in the group of mNAMs,
thereby determining the number of NAMs in the group of NAMs.
4. The method of claim 3, wherein the sequencing in step (iii) comprises assembling sequence reads of the mNAMs into composite sequences by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, thereby obtaining composite sequences, and
b) mapping the composite sequences; and
wherein step (iv) comprises counting the number of jointly overlapping different composite sequences obtained in step
5. The method of claim 1, wherein a sub-group of NAMs in the group of NAMs is determined to have substantially the same nucleotide sequence, wherein the sub-group of NAMs is determined to have nucleotide sequences that are at least 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.9, or 99.9% identical; or wherein the nucleotide sequences of a sub-group of NAMs comprise a stretch of consecutive nucleotides having a sequence which includes at least two mutable positions and is i) identical to the sequence of a stretch of consecutive nucleotides within another NAM within the sub-group of NAMs, or ii) determined to have at least 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.9, or 99.9% identical to the sequence of a stretch of consecutive nucleotides within another NAM within the sub-group of NAMs.
6. The method of claim 1, wherein the counting comprises counting the number of different sequences that are determined to have substantially the same sequence except for their mutable positions, thereby determining the number of NAMs in the group of NAMs that had substantially the same sequence; or wherein the counting comprises counting the number of different sequences which lack substantially the same sequence in any stretch including at least two mutable positions, thereby determining the number of NAMs without substantially the same sequence in the group of NAMs.
7. A method for determining the number of different sequences in a group of nucleic acid molecules (NAMs) that have been mutagenized and then amplified comprising
a) obtaining the group of NAMs that have been mutagenized and then amplified;
b) obtaining sequences of the mutagenized NAMs (mNAMs) in the group of amplified mNAMs; and
c) counting the number of different sequences obtained in step (ii) ,
thereby determining the number of different sequences in the group of amplified mNAMs.
8. A method for sequencing a nucleic acid molecule (NAM) that comprises two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, comprising
i) obtaining two or more copies of the NAM;
ii) subjecting each copy of the NAM in step (i) to a mutagenesis that mutates only select nucleic acid positions in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM) ;
iii) amplifying each of the mcNAMs;
iv) obtaining composite sequences of the mcNAMs that are produced by assembling sequence reads of the amplified mcNAMs, such that when taken together, span as much as possible of the entire length of the NAM, by
a) aligning the sequence reads according to matching mutation patterns in overlaps of the sequence reads, and
b) mapping the composite sequences,
thereby sequencing the NAM.
9. The method of claim 8, wherein each of the two or more copies of the NAM has a unique primer at its 5' end and another unique primer at its 3' end, wherein the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
10. A method for determining genomic copy number information from genomic material, comprising,
i) obtaining segments of the genomic material; and
ii) determining the number of segments of the genomic material according to the method of claim 1,
thereby determining genomic copy number information from genomic material.
11. A method for profiling RNA transcripts, comprising
i) obtaining a group of RNA transcripts;
ii) determining the number of RNA transcripts in the group of RNA transcripts according to the method of claim 1; and
iii) determining the proportionate number of a plurality of RNA transcripts having the same sequence to a second different plurality of RNA transcripts that have the same sequence, thereby determining RNA transcript profile.
12. A method for determining allelic imbalance, comprising
i) obtaining copy number of a first allele;
ii) obtaining copy number of a second allele; and
iii) comparing the copy numbers obtained in steps (i) and (ii) , thereby determining allelic imbalance, wherein the copy number in steps (i) and (ii) is obtained by the method of claim 1.
13. A method for determining genome assembly, comprising
i) obtaining segments of a genome, wherein the segments span the entire length of the genome;
ii) sequencing the segments of the genome according to the method of claim 3;
iii) aligning the sequences obtained in step (ii) according to matching mutation patterns in overlaps of the sequences; and
iv) mapping the sequences,
thereby assembling the genome.
14. A method for determining haplotype assembly, comprising
i) obtaining a group of alleles, wherein the alleles in the group of alleles are located in the same chromosome;
ii) sequencing each allele in the group of alleles according to the method of claim 8, and
iii) comparing the sequences obtained in step (ii) ,
thereby determining haplotype assembly.
15. The method of claim 1, wherein the mutagenesis is by cytosine deamination.
16. The method of claim 1, wherein each mutable position of the NAMs comprises a cytosine (C), wherein the cytosine (C) is mutated to a uracil (U) or a thymine (T).
17. The method of claim 1, wherein each NAM in the group of NAMs has a unique primer at its 5' end and another unique primer at its 3' end.
18. The method of claim 1, further comprising the step of tagging each NAM or copy thereof, wherein the tag lacks a nucleotide that is mutable by the mutagenesis.
19. The method of claim 1, wherein the NAM is within a mixture of DNA or RNA extracted from a cell, wherein the DNA or RNA extracted from the cell has been fragmented.
20. The method of claim 1, wherein the NAM is a DNA molecule, or an RNA molecule.
21. The method of claim 1, wherein one or more NAMs in the group of NAMs has a length of one sequencing read length, or wherein one or more NAMs in the group of NAMs has a length of two or more sequencing read lengths.
22. The method of claim 1, wherein the number of NAMs in the group of NAMs is about 2, 3, 4, 5, 6, 7, 8, 9, 10, or 10-10000, wherein if the number of NAMs in the group of NAMs is greater than 10000, then diluting the group of NAMs.
23. A kit for determining the number of NAMs in a group of NAMs comprising:
a) a mutagen; and
b) instructions for performing mutagenesis resulting in suboptimal mutagenesis,
wherein the mutagen is a bisulfite or a salt thereof, or a deamination agent.
24. The kit of claim 23, further comprising a plurality of unique primers including:
a) a plurality of substantially unique primers suitable for ligation to the 5' of a NAM; and
b) a plurality of substantially unique primers suitable for ligation to the 3' of a NAM,
wherein the substantially unique primers comprise substantially unique tags.
25. A composition of matter comprising a plurality of mutagenized nucleic acid molecules (mNAMs), wherein selected mutable nucleic acid positions in the plurality of mutagenized NAMs (mNAMs) are mutated at a rate of 10% to 90%.
26. A composition of matter derived from sequencing primers has the sequence: ACACTCTTTCCCTACACACGACGCTCTTCCGATC*T (Seq ID p5.mC), wherein the cytosines (C) are methylated, and wherein *T is a phosphorothioated thymine;
27. A composition of matter derived from sequencing primers has the sequence: 5Phos-GATCGGAAGAGCGGTTCAGCAGGAATGCCGA*G (Seq ID p7.mC), wherein the cytosines (C) are methylated, wherein *G is a phosphorothioated guanine, and wherein 5Phos is a 5'-phosphorylated, deoxyuridine-containing anchor-primer.
28. A composition of matter comprising two or more copies of a nucleic acid molecule (NAM) comprising two or more segments having substantially the same sequence, and that has a length of more than one sequencing read, wherein each copy of the NAM has a unique primer at its 5' end and another unique primer at its 3' end, and is subjected to a mutagenesis that mutates each mutable position in the NAMs at a rate of 10% to 90% to produce mutated copies of the NAM (mcNAM), wherein the unique primers of each mcNAM lack a nucleotide that is mutable by the mutagenesis.
29. Sequence information produced by a system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs in the method of claim 1, or the group of mcNAMs of in the method of claim 8.
30. A system including one or more processing units which counts the number of different sequences obtained by a sequencer that processed the group of amplified mNAMs of in the method of claim 1, or the group of mcNAMs of in the method of claim 8.
PCT/US2015/054981 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly WO2016057947A1 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
EP15849374.2A EP3204521B1 (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly
AU2015330685A AU2015330685B2 (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly
US15/515,913 US11008606B2 (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly
EP21176777.7A EP3957742A1 (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly
CA2964169A CA2964169C (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly
IL251509A IL251509B (en) 2014-10-10 2017-04-02 Random nucleotide mutation for nucleotide template counting and assembly
US17/320,634 US20210340604A1 (en) 2014-10-10 2021-05-14 Random nucleotide mutation for nucleotide template counting and assembly

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462062571P 2014-10-10 2014-10-10
US62/062,571 2014-10-10

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US15/515,913 A-371-Of-International US11008606B2 (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly
US17/320,634 Continuation US20210340604A1 (en) 2014-10-10 2021-05-14 Random nucleotide mutation for nucleotide template counting and assembly

Publications (1)

Publication Number Publication Date
WO2016057947A1 true WO2016057947A1 (en) 2016-04-14

Family

ID=55653867

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/054981 WO2016057947A1 (en) 2014-10-10 2015-10-09 Random nucleotide mutation for nucleotide template counting and assembly

Country Status (6)

Country Link
US (2) US11008606B2 (en)
EP (2) EP3204521B1 (en)
AU (1) AU2015330685B2 (en)
CA (1) CA2964169C (en)
IL (1) IL251509B (en)
WO (1) WO2016057947A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020035669A1 (en) * 2018-08-13 2020-02-20 Longas Technologies Pty Ltd Sequencing algorithm
US11421238B2 (en) 2018-02-20 2022-08-23 Longas Technologies Pty Ltd Method for introducing mutations
WO2023039509A1 (en) * 2021-09-10 2023-03-16 Cold Spring Harbor Laboratory Method of measuring microsatellite length variations

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020087076A1 (en) * 2018-10-26 2020-04-30 The Board Of Trustees Of The Leland Stanford Junior University Methods and uses of introducing mutations into genetic material for genome assembly
GB202111184D0 (en) * 2021-08-03 2021-09-15 Hendriks Gerardus Johannes Methods

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004042078A1 (en) * 2002-11-05 2004-05-21 The University Of Queensland Nucleotide sequence analysis by quantification of mutagenesis
US20050266453A1 (en) * 2002-11-01 2005-12-01 Gregory Coia Mutagenesis methods using ribavirin and/or RNA replicases
US20090047680A1 (en) * 2007-08-15 2009-02-19 Si Lok Methods and compositions for high-throughput bisulphite dna-sequencing and utilities
WO2013177086A1 (en) * 2012-05-21 2013-11-28 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
US20130338043A1 (en) * 2012-06-12 2013-12-19 The Johns Hopkins University Efficient, Expansive, User-Defined DNA Mutagenesis
US20140065609A1 (en) * 2010-10-22 2014-03-06 James Hicks Varietal counting of nucleic acids for obtaining genomic copy number information

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5866344A (en) * 1991-11-15 1999-02-02 Board Of Regents, The University Of Texas System Antibody selection methods using cell surface expressed libraries
AU2002219818B2 (en) * 2000-11-20 2007-08-16 Cargill, Incorporated 3-hydroxypropionic acid and other organic compounds
US20070154892A1 (en) * 2005-12-30 2007-07-05 Simon Wain-Hobson Differential amplification of mutant nucleic acids by PCR in a mixure of nucleic acids
EP2351858B1 (en) 2006-02-28 2014-12-31 University of Louisville Research Foundation Detecting fetal chromosomal abnormalities using tandem single nucleotide polymorphisms
US20100184044A1 (en) 2006-02-28 2010-07-22 University Of Louisville Research Foundation Detecting Genetic Abnormalities
CN103620055A (en) 2010-12-07 2014-03-05 利兰·斯坦福青年大学托管委员会 Non-invasive determination of fetal inheritance of parental haplotypes at the genome-wide scale
US9725765B2 (en) 2011-09-09 2017-08-08 The Board Of Trustees Of The Leland Stanford Junior University Methods for obtaining a sequence
US20140377762A1 (en) 2011-12-19 2014-12-25 360 Genomics Ltd. Method for enriching and detection of variant target nucleic acids
US9977861B2 (en) * 2012-07-18 2018-05-22 Illumina Cambridge Limited Methods and systems for determining haplotypes and phasing of haplotypes
US20140066317A1 (en) 2012-09-04 2014-03-06 Guardant Health, Inc. Systems and methods to detect rare mutations and copy number variation
US10253325B2 (en) * 2012-12-19 2019-04-09 Boston Medical Center Corporation Methods for elevating fat/oil content in plants
ES2815684T3 (en) * 2015-06-10 2021-03-30 Biocartis N V Improved detection of methylated DNA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEVY ET AL.: "Facilitated sequence counting and assembly by template mutagenesis", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 111, no. 43, 13 October 2014 (2014-10-13), pages E4632 - E4637, XP055428445 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11421238B2 (en) 2018-02-20 2022-08-23 Longas Technologies Pty Ltd Method for introducing mutations
WO2020035669A1 (en) * 2018-08-13 2020-02-20 Longas Technologies Pty Ltd Sequencing algorithm
US20210174905A1 (en) * 2018-08-13 2021-06-10 Longas Technologies Pty Ltd. Sequencing Algorithm
CN113015813A (en) * 2018-08-13 2021-06-22 朗斯科技有限公司 Sequencing algorithm
EP4293123A3 (en) * 2018-08-13 2024-01-17 Illumina Singapore PTE. Ltd. Sequencing algorithm
WO2023039509A1 (en) * 2021-09-10 2023-03-16 Cold Spring Harbor Laboratory Method of measuring microsatellite length variations

Also Published As

Publication number Publication date
US11008606B2 (en) 2021-05-18
CA2964169A1 (en) 2016-04-14
US20210340604A1 (en) 2021-11-04
AU2015330685A1 (en) 2017-04-20
US20170306392A1 (en) 2017-10-26
IL251509A0 (en) 2017-05-29
CA2964169C (en) 2023-09-19
EP3204521A1 (en) 2017-08-16
EP3957742A1 (en) 2022-02-23
EP3204521B1 (en) 2021-06-02
AU2015330685B2 (en) 2022-02-17
IL251509B (en) 2021-04-29
EP3204521A4 (en) 2018-03-21

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 15849374; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase. Ref document number: 251509; Country of ref document: IL
ENP Entry into the national phase. Ref document number: 2964169; Country of ref document: CA
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2015330685; Country of ref document: AU; Date of ref document: 20151009; Kind code of ref document: A
REEP Request for entry into the european phase. Ref document number: 2015849374; Country of ref document: EP
WWE Wipo information: entry into national phase. Ref document number: 2015849374; Country of ref document: EP