WO2008134867A1

WO2008134867A1 - Methods, kits, and systems for nucleic acid sequencing by hybridization

Info

Publication number: WO2008134867A1
Application number: PCT/CA2008/000828
Authority: WO
Inventors: Arno Pihlak; Goran Bauren; Ellef Hersoug; Peter Lonnerberg; Ats Metsis; Johanna Sagemark; Sten Linnarsson
Original assignee: Genizon Biosciences Inc.
Priority date: 2007-05-04
Filing date: 2008-05-06
Publication date: 2008-11-13

Abstract

The present invention provides a DNA sequencing method based on hybridization of a universal panel of tiling probes. In various embodiments, millions of shotgun fragments are amplified in situ on a solid support using rolling circle amplification and then subjected to sequential hybridization with short fluorescent probes. Long reads ensure unique placement even in large genomes. The sequencing chemistry is simple, enzyme-free and consumes only dilute solutions of the probes, resulting in an order of magnitude reduction in sequencing cost, and a substantial increase in speed. As exemplified herein, a prototype instrument based on commonly available equipment was used to resequence the Bacteriophage λ and E. coli genomes to better than 99.9% accuracy at a raw throughput of 320 Mbp/day. The present invention further provides kits and systems for sequencing by hybridization.

Description

METHODS, KITS, AND SYSTEMS FOR NUCLEIC ACID SEQUENCING BY HYBRIDIZATION

[0001] This application claims priority to U.S. Provisional Application No. 60/924,245 filed May 4, 2007, which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

[0002] The present invention relates to nucleic acid sequencing by hybridization, and is related to the sequencing methods disclosed in PCT/EP2005/002870 (corresponding to WO 2005/093094), the entire disclosure of which is hereby incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

[0003] Direct nucleic acid sequencing is one of the most valuable tools for genomic research, including not only de novo nucleic acid sequence determination, but also individual genotyping and gene expression analysis. If efficient nucleic acid sequencing methods were available, a model species could be sequenced, individuals could be genotyped by whole-genome sequencing, and RNA populations could be exhaustively analyzed after conversion to cDNA. Advances in sequencing technology might also improve, or render more efficient, studies in epigenomics (e.g., methylated cytosines could be identified by bisulfite conversion of unmethylated cytosine to uridine), protein-protein interactions (e.g., by sequencing hits obtained in a yeast two-hybrid experiment), and protein-DNA interactions (e.g., by sequencing DNA fragments obtained after chromosome immunoprecipitation), among others.

[0004] A living cell contains about 300,000 copies of messenger RNA, each about 2,000 bases long on average. To completely sequence the RNA in even a single cell, 600 million nucleotides must be analyzed. In a complex tissue composed of dozens of different cell types, the task becomes even more difficult as cell-type specific transcripts become diluted. Gigabase daily throughput will be required to meet these demands. The following table shows some estimates on the throughput required for various sequencing projects (numbers are for human sequencing, unless otherwise indicated):

Table 1

[0005] Thus efficient methods for nucleic acid sequencing are needed.

[0006] A number of different sequencing technologies have been developed. These include, among others, Sanger sequencing, sequencing-by-synthesis, and sequencing-by-hybridization.

[0007] Sanger sequencing (Sanger et al., PNAS 74 no. 12: 5463-5467, 1977) relies on the physical separation of a large number of fragments corresponding to each base position of the template and is thus not readily scalable to ultra-high throughput sequencing.

[0008] Various approaches have been designed for sequencing by synthesis (SBS), which involve either detecting a byproduct released from incorporated nucleotides, or detecting a permanently attached label. Technologies involving sequencing-by-synthesis are often limited to short read lengths and relatively low throughput.

[0009] Sequencing-by-hybridization (SBH) involves hybridizing a panel of probes to template sequences to reconstruct the nucleic acid sequence of the template. However, reconstructing the template sequence from the hybridization data can be complex, and the efficiency of the method is impacted by hybridization kinetics and the sequential nature of the protocols.

[0010] Improved and more efficient sequencing methods, such as those providing long read lengths, high throughput, accuracy, and/or lower sequencing costs, are desirable for more efficient and improved genomic and biological analyses.

SUMMARY OF THE INVENTION

[0011] The present invention relates to methods for sequencing by hybridization, including techniques and analytic tools for sequence analysis, as well as probes, probe sets, systems (e.g., a sequencing apparatus), and kits for sequencing. The present invention allows for automation of a vast sequencing effort, using only standard bench-top equipment that is readily available in the art.

[0012] In one aspect, the invention provides a nucleic acid sequencing method. The method generally comprises hybridizing a panel of labeled probes (e.g., fluorescently-labeled probes) to an array of DNA molecules, where each DNA molecule in the array comprises a single-stranded fragment of a target sequence to be determined. The panel of probes is a universal panel of probes, as described herein, that reduces redundancy and maximizes match/mismatch discrimination during hybridization at a relatively uniform temperature, thus allowing hybridization data to be obtained efficiently. Further, in accordance with this aspect, hybridization between target fragments and probe molecules is quantified, the fragments placed within one or more reference sequences, and the identity of each base in the target sequence called using analytical tools described herein. The method of the invention provides for long read lengths, high throughput, high accuracy, and/or low sequencing costs, as compared to other available sequencing methods.

[0013] In another aspect, the present invention provides a universal panel of labeled probes (e.g., fluorescently-labeled probes) for sequencing by hybridization. The universal panel of probes is designed for efficient sequencing by hybridization by reducing redundancy, and maximizing match/mismatch discrimination during hybridization at a relatively uniform temperature.

[0014] In a third aspect, the present invention provides kits, systems, and a sequencing apparatus, for performing the methods of the invention.

BRIEF DESCRIPTION OF THE FIGURES

[0015] Figure 1 shows a massively parallel DNA display platform based on in situ rolling-circle amplification (RCA). (a) Single-stranded closed circular DNA templates prepared from target DNA, with each circle carrying a universal linker sequence (thick line) and an insert fragment (thin line). Circles are annealed to covalently bound primer on the surface of a microarray slide, then amplified in situ using phi-29 polymerase. (b) Epifluorescence microscopy image showing RCA products on the surface of a slide, visualized by hybridization of a Cy3-labeled universal probe targeting the linker sequence. Scale bar is 10 μm. (c) A series of specific probe hybridizations tracking four individual RCA products over six hybridization cycles (UNIP, universal probe; BLANK, buffer-only control hybridization), and showing detection of four pentamer sub-sequences in individual single-molecule features. Note that different probes showed different inherent maximal and minimal signal intensities which are normalized in downstream image processing based on statistics from all features on a slide. (d) Array stability over hundreds of hybridization/imaging cycles was demonstrated by averaging UNIP and BLANK signals obtained between every set of 96 specific probes. After almost 600 cycles, half the initial signal remained and was still well separated from background. The signal decay was not directly due to photobleaching (since each cycle used fresh probe) but rather may reflect a slow degradation or masking of the target DNA, for example through depurination.

[0016] Figure 2. Probe design and characterization. (a) Heptamer probe having two flanking degenerate positions and a 5' Cy3 label. Each probe was designed with two 6FAM-labeled test targets: one perfect match and one carrying a mismatch at the central position. Mismatch nucleotides were selected randomly. (b) The melting point Tm was determined by melting curve analysis, where hybridization was indicated by the appearance of fluorescence resonance energy transfer (FRET) between the 6FAM and Cy3 labels when they were brought in close proximity, detected as a quenching of the 6FAM signal at low temperatures. The figure shows typical match and mismatch melting curves. Dashed lines are overlaid on the raw data for clarity. (c) Histogram showing the distribution of Tm values for all 582 probes. The average Tm was 49.0°C. (d) Histogram showing the distribution of match/mismatch ΔTm. The probes showed good mismatch rejection, with an average ΔTm = 30.4°C, which is probably an underestimate of the true average, since mismatch melting points below zero could not be measured.

[0017] Figure 3. Fragments aligned to the reference genome in the Bacteriophage λ assembly. A composite reference genome was constructed by splicing the 48,502 nt λ genome (accession NC_001416.1) at position 7,000 in the sequence of yeast chromosome 5 (accession NC_001137.2). The total length of the composite genome was 625,371 nucleotides. (a) A plot of the score (in standard deviations from the average score along the composite genome) for each alignment, showing that very few (5%) fragments align outside the lambda genome, and with lower average scores. For clarity, only 10% of the alignments are shown. Only alignments with S.D. > 6 were used in subsequent analyses. (b) Histogram of the genome coverage (number of hits per 200 bp, equivalent to the effective fold-coverage since all fragments were 200 bp long) showing the specificity of the genome alignment for the λ sequence. A few hotspots could be seen in the yeast genome (for example, at position 230K), which may represent sequences of low complexity that tend to attract poor quality fragments.

[0018] Figure 4. Probabilistic basecalling algorithm. (a) The intensity distribution for each probe was split in two components - match and mismatch - here shown for probe CGCAT and denoted CGCAT_1 and CGCAT_0, respectively. Given a genome alignment, and under the reasonable assumption that most sequences would be conserved, the aligned fragments could be separated into those that contained CGCAT and those that did not. The two histograms were converted to probability distributions by normalizing their areas to 1.0. As a result, the likelihood that a given intensity measurement represents a 'match' could be determined by simply looking up the intensity in the CGCAT_1 distribution. (b) For convenience, and to avoid round-off errors, all computations were performed on log-odds, defined as the base-ten logarithm of the ratio of the probabilities given by the histograms in (a). The log-odds for any given intensity measurement gives the (logarithm of the) odds in favor of a probe being 'match' vs. it being 'mismatch', and again could be found by lookup in the log-odds curve. (c) Basecalling was performed by examining one position at a time in the reference genome, and collecting log-odds terms for each probe overlapping that position as indicated. For each possible call, there are five positive terms and five negative terms plus the prior log-odds in favor of a substitution, P = -1.2. By convention, odds were computed against the reference, so that the log-odds for not calling any substitution were always zero.
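The log-odds lookup and the per-position sum described for Figure 4 can be illustrated with a minimal Python sketch. It is not the implementation used in the examples: the histogram binning, the pseudocount, and the `probe_log_odds` dictionary (assumed to hold one already-combined log-odds value per pentamer at the position of interest) are hypothetical simplifications, and reverse-complement handling is ignored.

```python
import math

PRIOR_LOG_ODDS = -1.2  # prior log-odds in favor of a substitution (Figure 4c)
K = 5                  # pentamer probes

def log_odds_curve(match_vals, mismatch_vals, edges, pseudo=0.5):
    """Per-bin log10(P(intensity | match) / P(intensity | mismatch)).

    The pseudocount is an assumption added here to avoid division by zero.
    """
    def normalized_hist(vals):
        counts = [pseudo] * (len(edges) - 1)
        for v in vals:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        total = sum(counts)
        return [c / total for c in counts]  # area normalized to 1.0

    p_match = normalized_hist(match_vals)
    p_mismatch = normalized_hist(mismatch_vals)
    return [math.log10(a / b) for a, b in zip(p_match, p_mismatch)]

def overlapping_kmers(seq, pos, k=K):
    """All k-mers of seq that cover the 0-based index pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]

def substitution_log_odds(ref, pos, alt, probe_log_odds):
    """Log-odds of calling alt at pos, relative to the reference (which scores 0)."""
    if alt == ref[pos]:
        return 0.0
    variant = ref[:pos] + alt + ref[pos + 1:]
    gained = overlapping_kmers(variant, pos)  # positive terms
    lost = overlapping_kmers(ref, pos)        # negative terms
    return (PRIOR_LOG_ODDS
            + sum(probe_log_odds.get(p, 0.0) for p in gained)
            - sum(probe_log_odds.get(p, 0.0) for p in lost))
```

For an interior position there are five gained and five lost pentamers, which gives the five positive and five negative terms mentioned in the caption; a substitution is favored only when the evidence outweighs the negative prior.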

[0019] Figure 5. The depth of coverage along the E. coli chromosome was strongly skewed toward the origin of replication. The plot shows the ten-bin running average depth of coverage in 10 kb bins, normalized to the depth at terminus as indicated by concentric circles. Coverage was lowest near the terminus, which was presumably always haploid, and increased towards the origin in both replichores, reaching an almost diploid level. Note that the data was obtained from a growing bacterial culture and thus represents the average ploidy along the chromosomes of millions of individual dividing cells. In both replichores, the leading strand showed slightly higher coverage all the way from origin to terminus, and the two replichores were separated almost perfectly by the transition points at origin and terminus. The difference may reflect the fact that at any given point, the lagging strand contains RNA primers and nicks generated during the synthesis of Okazaki fragments. The outer ring shows nucleotide positions (M, million nucleotides), arrows indicate the direction of replication for the two replichores, the origin is indicated at position 3,923,882 (midpoint of oriC) and the terminus at 1,588,787 (midpoint of the Dif site).

[0020] Figure 6. Assembly statistics for the E. coli genome. (a) The error rate as a function of fold coverage at individual positions. The first data point at left shows the combined error rate for all positions covered by a single fragment. The accuracy increases rapidly up to about 30-fold coverage and then saturates at an error rate of approximately 10⁻³, suggesting the presence of systematic errors at that frequency. (b) Error rate as a function of quality score. The secondary horizontal axis shows the interim quality score q, taken as the difference in log-odds between the best and second best call at each position. A linear fit (R² = 0.95; indicated by dashed gray line) was used to compute the constant of proportionality, which was then used to convert the interim score into a phred-equivalent standard quality measure Qphred (shown on the primary horizontal axis). The scatterplot only extends to about Q45, whereas half the assembled bases were in Q47 or better. This is because the error rate could not be reliably estimated below that at Q45. (c) Assembly-wide distribution of Qphred scores (shown on left axis; gray markers) and the corresponding cumulative distribution (on right axis; black markers). The median score was Q47, corresponding to an expected error rate of 1/50,000 bases called.
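The correspondence between Q47 and an error rate of about 1/50,000 follows from the standard phred definition, Q = -10 log10(p). A minimal Python sketch of the conversion is given below; the function names are illustrative only.

```python
import math

def phred_to_error(q):
    """Expected error probability for a phred-scaled quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred-scaled quality score for an error probability."""
    return -10 * math.log10(p)

# One error expected per this many bases at the median score of Q47
bases_per_error_q47 = 1 / phred_to_error(47)
```

Evaluating `bases_per_error_q47` gives roughly 50,100, consistent with the figure of 1/50,000 quoted in the caption.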

[0021] Figure 7. Improving the performance of the probe set. (a) Melting points measured individually against match (black) and single mismatch (white) targets, for all 16 oligos comprised in probe NAGTCGN. The nucleotides at the two degenerate positions are indicated on the horizontal axis and the results were sorted by match-Tm. Note the perfect separation by number of AT bases: the four lowest-Tm oligos had two flanking AT bases and the four highest-Tm oligos had none. The shaded rectangle indicates the narrow temperature range available for separating the lowest match-Tm from the highest mismatch-Tm. (b) Same as (a) except the concentrations of individual oligonucleotides were balanced by increasing the concentration of AT-rich oligos and decreasing the concentration of GC-rich oligos. As a result, the target temperature range for good match/mismatch discrimination (shaded rectangle) was substantially wider. In addition, the number of flanking AT bases no longer correlated with Tm. (c) Palindromic probes such as NAGCGCTN showed almost complete loss of hybridization for those fractions capable of forming six-nucleotide perfect hybrids (indicated by arrow 'palindromic'; compare TAGCGCT with TAGCGCA; the latter behaved normally). To resolve the problem, the terminal nucleotide was removed, which restored normal hybridization performance (indicated by arrow 'hexamer').

[0022] Figure 8 is a gel image that shows the result of cleaving a cDNA sample (lane 4) with CviJ* for increasing durations. A gradual reduction in the average fragment length towards 100 bp is observed (100 bp is the lowest fragment of the size standard, lane 3). An optimal cleavage reaction for 100 bp fragments is loaded in lane 1 and fragments around 100 bp are purified.

[0023] Figure 9 shows adapter ligation. Lane 1 is the size marker; lane 2, unligated fragments; lanes 3 and 4, ligated fragments. Most fragments are correctly ligated.

[0024] Figure 10 shows the sample of fragments before (lane 1) and after (lane 2) circularization. Lane 3 shows the result after purification. Notice the absence of linker in lane 3.

[0025] Figure 11 shows a section of approximately 0.8 by 2.4 mm from a random array slide scanned using a Tecan™ LS400 at 4 μm resolution using the 488 nm laser and 6FAM filter. Spots represent amplification products generated from individual circular template molecules.

[0026] Figure 12 shows the stability of short oligonucleotide probes measured by melting point analysis. Figure 12A shows the effect of CTAB in 100 mM tris pH 8.0, 50 mM NaCl. Figure 12B shows the effect of LNA in TaqExpress buffer (GENETIX, UK). Figure 12C shows the specificity of LNA in TaqExpress buffer. Figure 12D shows the effect of introducing degenerate positions: 7-mer with 5 LNA (left), 7-mer with 5 LNA and 2 degenerate positions (middle), 7-mer with 3 LNA and 2 degenerate positions (right).

[0027] Figure 13 shows a FAM-labeled universal 20-mer probe (left panel) and a TAMRA-labeled 7-mer probe (middle), hybridized to a random array and visualized by fluorescence microscopy. The array was synthesized with two templates, both of which should bind the universal probe but only one of which should bind the 7-mer at the sequence CGAACCT. The image was captured using a Nikon DS1QM CCD camera at 20x magnification on a Nikon TE2000 inverted microscope. The right-hand panel shows a color composite, and demonstrates that all TAMRA-labeled features were also FAM-positive, as expected.

DETAILED DESCRIPTION OF THE INVENTION

[0028] The present invention relates to methods for sequencing by hybridization, including techniques and analytic tools for sequence analysis, as well as probes, probe sets, systems, and kits for sequencing. The invention employs hybridization of a universal panel of tiling probes to shotgun fragments from a target sequence, which are prepared by amplification in situ on a solid support using rolling circle amplification. Long read lengths ensure unique placement of shotgun fragments in reference sequences, even when sequencing large genomes. The sequencing chemistry is simple, enzyme-free and consumes only dilute solutions of probes, resulting in an order of magnitude reduction in sequencing cost and a substantial increase in speed. As described herein, a prototype instrument based on commonly available equipment was used to resequence the Bacteriophage λ and E. coli genomes to better than 99.9% accuracy at a raw throughput of 320 Mbp/day.

Sequencing-By-Hybridization

[0029] In one aspect, the present invention provides a method for "shotgun sequencing by hybridization" (shotgun SBH), in which a target sequence is reconstructed from a complete tiling of the target sequence with short probes. The sequencing method of the invention generally comprises hybridizing a panel of labeled probes (e.g., fluorescently-labeled probes) to an array of DNA molecules, where each DNA molecule in the array comprises a fragment of a target sequence to be determined. Generally, the DNA molecules are prepared by rolling circle amplification (RCA) of circular single-stranded molecules, and are randomly immobilized on a solid support. The panel of probes may be a universal panel of probes designed as described herein, and may be hybridized to the array in a sequential, or largely sequential, fashion. At least one, but generally a plurality, of locations of the array are imaged during hybridization with the labeled probes, and hybridization complexes in each image are identified. The signal intensity of each detected hybridization complex is quantified, and the probability of hybridization for each complex (and for each probe) determined, to thereby generate a hybridization spectrum for the various fragments of the target sequence. The positions of these fragments in a reference target sequence are then determined based on the fragments' hybridization spectra (e.g., as compared to expected hybridization spectra for the reference sequence). The probable nucleotide at each position in the target sequence is then determined based on the hybridization spectra of fragments that overlap each nucleotide position.

[0030] The initial target library may be or comprise one or more of an RNA library, an mRNA library, a cDNA library, a genomic DNA library, a plasmid DNA library or a library of DNA molecules.

[0031] Generally, the fragments of the target sequence are such as to allow reliable placement within the reference sequence(s), and to provide sufficiently long read lengths. For example, the fragments of the target sequence may be from about 20 to about 500 nucleotides in length, or may be from about 50 to about 250 nucleotides in length. The fragments may be about 100, about 150, or about 200 nucleotides in length. In certain embodiments, the size of the fragments is fairly uniform, and does not vary by more than about 10, about 20, or about 50 base pairs. Preferably, the initial sequences have the same length within 50% CV, preferably 5-50% CV, preferably within 10% CV, preferably within 5% CV, where the coefficient of variation (CV) of the length distribution is the standard deviation divided by the mean. The initial sequences may have the same length.
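As a brief illustration of the CV criterion, the following Python sketch computes the coefficient of variation of a set of fragment lengths. Using the population standard deviation is an assumption; the text does not specify which estimator is intended.

```python
from statistics import mean, pstdev

def coefficient_of_variation(lengths):
    """Standard deviation of the fragment lengths divided by their mean."""
    return pstdev(lengths) / mean(lengths)

# Hypothetical fragment lengths in nucleotides
example_cv = coefficient_of_variation([190, 200, 210])
```

In this hypothetical example the CV is about 4%, which would fall within the 5% CV preference.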

[0032] The DNA molecules are generally prepared by fragmenting the target sequence, and converting the fragments into single-stranded circular molecules having the fragment as an insert for sequencing. In preferred embodiments, the single-stranded circular molecules are amplified, or replicated, by rolling circle amplification using known techniques, and immobilized or arrayed on a solid support (e.g., a glass slide) for hybridization. Generally, the DNA molecules are arrayed at a high density to support the desired throughput. For example, the array may contain from about 100,000 to about 10 million DNA molecules per cm². The array may be imaged, or various locations of the array may be imaged, where each such location contains from about 200 to about 20,000 DNA molecules. Thus, each image may contain 200 or more, 300 or more, 500 or more, 1000 or more, 2000 or more, 5000 or more, or 10,000 or more arrayed DNA molecules. The array of DNA molecules may be a random array, for example, where the identity of each sample in the array need not be known during hybridization, as the fragments will be assembled during data analysis.

[0033] In exemplary embodiments, a plurality of locations, such as about 100 to about 500 locations, on the array are imaged during hybridization with each labeled probe in the panel. Further, in such embodiments, each imaged location of the array has a surface area of from about 100 μm² to about 10 mm².

[0034] The method of the invention is applicable to target sequences of all sizes, including target sequences of from about 5,000 to about 10 million base pairs in length, which are difficult, time consuming, and/or costly to sequence with other sequencing technologies. In exemplary embodiments, the target sequence is from about 20,000 to about 1 million base pairs in length, and/or may be a viral or bacterial genome, or in certain embodiments a human genome. Where the target sequence is a human or mammalian genome, particular candidate regions of interest may be enriched for sequencing as described herein.

[0035] The array of DNA molecules (e.g., single-stranded circular DNA molecules amplified or replicated by RCA) is hybridized to a universal panel of labeled tiling probes, which is generally applicable for use with all targets. The probes may be fluorescently-labeled, although the invention is compatible with other labeling technologies. The panel of labeled probes further comprises a universal reporter probe that hybridizes to a sequence in all of said DNA molecules. While generally the method employs a single reporter probe sequence, the invention may just as easily employ some combination of reporter sequences.

[0036] The panel may be as described herein, in the second aspect of the invention. For example, the probes may comprise oligonucleotides having an effective specificity (as described herein) of from 3 to 10 bp, such as from 4 to 6 bp. Tables 2 and 3 herein show probe panels based on an effective specificity of 5. In an exemplary embodiment, the labeled tiling probes are each designed as an oligonucleotide hexamer or heptamer, with potentially dimerizing probes prepared as hexamers. For example, each labeled probe in the universal panel may contain a pentamer probe sequence with one or two flanking degenerate nucleotides (thus providing an effective specificity of 5). In accordance with such embodiments, the labeled probes comprise oligonucleotides having the formula 5'-NXXXXXN-3', wherein X is a specified base and N is a degenerate position, with the proviso that heptamer probes having a propensity to dimerize are constructed as hexamers having a single degenerate position.
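To make the 5'-NXXXXXN-3' notation concrete, the following Python sketch enumerates the individual oligonucleotides that make up one degenerate probe. The function name is illustrative; the example probe NAGTCGN is the one discussed for Figure 7.

```python
from itertools import product

BASES = "ACGT"

def expand_degenerate(probe):
    """Return every concrete oligo for a probe in which N marks a degenerate position."""
    choices = [BASES if ch == "N" else ch for ch in probe]
    return ["".join(combo) for combo in product(*choices)]

heptamer_oligos = expand_degenerate("NAGTCGN")  # two degenerate positions
hexamer_oligos = expand_degenerate("NAGCGC")    # one degenerate position
```

A heptamer with two degenerate positions expands to 16 oligos and a hexamer with one degenerate position to 4, while the specified pentamer core in both cases fixes the effective specificity at 5.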

[0037] Because the efficiency of the system can be impacted by the sequential nature of certain embodiments, the universal panel is generally optimized to reduce redundancy. For example, the panel may statistically hybridize to at least 10% of all positions in a target sequence, or may statistically hybridize to at least 25%, at least 50%, or at least 90% of all positions in a target sequence. In an exemplary embodiment, the set of probes hybridizes to 100% of all positions in a target sequence or its reverse complement, such that each position in the target or the reverse complement of the target at that position is hybridized by at least one, or exactly one probe (statistically), in the panel. For example, a preferred panel based on an effective specificity of 5 comprises, or consists essentially of, about ½ of all possible pentamer sequences, e.g., about 512 labeled tiling probes. In such embodiments, the tiling probes may be designed to exclude reverse complementary pentamer sequences (as described more fully herein). Generally, the probe panel contains fewer than about 800, fewer than about 700, or fewer than about 600 tiling probes. Exemplary sets of such probes are shown in Tables 2 and 3, and are described in further detail below.
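The figure of about 512 probes can be checked with a short Python sketch that keeps one member of each pentamer/reverse-complement pair. Choosing the lexicographically smaller member is an arbitrary rule adopted here for illustration; the panels of Tables 2 and 3 may select differently.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]

def nonredundant_pentamers():
    """Keep one pentamer from each reverse-complementary pair."""
    panel = []
    for combo in product("ACGT", repeat=5):
        pentamer = "".join(combo)
        # Illustrative rule: keep the lexicographically smaller member of the pair
        if pentamer <= reverse_complement(pentamer):
            panel.append(pentamer)
    return panel

panel = nonredundant_pentamers()
```

Because a pentamer has odd length it can never equal its own reverse complement, so the 1,024 possible pentamers fall into exactly 512 pairs, and every position in a double-stranded target is covered on one strand or the other.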

[0038] The structure and/or chemical structure of the probes may further be designed to optimize hybridization efficiency, match/mismatch discrimination, and to allow a uniform, or substantially uniform, hybridization temperature (e.g., T_m). For example, in certain embodiments, the panel of probes employs locked nucleic acid (LNA). When based on an effective specificity of 5, the LNA may be incorporated, for example, at nucleotide positions 1, 2, 4, 6 and 7 of heptamer probes, and at positions 1, 2, 4, and 6 or at positions 1, 3, 5 and 6 of the hexamer probes.

[0039] In exemplary embodiments, the average probe T_m of the universal panel is between about 40 and 55°C, such as about 49°C, and fewer than 5% of the probes in the universal panel have a T_m of less than about 20°C. In exemplary embodiments, the average single nucleotide match/mismatch discrimination (ΔT_m) of the universal panel is at least about 10°C, or at least about 20°C, or at least about 30°C. Such T_m values may be determined in the presence of hybridization buffers described herein, such as buffers containing high salt or TMAC. The universal panel may be constructed from the set of oligonucleotides shown in Tables 2 and 3.

[0040] In some embodiments, the probe set is designed to have a more uniform T_m across the panel. In such embodiments, the proportion of A and T may be increased relative to G and C at degenerate positions. Thus, where the labeled probes comprise oligonucleotides having the formula 5'-NXXXXXN-3', wherein X is a specified base and N is a degenerate position, the degenerate positions N are skewed toward A and T nucleotides. By raising the proportion of A and T, the T_m values of the universal panel can be balanced. For example, the proportion of A and T to G and C at degenerate positions may be about 3:2, 5:3, 2:1, 3:1, or 4:1. Alternatively, or in addition, the universal panel may be hybridized, for example, in a sequential manner, to the array of DNA samples, in the presence of agents that enhance the match/mismatch discrimination, such as tetramethylammonium chloride (TMAC). TMAC mitigates the preferential melting of A-T versus G-C base pairs, allowing the stringency to be a function of probe length.
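The effect of skewing the degenerate positions on the composition of a probe mixture can be illustrated as follows. This Python sketch assumes, for illustration only, that A and T share the A+T portion equally, that G and C share the G+C portion equally, and that the two degenerate positions are independent.

```python
from itertools import product

def base_fractions(at_to_gc):
    """Per-base fractions at one degenerate position for an (A+T):(G+C) ratio."""
    at, gc = at_to_gc
    total = at + gc
    return {"A": at / (2 * total), "T": at / (2 * total),
            "G": gc / (2 * total), "C": gc / (2 * total)}

def oligo_fractions(core, at_to_gc=(2, 1)):
    """Molar fraction of each concrete oligo in a 5'-N<core>N-3' probe mixture."""
    f = base_fractions(at_to_gc)
    return {left + core + right: f[left] * f[right]
            for left, right in product("ACGT", repeat=2)}

mix = oligo_fractions("AGTCG", at_to_gc=(2, 1))
```

With a 2:1 ratio, an oligo with two flanking A/T bases makes up 1/9 of the mixture and one with two flanking G/C bases makes up 1/36, compared with 1/16 each in an unskewed mixture.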

[0041] When imaging the array during sequencing, it is important to obtain background signals, such as by imaging the location(s) of the array in the presence of hybridization buffer only as a blank, that is, in the absence of labeled probe. The signal intensity values for the hybridization complexes are also normalized against signal intensity values upon hybridization of a universal reporter probe, which hybridizes to a sequence in each arrayed molecule. Hybridization complexes may be identified in images taken during hybridization with a universal reporter probe, hybridization complexes at corresponding positions then being examined in subsequent images taken during hybridization with each tiling probe in the panel. The hybridization signal intensity is quantified, for example, using a maximum pixel value for each detected hybridization complex as a raw value, and subtracting a background value.
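The quantification described above (maximum pixel value per feature, background subtraction, and normalization against the universal reporter probe) can be sketched in Python as follows. Images are represented as nested lists of pixel values, the feature window radius is hypothetical, and dividing by the background-corrected reporter signal is one plausible normalization rather than the specific formula of the invention.

```python
def max_pixel(image, row, col, radius=1):
    """Maximum pixel value in a small window centered on a feature."""
    r0, r1 = max(0, row - radius), min(len(image), row + radius + 1)
    c0, c1 = max(0, col - radius), min(len(image[0]), col + radius + 1)
    return max(image[r][c] for r in range(r0, r1) for c in range(c0, c1))

def normalized_signal(probe_img, blank_img, reporter_img, row, col, radius=1):
    """Background-subtracted probe signal relative to the universal reporter signal."""
    background = max_pixel(blank_img, row, col, radius)  # buffer-only image
    probe = max_pixel(probe_img, row, col, radius) - background
    reporter = max_pixel(reporter_img, row, col, radius) - background
    # A feature with no reporter signal above background carries no information
    return probe / reporter if reporter > 0 else 0.0
```

The feature coordinates (`row`, `col`) would come from the image taken with the universal reporter probe, and the same coordinates are then examined in each tiling-probe image.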

[0042] Fragments, corresponding to the insert in each arrayed sample, are positioned within one or more reference sequences by calculating an alignment score at each position in the reference sequence. For example, a window of width equal to the expected fragment length is scanned across the reference sequence, and for each window position (e.g., at one nucleotide intervals), the presence or absence of each probe sequence in the window is recorded. An alignment score for a fragment is then calculated for each position in the reference based on the observed hybridization spectrum. In some embodiments, the alignment score takes into account probe hybridization intensities, as described more fully herein. The position with the maximum score is selected.

[0043] The probable nucleotide at each position in the target sequence is determined based on the hybridization spectra of fragments that overlap each nucleotide position. In certain embodiments, basecalling operates on a probabilistic representation of hybridization, as described in detail herein.
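The window-scanning placement of a fragment can be illustrated with a minimal Python sketch. The scoring rule is an assumption made for illustration: each probe contributes its observed score (positive for probable match, negative for probable mismatch) if the probe sequence is present in the window, and the negated score if it is absent. Reverse-complement probes and intensity weighting are left out.

```python
def window_probes(ref, start, width, k=5):
    """Set of k-mer probe sequences present in one window of the reference."""
    window = ref[start:start + width]
    return {window[i:i + k] for i in range(len(window) - k + 1)}

def best_alignment(ref, spectrum, width, k=5):
    """Scan a window across ref and return (position, score) of the best placement.

    spectrum maps probe sequence -> observed hybridization score (hypothetical scale).
    """
    best_pos, best_score = None, float("-inf")
    for start in range(len(ref) - width + 1):
        present = window_probes(ref, start, width, k)
        score = sum(s if p in present else -s for p, s in spectrum.items())
        if score > best_score:
            best_pos, best_score = start, score
    return best_pos, best_score
```

This brute-force scan recomputes the probe set for every window; a practical implementation on a large genome would update the set incrementally as the window slides by one nucleotide.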

[0044] Thus, in accordance with the first aspect of the invention, tiling of the target sequence with indicative probes is achieved hierarchically using a small universal set of probes compatible with any genome. In certain embodiments, the method of the invention has four steps: (1) in situ rolling-circle amplification of millions of randomly dispersed circular single-stranded DNA templates; (2) sequential controlled hybridization of a universal panel of probes, thereby tiling each target molecule and generating for each target a hybridization spectrum; (3) alignment of hybridization spectra to the reference genome or sequence; and (4) reconstruction of the target sequence using the combined tiling patterns of all aligned fragments.

Preparation of Array and Rolling Circle Amplification

[0045] In certain embodiments, the DNA molecules are prepared by in situ rolling circle amplification (RCA). For example, genomic DNA is fragmented using any appropriate means (e.g., enzymatically or mechanically) and converted to single-stranded, circular molecules having a relatively uniform size. The single-stranded circular molecules may have an insert of about 100 bp, or about 200 bp, or more, and a linker. Templates may be annealed to surface-bound primer on a microscope glass slide and amplified by RCA to form covalently attached, tandem-repeated products that spontaneously curl up into a sub-micrometer structure. The approach has several desirable features for DNA sequencing. First, it is simple to perform, as shown in Fig. 1a. Second, the amplified templates generate easily detectable signal when visualized with fluorescent universal reporter probe (Fig. 1b) and with short sequence-specific probes (Fig. 1c). Third, the templates remain stable over hundreds of wash cycles (Fig. 1d), yet are readily accessible to hybridization due to their loose, single-stranded nature. Finally, the array density can be controlled to give 0.5 - 10 million, or more, resolvable features per cm².

[0046] Any random array synthesis protocol may be employed in accordance with the invention. In certain embodiments, the random array synthesis may comprise: providing a support (e.g. glass) with an activated surface; attaching primers, via a covalent or non-covalent bond; adding circular single-stranded templates at a density suitable for the detection equipment; annealing the templates to the primers; and amplifying using rolling-circle amplification to produce a long single-stranded tandem-repeated template attached to the surface at each position (see, e.g., Lizardi et al., "Mutation detection and single-molecule counting using isothermal rolling circle amplification", Nature Genetics vol. 19, p. 225). Modifications to this procedure include preannealing the circular template molecules to activated primers before immobilization, and/or providing "open-circle" template molecules which are circularized upon annealing to the primer and closed using a ligation reaction.

[0047] The density of the array is preferably one that maximizes throughput, e.g. a limiting dilution that ensures that as many as possible of the detectors (or pixels in a detector) detect a single template molecule. On any regular array, a perfect limiting dilution will make 37% of all positions hold a single template (because of the form of the Poisson distribution); the rest will hold none or more than one. For example, on a Tecan LS400 with a 6 μm pixel size, the 7.5x2.2 cm reaction surface holds 45 million pixels. With a limiting dilution (Poisson distribution), 37% of those would hold a single template, i.e. 17 million templates. Sequencing 150 nucleotides on each template yields 2.5 Gb of sequence in 150 cycles. With a cycle time of 5 minutes, daily throughput is about 5 Gbp, equivalent to two full sequences of the human genome. In practice, more than one pixel may be needed to reliably detect a feature, but the same reasoning holds whether the detector is a single pixel or multiple pixels.
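The arithmetic of the preceding paragraph can be retraced with a short Python sketch. It is an illustration only: the variable and function names are hypothetical, the input figures are those quoted above, and one pixel per feature is assumed.

```python
import math


def single_occupancy_fraction(mean_per_site: float) -> float:
    """Poisson probability that a site holds exactly one template."""
    return mean_per_site * math.exp(-mean_per_site)


# Figures quoted in the text (scanner pixel size and reaction surface).
pixel_um = 6.0                 # pixel edge length, micrometers
area_cm2 = 7.5 * 2.2           # reaction surface, cm^2
read_length = 150              # nucleotides sequenced per template
cycle_minutes = 5.0            # one cycle per nucleotide, as in the text

# 1 cm^2 = 1e8 square micrometers.
pixels = area_cm2 * 1e8 / pixel_um ** 2

# The single-occupancy fraction peaks at a mean of one template per pixel,
# where it equals 1/e (about 37%).
frac = single_occupancy_fraction(1.0)
templates = pixels * frac

bases_per_run = templates * read_length
runs_per_day = 24 * 60 / (read_length * cycle_minutes)
bases_per_day = bases_per_run * runs_per_day

print(f"pixels:            {pixels / 1e6:.1f} million")
print(f"single templates:  {templates / 1e6:.1f} million")
print(f"bases per run:     {bases_per_run / 1e9:.2f} Gb")
print(f"bases per day:     {bases_per_day / 1e9:.2f} Gb")
```

If several pixels are needed per feature, the same calculation applies with `pixels` replaced by the number of resolvable detection sites.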

[0048] Templates suitable for solid-phase RCA should optimize the yield (in terms of number of copies of the template sequence), while providing sequences appropriate for downstream applications. In general, small templates are preferable. In particular, templates can consist of a short primer binding sequence and a 40-500 bp insert, which may be a 40-200 bp insert. Templates up to 500 bp, up to 1000 bp, or up to 5000 bp are also possible, but may yield lower copy numbers and hence lower signals in the sequencing stage. The primer binding sequence may be used both to circularize an initially linear template and to initiate RCA after circularization, or the template may contain a separate RCA primer binding site.

[0049] In order to increase the signal generated from rolling circle-amplified templates it may be desirable to condense them. Since an RCA product is essentially a single-stranded DNA molecule consisting of as many as 1000 or even 10000 tandem replicas of the original circular template, the molecule will be very long. For example, a 100 bp template amplified 1000 times using RCA would be on the order of 30 μm long, and would thus spread its signal across several different pixels (assuming 5 μm pixel resolution). Using lower-resolution instruments may not be helpful, since the thin ssDNA product occupies only a very small portion of the area of a 30 μm pixel and may therefore not be detectable. Thus, it is desirable to be able to condense the signal into a smaller area. The RCA product may be condensed by using epitope-labeled nucleotides and a multivalent antibody as crosslinker. Alternative approaches include biotinylated nucleotides cross-linked by streptavidin. Alternatively, condensation may be achieved using DNA condensing agents such as CTAB (see e.g. Bloomfield, 'DNA condensation by multivalent cations' in 'Biopolymers: Nucleic Acid Sciences').

[0050] In order to immobilize the RCA primer oligonucleotides to a surface, many different approaches have been described (see e.g., Lindroos et al., "Minisequencing on oligonucleotide arrays: comparison of immobilization chemistries", Nucleic Acids Research 2001; 29(13): e69). For example, biotinylated oligos may be attached to streptavidin-coated arrays; NH₂-modified oligos may be covalently attached to epoxy silane-derivatized or isothiocyanate-coated glass slides; succinylated oligos may be coupled to aminophenyl- or aminopropyl-derived glass by peptide bonds; and disulfide-modified oligos may be immobilized on mercaptosilanized glass by a thiol/disulfide exchange reaction. Many more have been described in the literature.

[0051] In certain embodiments, the target sequence is enriched from a larger pool of sequences. Target nucleic acids of interest may be nucleic acid segments identified from whole genome association studies in a disease cohort. Such a disease cohort may comprise DNA samples from patients with diseases or complex genetic traits such as: Crohn disease, psoriasis, baldness, longevity, schizophrenia, diabetes, diabetic retinopathy, ADHD, endometriosis, asthma, an autoimmune-related disease, an inflammatory-related disease, a respiratory-related disease, a gastrointestinal-related disease, a reproduction-related disease, a women's health-related disease, a dermatological-related disease, and an ophthalmologic-related disease.

Universal Probe Panel and Hybridization

[0052] In some embodiments, as described above, the probes are hybridized sequentially, or in a largely sequential fashion, and thus, their number and length should be limited. The panel of probes may use mixtures of 16 heptamer oligos acting effectively as pentamers (i.e. heptamers with two degenerate positions; Fig. 2a, Tables 2 and 3). In order to render the melting points practical, locked nucleic acid (LNA) (26) monomers may be incorporated, and tetramethylammonium chloride (TMAC) may be added to hybridization buffers. Since shotgun fragments may be obtained from both strands of a genome, half of all 1024 possible 5-mers suffices to tile the reference genome at every position on either strand. Thus, in some embodiments, the set of tiling probes comprises (or consists essentially of) the minimal set of probes required to completely tile any target sequence, with each position targeted on one strand or the other. In certain embodiments, the tiling panel consists of less than 800, or less than 700, or less than 600 tiling probes.
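The halving argument in the preceding paragraph can be checked with a short Python sketch. It is an illustration only: the function names are hypothetical, each probe is identified with the pentamer it reads, and keeping the lexicographically smaller member of each reverse-complement pair is an arbitrary assumption, not the selection used for Tables 2 and 3.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_minimal_panel(k: int = 5) -> list[str]:
    """Keep one k-mer from each reverse-complement pair.

    Assumption for illustration: the lexicographically smaller member
    of each pair is kept.
    """
    panel = []
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer <= reverse_complement(kmer):
            panel.append(kmer)
    return panel


def tiles_either_strand(target: str, panel: set[str], k: int = 5) -> bool:
    """True if every k-mer window of the target, or its reverse
    complement, is present in the panel."""
    windows = (target[i:i + k] for i in range(len(target) - k + 1))
    return all(w in panel or reverse_complement(w) in panel for w in windows)


panel = strand_minimal_panel(5)
print(len(panel))
print(tiles_either_strand("GATTACAGCTCGA", set(panel)))
```

Because no odd-length sequence equals its own reverse complement, the 1024 pentamers fall into exactly 512 pairs, so the sketch returns 512 probes.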

[0053] Further, the probe set is designed to show reasonable melting points and excellent match/mismatch discrimination, as determined by melting curve analysis with perfect match and single-mismatch DNA targets. In exemplary embodiments, the average melting point Tm is about 49°C (Fig. 2c) and the average single-nucleotide match/mismatch discrimination ΔTm is about 30°C (Fig. 2d). Fewer than twenty probes in the exemplary panel showed Tm < 20°C or ΔTm < 10°C. Although the full probe set was assayed for match/mismatch discrimination only at the central nucleotide position, no difference in performance was observed at the five central positions when the probes were used in sequencing (not shown).

[0054] The probe set may be synthesized with approximately equimolar ratios at degenerate positions ('N'), so that the amounts of each of the 16 individual oligonucleotides comprising each probe are approximately equal. However, oligonucleotides within each probe that have N = G or C show stronger binding than oligonucleotides that have N = A or T, as expected. This results in a narrow temperature range for optimal hybridization, since the mismatch-Tm of GC-rich oligonucleotides is close to the match-Tm of AT-rich oligonucleotides (Fig. 7a). Thus, in certain embodiments, it may be desirable to balance the relative concentrations of GC vs. AT at degenerate positions (i.e. by increasing the amounts of adenosine and thymine relative to guanine and cytosine during oligonucleotide synthesis at degenerate positions; Fig. 7b). Further, some probes may self-dimerize, resulting in weak signals and thus substandard data for these probes. Self-dimerization may be eliminated by shortening these probes to hexamers (Fig. 7c), selectively disrupting the self-dimer (which loses two interactions) relative to target hybridization (which loses only one). These modifications may substantially increase overall sequencing accuracy.

[0055] The panel of probes, as described above, and the target length (as described above) are optimized so that the spectra can be used both (1) to locate unambiguously each target sequence in the reference sequence and (2) to resolve accurately any sequence difference between the target and the reference sequence.

[0056] In order to fulfill the first requirement, the panel contains enough information (in the information-theoretic sense) to unambiguously locate the target. In accordance with these embodiments, a preferred panel contains probes with a 50% statistical probability of hybridizing to each target, corresponding to 1 bit of information per probe. 50 such probes would be capable of discriminating more than 1000 billion targets. Such panels have the additional advantage of being resilient to error and to genetic polymorphisms. A panel of 100 4-mer probes is capable of uniquely placing 100 bp targets in the human transcriptome even in the presence of up to 10 SNPs.

[0057] In order to fulfill the second requirement, the panel of probes must cover the target and must be designed such that sequence differences result in unambiguous changes in the spectrum. For example, a panel of all possible 4-mer probes would completely cover any given target with four-fold redundancy. Any single-nucleotide change would result in the loss of hybridization of four probes and the gain of four other characteristic probes.
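The four-lost, four-gained behaviour described above can be illustrated with a minimal Python sketch. The sequences and function name are hypothetical, hybridization is idealized as exact 4-mer matching, and the toy target was chosen to contain no repeated 4-mers.

```python
def spectrum(seq: str, k: int = 4) -> set[str]:
    """Idealized hybridization spectrum: the set of k-mers present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical 13-base reference without repeated 4-mers.
reference = "GATTACAGCTCGA"
# The same sequence with a single A -> T substitution at index 6.
variant = reference[:6] + "T" + reference[7:]

lost = spectrum(reference) - spectrum(variant)     # probes that stop hybridizing
gained = spectrum(variant) - spectrum(reference)   # probes that newly hybridize

print(sorted(lost))
print(sorted(gained))
```

For this toy example the lost probes are ACAG, AGCT, CAGC and TACA, and the gained probes are ACTG, CTGC, TACT and TGCT. When a probe occurs more than once in the target, fewer than four probes may change.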

[0058] The sensitivity of a probe panel can be calculated:

[0059] A probe is a mixture of one or more oligonucleotides. The mixture and the sequence of each oligonucleotide define the specificity of the probe. The dilution factor of a probe is the number of oligonucleotides it contains. The effective specificity of a probe is given by the length of a non-degenerate oligonucleotide with the same probability of binding to a target. For example, a 6-mer probe consisting of four oligonucleotides where the first position is varied among all four nucleotides (i.e. is completely degenerate) has an effective specificity of 5 nucleotides.

[0060] A panel is preferably a set of k-mer probes with the preferred property that any given k long target is hybridized by one and only one probe in the panel. Thus, a panel may be a complete and non-redundant set of probes.

[0061] The complexity C of a probe panel is the number of probes in the panel.

[0062] The sensitivity of a position within a panel is the set of different targets it can discriminate at that position. For example, a panel where the probes are either GC mixed or AT mixed at a position (denoted GC/AT) is sensitive to G-A, C-A, C-T and G-T differences (i.e. transitions), but not to transversions (G to C etc).

[0063] When probing with a full panel of probes, each position in the target is guaranteed to be probed by each position in the panel, i.e. by k staggered overlapping probes. However, the sensitivity of each position may be different, so that some differences in the target are only detectable by less than k probes.

[0064] For example, the panel given by

(GCAT) (GC/AT) (GC/AT) (G/C/A/T) (G/C/A/T) (GC/AT) (GC/AT) (GCAT) has 8 positions (i.e. k = 8). The first and last position are completely degenerate, so no change in the target is detected by those positions. Transitions (GC <-> AT) are detected by 6 positions, while transversions (GA <-> CT) are detected by only two positions in each probe. The effective specificity can be calculated by summing the effective specificity of each position: 0 + 0.5 + 0.5 + 1 + 1 + 0.5 + 0.5 + 0 = 4 bp.
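The per-position sum above can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the parser and function name are hypothetical, and the effective specificity of a position is taken as 1 - log4(n), where n is the number of bases pooled within one probe at that position (4 for a fully degenerate position, 2 for GC/AT, 1 for a fully specified base).

```python
import math
import re


def effective_specificity(panel: str) -> float:
    """Sum the per-position effective specificity (in bp) of a panel
    written in the notation used above, e.g. "(GCAT) (GC/AT) (G/C/A/T)".

    Assumption: within a position, groups separated by "/" are the
    alternatives distinguished by the panel, and all groups have equal size.
    """
    total = 0.0
    for position in re.findall(r"\(([^)]*)\)", panel):
        groups = position.replace(" ", "").split("/")
        pooled = len(groups[0])           # bases mixed within one probe
        total += 1.0 - math.log(pooled, 4)
    return total


PANEL = "(GCAT) (GC/AT) (GC/AT) (G/C/A/T) (G/C/A/T) (GC/AT) (GC/AT) (GCAT)"
print(effective_specificity(PANEL))

# The 6-mer with one fully degenerate position described earlier.
SIX_MER = "(GCAT) (G/C/A/T) (G/C/A/T) (G/C/A/T) (G/C/A/T) (G/C/A/T)"
print(effective_specificity(SIX_MER))
```

The first call gives approximately 4 bp, matching the sum in the text, and the second gives approximately 5 bp, matching the 6-mer example of paragraph [0059].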

[0065] For non-trivial targets, it will often be the case that probes are repeated in the target. Such probes lose their sensitivity to changes at any single position, since they will still hybridize to the other.

[0066] Given the length L of the target, one can calculate the probability (for each position in the target) that there is at least one probe sensitive to a change at that position. First, determine how many probes are sensitive to the change of interest in a repeat-free target. Call this k_c; k_c is 6 for transitions and 2 for transversions in the previous example.

[0067] The probability p(R) that any given probe is present in one or more of the other positions in the target (i.e. that it is repeated) is:

[0068] The probability p(S) that not all of the 2k_c sensitive probes are repeated is then:

$$p(S) = 1 - p(R)^{2k_c}$$

The exponent is 2k_c because any change causes the disappearance of k_c probes and the appearance of k_c new probes.

[0069] The sensitivity given the target length may be calculated. For example, C = 256, k_c = 2, L = 120 gives p = 98%, i.e. the panel with 256 probes is sensitive to 98% of all transversions (and 100% of transitions, k_c = 6). If only half of the probes in the panel are used, so that the effective k_c = 1, then p = 86% for transversions and 99.7% for transitions (k_c = 3). The overall average sensitivity in a species like the human (which has 63% transitions) would be 95%.
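The figures in the preceding example can be retraced numerically. The expression for p(R) is not reproduced in the text above, so the Python sketch below assumes one plausible form, p(R) = 1 - (1 - 1/C)^(L-1), i.e. each of the other L - 1 positions independently carries a given probe with probability 1/C. This form is an assumption; it is used here only because it reproduces the quoted percentages. The function names are hypothetical.

```python
def p_repeat(C: int, L: int) -> float:
    """ASSUMED form of p(R): probability that a given probe also occurs
    at one or more of the other L - 1 positions of the target."""
    return 1.0 - (1.0 - 1.0 / C) ** (L - 1)


def p_sensitive(C: int, L: int, k_c: int) -> float:
    """p(S): probability that not all 2*k_c sensitive probes are repeated."""
    return 1.0 - p_repeat(C, L) ** (2 * k_c)


C, L = 256, 120
full_transversions = p_sensitive(C, L, k_c=2)   # full panel
half_transversions = p_sensitive(C, L, k_c=1)   # half of the probes used
half_transitions = p_sensitive(C, L, k_c=3)

# Weighted by the 63% transition fraction quoted in the text.
overall = 0.63 * half_transitions + 0.37 * half_transversions

print(f"full panel, k_c = 2: {full_transversions:.1%}")
print(f"half panel, k_c = 1: {half_transversions:.1%}")
print(f"half panel, k_c = 3: {half_transitions:.1%}")
print(f"overall average:     {overall:.1%}")
```

Under this assumption the sketch yields roughly 98%, 86%, 99.7% and 95%, in line with the values stated above.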

[0070] The theory is strictly valid as long as the number of SNPs is low compared with the target length - i.e. as long as multiple SNPs do not occur within the length of one probe. In practical experiments this is almost always true: for example, human genomic DNA contains about 1 SNP per 1000 nucleotides, and two SNPs within 7 bases is thus very unlikely.

[0071] In practice, at least two sensitive probes may be required to score a SNP (i.e. because hybridization data is error-prone). In that case, the probability p(S) becomes $1 - p(R)^{2k_c - 1}$, and the calculations are again straightforward.

[0072] When working with subsets of panels (in order to save time and reagents), it may be desirable to nevertheless guarantee that any position in the target is probed on one strand or the other. In other words, a subset of probes is determined such that any k-mer that is not probed is guaranteed to be probed on the opposite strand. Such subsets can be obtained by placing (G/A), (C/T), (G/T) or (C/A) in the middle position. For example (G/A) will fail to probe G and A in the target, in which case the opposite strand is guaranteed to be either C or T, which are probed. Other variations are possible.

[0073] The (GC/AT) degenerate position has two desirable features. First, it guarantees that the individual oligos in each probe have similar melting point (since they will either be all GC or all AT). Second, the position will be sensitive to transitions which represent 63% of all SNPs in humans.

[0074] In the present invention, it is envisaged that a panel of probes is sequentially hybridized to the targets. In order to limit the complexity of the panel of probes, it is desirable to keep the probes short, preferably to have only 3 - 6 bp effective specificity.

[0075] The probes are stabilized in order for them to hybridize efficiently. In addition, stabilization may help the probe compete with any internal secondary structure that may be present in the target. Stabilization can be achieved in many different ways. For example, stabilization may be achieved through stabilizing additives in the hybridization reaction, for instance salt, CTAB, magnesium, or stabilizing proteins. Alternatively, or in addition, stabilization may be achieved through the addition of degenerate positions that extend the length of the probe without increasing its complexity. For example, a 6-mer probe extended with an 'N' position would really be a mixture of four oligonucleotides, each 7 bases long. A (GC/AT) position - indicating a mix of G and C or a mix of A and T - would extend the probe by one base while only doubling the complexity (instead of quadrupling it). Alternatively, or in addition, stabilization may be achieved through modification of the probe chemistry, for example by means of locked nucleic acid (Exiqon, Denmark), peptide nucleic acid and/or minor groove binder (Epoch Biosciences, US). Stabilization may of course be obtained through any combination of the above, including a combination of a degenerate probe with LNA hybridized in CTAB buffer.

[0076] Many approaches are known for detecting hybridization. For example, the methods, probes, and kits described herein may employ direct fluorescence, where the probe is labeled and hybridization is detected by the increased local concentration of probes hybridized to the target. This may require high magnification, confocal optics or total internal reflection excitation (TIRF). Alternatively, the methods, probes, and kits may employ energy transfer, where the probe is labeled with a quencher or donor and the target is labeled with counterpart donor or quencher. In these embodiments, hybridization is detected by the decrease of donor fluorescence and/or the increase in quencher fluorescence. Alternatively still, the methods, probes, and kits may employ single-base extension, where the hybridized probe serves as primer for a single base extension reaction incorporating fluorescent dye (alternatively, released PPi may be detected as in Pyrosequencing).

[0077] In one embodiment, the probe is labeled by a fluorophor detectable in an epifluorescence microscope or a laser scanner, for example Cy3. Many other suitable dyes are commercially available. The probe is hybridized to the array at a concentration optimized to permit detection of the local increase in concentration at a hybridized array feature, over the background present in all the liquid. For example, 400 nM may be used, or the probe may be hybridized at 1 nM up to 500 nM or even 500 nM up to 5 μM depending on the optical setup. The advantage of this detection scheme is that it avoids a washing step, so that detection can proceed at equilibrium hybridization conditions, which facilitates match/mismatch discrimination.

[0078] When using an energy transfer approach, the target may carry a permanently hybridized helper oligonucleotide with a fluorescence donor. The helper is designed to withstand washes that would melt away the short probes. The probes carry a dark quencher. For example, the donor may be fluorescein and the quencher Eclipse Dark Quencher (Epoch Biosciences). Many other donor/quencher pairs are known (see e.g. Haugland, R.P., 'Handbook of fluorescent probes and research chemicals', Molecular Probes Inc., USA). In general, it is desirable to have a probe with a long Förster radius, capable of quenching over long distances. Hybridization is detected by the quenching of the donor fluorophor upon hybridization of the probe.

Spectral search and alignment

[0079] A crucial aspect of shotgun SBH is the long read length, which should facilitate assembly of vertebrate genomes. In fact, reads of at least 60 bp are required to cover most of the human genome (29). Nevertheless, scaling up assembly to gigabase-sized, highly repetitive genomes poses a number of additional challenges. The algorithms as described and/or exemplified herein assemble a haploid consensus sequence. For diploid genomes the method must be modified to allow heterozygous basecalls. One approach would be to sequence batches of pooled large-insert clones (e.g. BACs), simultaneously reducing the genome alignment problem (by reducing the effective genome size), the heterozygous assembly problem (since individual clones would all be haploid), and the problem of assembling long-range haplotypes (by overlapping and distinguishing clones originating from the two haplotypes along each chromosome).

[0080] During spectral alignment and base calling, careful attention should be paid to maximizing sequence accuracy, as well as to providing a reliable (and phred-compatible) quality measure for every nucleotide position. This should simplify the interpretation of the called sequence and its use in downstream analyses.

[0081] Given the spectrum of a target, the location of the target within the reference sequence is sought, allowing for sequence differences. The search can be performed by simply scanning the reference sequence with a window of the same size as the target, computing an expected spectrum for each position and comparing the expected spectrum with the observed spectrum at the position. The highest-scoring position or positions are returned. Because the method of the invention generates very large numbers of hybridization spectra in a short time, it is important to optimize the search step. For example, in a current implementation, spectral search proceeds at 1.2 billion matches per second on a high-end workstation, and we estimate that ten workstations will be required to keep up with a single sequencing instrument. It is another aspect of the invention to accelerate the search using programmable hardware, i.e. field-programmable gate arrays (FPGA). By translating the search algorithm to Mitrion-C (Mitrion AB, Sweden), an acceleration of 30 times can be achieved using just two FPGA chips in a single workstation computer.
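The window scan described above can be sketched in Python as follows. This is a minimal, unoptimized illustration, not the implementation referred to in the text: the function names are hypothetical, hybridization is idealized as exact k-mer matching on one strand, the panel is assumed to be all 256 tetramers, and a simple binary agreement count is assumed as the score.

```python
import random
from itertools import product


def expected_spectrum(window: str, k: int) -> frozenset[str]:
    """Idealized spectrum of a window: the set of k-mers it contains."""
    return frozenset(window[i:i + k] for i in range(len(window) - k + 1))


def binary_overlap(observed, expected, panel) -> int:
    """Count probes whose presence or absence agrees in both spectra."""
    return sum((probe in observed) == (probe in expected) for probe in panel)


def spectral_search(reference: str, observed, target_len: int, k: int, panel):
    """Slide a window of target_len over the reference and return the best
    score together with every start position achieving it."""
    best_score, best_positions = -1, []
    for start in range(len(reference) - target_len + 1):
        window = reference[start:start + target_len]
        score = binary_overlap(observed, expected_spectrum(window, k), panel)
        if score > best_score:
            best_score, best_positions = score, [start]
        elif score == best_score:
            best_positions.append(start)
    return best_score, best_positions


# Toy data: a pseudo-random 500-base reference and a 40-base target
# copied from position 100 (all values hypothetical).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(500))
panel = ["".join(p) for p in product("ACGT", repeat=4)]
true_start, target_len = 100, 40
observed = expected_spectrum(reference[true_start:true_start + target_len], 4)

score, positions = spectral_search(reference, observed, target_len, 4, panel)
print(score, positions)
```

With an error-free spectrum the true position attains the maximum score, equal to the panel size. A practical implementation would update the expected spectrum incrementally as the window advances rather than recomputing it at each position.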

[0082] The reference sequence will be a similar sequence to the target. Similarity between a reference sequence and a target can be measured in many ways. For example, the proportion of identical nucleotide positions is commonly used. More advanced measures allow for insertions and deletions, e.g. as in Smith-Waterman alignment, and provide a probabilistic similarity score as in Durbin et al. "Biological Sequence Analysis" (Cambridge University Press 1998).

[0083] The degree of similarity required for the method of the present invention is determined by several factors, including the number and specificity of the probes used, the quality of the hybridization data, the template length and the size of the reference database. For example, simulations show that under the assumption of degree melting point difference between match and mismatch probes (with 1 degree coefficient of variation), 256 probes and using the human genome as reference with 100 bp templates, then up to 5% sequence divergence can be tolerated. This corresponds for example to sequencing the Gorilla genome using the human genome as reference. Further increasing the number of probes, decreasing the length of the templates or improving the match/mismatch discrimination allows sequences of even lower similarity to be used as reference, e.g. 5-10%, up to 10%, 5-20%, 10-20% or up to 20%.

[0084] Once one or more likely locations for a fragment have been found within the reference, a modification to the reference sequence is sought that will explain any discrepancies between the observed and expected spectra. We may at this stage introduce relevant modifications to the reference sequence, e.g. SNPs, short indels, long indels, microsatellites, splice variants etc. For each modification or combination of modifications, we again compute a score for the similarity between the observed and expected spectra. The most likely modified reference sequence or sequences are returned. Methods for searching very large parameter spaces are known in the art, e.g. Gibbs sampling, Markov-chain Monte Carlo (MCMC) and the Metropolis-Hastings algorithm.
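As a minimal illustration of the refinement step, the Python sketch below tries every single-base substitution of an aligned reference window and keeps the one whose expected spectrum best matches the observed spectrum. It is a deliberately simplified stand-in: an exhaustive single-substitution scan is used in place of the sampling methods named above, indels and multiple changes are not considered, hybridization is idealized as exact 4-mer matching, and all names and sequences are hypothetical.

```python
def spectrum(seq: str, k: int) -> set[str]:
    """Idealized spectrum: the set of k-mers present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(observed: set[str], candidate: str, k: int) -> int:
    """Negative count of probes that disagree between the observed spectrum
    and the candidate's expected spectrum (0 is a perfect match)."""
    return -len(observed ^ spectrum(candidate, k))


def best_single_substitution(window: str, observed: set[str], k: int):
    """Return (score, change), where change is None if the unmodified
    window scores best, else a tuple (index, reference_base, new_base)."""
    best = (score(observed, window, k), None)
    for i, ref_base in enumerate(window):
        for alt in "ACGT":
            if alt == ref_base:
                continue
            candidate = window[:i] + alt + window[i + 1:]
            s = score(observed, candidate, k)
            if s > best[0]:
                best = (s, (i, ref_base, alt))
    return best


# Hypothetical reference window and a target carrying one A -> T change.
window = "GATTACAGCTCGA"
target = window[:6] + "T" + window[7:]
observed = spectrum(target, 4)

result = best_single_substitution(window, observed, 4)
print(result)
```

For this toy input the scan recovers the substitution at index 6 (A to T) with a perfect score of 0. With noisy spectra, a probabilistic score would replace the simple disagreement count.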

[0085] When comparing spectra, a simple binary overlap score may be used (scoring 1 for each probe that either does or does not hybridize in both spectra, 0 otherwise), or a more sophisticated statistical approach may use gradual or probabilistic measures of spectral overlap. Where multiple targets locate to the same position in the reference, higher-level analysis may then be performed to assess the confidence in any sequence differences.

Probes, Panel, and Kits

[0086] In a second aspect, the invention provides a set or panel of probes wherein each probe comprises an oligonucleotide, each of which is stabilized, and each of which carries a reporter moiety. The effective specificity of each probe may be from 3 to 10 bp, such as from 4 to 6 bp. For example, the effective specificity may be 3, 4, 5, 6, 7, 8, 9 or 10 bp. In an exemplary embodiment, the labeled tiling probes are each designed as an oligonucleotide hexamer or heptamer, with potentially dimerizing probes prepared as hexamers. For example, each labeled probe in the universal panel may contain a pentamer probe sequence with one or two flanking degenerate nucleotides (thus providing an effective specificity of 5). In accordance with such embodiments, the labeled probes comprise oligonucleotides having the formula 5'-NXXXXXN-3', wherein X is a specified base and N is a degenerate position, with the proviso that heptamer probes having a propensity to dimerize are constructed as hexamers having a single degenerate position.

[0087] Because the efficiency of the system can be impacted by the sequential nature of certain embodiments, the universal panel is generally optimized to reduce redundancy. For example, the panel may statistically hybridize to at least 10% of all positions in a target sequence, or may statistically hybridize to at least 25%, at least 50%, or at least 90% of all positions in a target sequence. In an exemplary embodiment, the set of probes hybridizes to 100% of all positions in a target sequence or its reverse complement, such that each position in the target or the reverse complement of the target at that position is hybridized by at least one, or exactly one probe (statistically), in the panel. For example, a preferred panel based on an effective specificity of 5 comprises, or consists essentially of, about ½ of all possible pentamer sequences, e.g., about 512 labeled tiling probes. In such embodiments, the panel may be designed to exclude reverse complementary pentamer sequences. Generally, the probe panel contains fewer than about 800, fewer than about 700, or fewer than about 600 tiling probes. Exemplary sets of such probes are shown in Tables 2 and 3.

[0088] The structure and/or chemical structure of the probes may further be designed to optimize hybridization efficiency, match/mismatch discrimination, and to allow a uniform, or substantially uniform, hybridization temperature (e.g., T_m). For example, in certain embodiments, the panel of probes employs locked nucleic acid (LNA). When based on an effective specificity of 5, the locked nucleic acid (LNA) may be incorporated at nucleotide positions 1, 2, 4, 6 and 7 of the heptamer probes, and at positions 1, 2, 4, and 6 or at positions 1, 3, 5 and 6 of the hexamer probes.

[0089] In exemplary embodiments, the average probe T_m of the universal panel is between about 40 and 55°C, such as about 49°C, and fewer than 5% of the probes in the universal panel have a T_m of less than about 20°C. In exemplary embodiments, the average single nucleotide match/mismatch discrimination (ΔT_m) of the universal panel is at least about 10°C, or at least about 20°C, or at least about 30°C. Such T_m values may be determined in the presence of high salt buffer or buffer containing TMAC. The universal panel may be constructed from the set of oligonucleotides shown in Tables 2 and 3.

[0090] In some embodiments, the probe set is designed to have a more uniform T_m across the panel. In some embodiments, the proportion of A and T may be increased relative to G and C at degenerate positions. Thus, where the labeled probes comprise oligonucleotides having the formula 5'-NXXXXXN-3', wherein X is a specified base and N is a degenerate position, the degenerate positions N are skewed toward A and T nucleotides. By raising the proportion of A and T, the T_m values of the universal panel can be balanced. For example, the proportion of A and T to G and C at degenerate positions may be about 3:2, 5:3, 2:1, 3:1, or 4:1. Alternatively, or in addition, the universal panel may be hybridized, for example, in a sequential manner, to the array of DNA samples, in the presence of agents that enhance the match/mismatch discrimination, such as tetramethylammonium chloride (TMAC). TMAC mitigates the preferential melting of A-T versus G-C base pairs, allowing the stringency to be a function of probe length.

[0091] The panel of labeled probes further comprises one or more universal reporter probes, for hybridizing to all of the arrayed DNA molecules.

[0092] The reporter moiety (or label) may for example be selected from the group consisting of a fluorophor, a quencher, a dark quencher, a redox label, and a chemically reactive group which can be labeled by enzymatic or chemical means, for example a free 3'- OH for primer extension with labeled nucleotides or an amine for chemical labeling after hybridization.

[0093] The panel of probes may be supplied in the form of a kit, together with one or more reagents for amplifying target molecules by RCA. For example, the kit may comprise, in addition to the panel of probes, a DNA polymerase suitable for amplifying single-stranded circular DNA by RCA (e.g., Phi29). The probes may each be supplied in separate vials, in concentrated, diluted, lyophilized, or other form. Other components of the kit might include suitable buffers for RCA and/or hybridization (as described herein), one or more solid support(s), and an RCA primer (as described herein), which may be covalently attached to the solid support.

[0094] A further aspect of the invention provides a random array of single-stranded DNA molecules, wherein each molecule consists of at least two tandem-repeated copies of an initial sequence, each molecule is immobilized on a surface at random locations with a density of between 10³ and 10⁷ per cm², preferably between 10⁴ and 10⁵ per cm², or preferably between 10⁵ per cm² and 10⁷ per cm², each initial sequence represents a random fragment from an initial target DNA or RNA library comprising a mixture of single- or double-stranded RNA or DNA molecules, and the initial sequences of all said DNA molecules have approximately the same length. Generally, the molecules will comprise at least 100 tandem-repeated copies of an initial sequence, usually at least 1000, or at least 2000, preferably up to 20,000. The molecules may comprise 50 or more tandem-repeated copies of an initial sequence, which is detectable using standard microscopy.

An apparatus for automated high-throughput sequencing

[0095] In a third aspect, the present invention provides for a sequencing apparatus that cycles a number of reagent solutions (e.g., probe and/or hybridization solutions) through a reaction chamber placed on or in a detector, optionally with thermal control. Optionally, the apparatus is operationally coupled to a work station for conducting sequence analysis, that is, for aligning fragments with one or more reference sequences and for basecalling. Thus, the work station may be programmed to perform the analyses described herein.

[0096] In one example, the detector is a CCD imager, which may for example be operating by white light directed through a filter cube to create separate excitation and emission light paths suitable for a fluorophore bound to each target. For instance, a Kodak KAF-16801E CCD may be used; it has 16.7 million pixels, and an imaging time of ~2 seconds. Daily sequencing throughput on such an instrument would be up to 10 Gbp.

[0097] The reaction chamber provides: easy access for the optics, a closed reaction chamber, an inlet for injecting and removing reagents from the reaction chamber, and an outlet to allow air and reagents to enter and exit the chamber.

[0098] A reaction chamber may be constructed in standard microarray slide format, suitable for being inserted in an imaging instrument. The reaction chamber can be inserted into the instrument and remain there during the entire sequencing reaction. A pump and reagent flasks supply reagents according to a fixed protocol and a computer controls both the pump and the scanner, alternating between reaction and scanning. Optionally, the reaction chamber may be temperature-controlled. Also optionally, the reaction chamber may be placed on a positioning stage to permit imaging of multiple locations on the chamber.

[0099] A dispenser unit may be connected to a motorized valve to direct the flow of reagents, the whole system being run under the control of a computer. An integrated system would consist of the scanner, the dispenser, the valves and reservoirs and the controlling computer.

[00100] In certain embodiments, the instrument comprises: an imaging component able to detect an incorporated or released label, a reaction chamber for holding one or more attached templates such that they are accessible to the imaging component at least once per cycle, and a reagent distribution system for providing reagents to the reaction chamber.

[00101] The reaction chamber may provide, and the imaging component may be able to resolve, attached templates at a density of at least 100/cm², optionally at least 1000/cm², at least 10 000/cm² or at least 100 000/cm², or at least 1 000 000/cm², at least 10 000 000/cm² or at least 100 000 000 per cm².

[00102] The imaging component may for example employ a system or device selected from the group consisting of photomultiplier tubes, photodiodes, charge-coupled devices, CMOS imaging chips, near-field scanning microscopes, far-field confocal microscopes, wide-field epi-illumination microscopes and total internal reflection microscopes. The imaging component may detect fluorescent labels, or alternatively, the imaging component may detect laser-induced fluorescence.

[0103] In one embodiment of an instrument according to the present invention, the reaction chamber is a closed structure comprising a transparent surface, a lid, and ports for attaching the reaction chamber to the reagent distribution system, the transparent surface holds template molecules on its inner surface and the imaging component is able to image through the transparent surface.

[0104] Raw sequence throughput might be further increased by decreasing fluidics cycle time. Specifically, hybridization kinetics were observed to be fast (on the order of a few seconds), so the fluidics cycle speed was dominated by the speed of liquid handling and temperature change. The fluidics cycle time might be effectively eliminated by using two flow cells and alternating between imaging and reacting. Furthermore, relatively sparse arrays are exemplified herein to avoid excessive numbers of unresolved overlapping image features. However, the maximum number of non-overlapping features would be obtained at much higher densities. Thus, if overlapping features are efficiently detected and removed, the raw sequence yield per slide would be at least tripled. The combined effect of these improvements would be to increase the throughput as much as ten-fold, and, notably, the suggested improvements would carry little or no additional cost, implying that any increased throughput would result in a corresponding reduction in cost per base sequenced.

Examples of Applications

[0105] By sequencing cDNA fragments at random, the expression level of the corresponding RNA can be quantified by counting the number of occurrences of fragments from each RNA. Structural features (splice variants, 5'/3' UTR variants etc.) and genetic polymorphisms can be simultaneously discovered.

[0106] Shotgun sequencing of whole genomes can be used to genotype individuals by noticing the occurrence of sequence differences with respect to the reference genome. For example, SNPs and indels (insertion/deletion) can easily be discovered and genotyped in this way. In order to discriminate heterozygotic sites, dense fragment coverage may be required to ensure that both alleles will be sequenced.

[0107] Further aspects and embodiments of the present invention will be apparent to the skilled person in the light of the present disclosure. All documents cited anywhere in the specification are incorporated by reference.

EXAMPLES

[0108] Probes. Synthetic oligonucleotides may be purchased from Sigma Proligo, France. Probes were of the general formula 5'-Cy3-NXXXXXN-3' (X are specified bases, N are degenerate positions), with LNA at positions 1, 2, 4, 6 and 7 and DNA at positions 3 and 5. For example, one probe was 5'-Cy3-NCGCATN-3'. Each probe was quality controlled by mass spectrometry and capillary electrophoresis (not shown), and functionally validated as follows. For each probe, perfect match (for example, 5'-AANATGCGNAA-6FAM-3', SEQ ID NO.:1) and mismatch (for example, 5'-AANATGGGNAA-6FAM-3', SEQ ID NO.:2) targets were synthesized using DNA monomers. The melting temperature Tm and the match/mismatch discrimination ΔTm were calculated from melting curves obtained for the probe against the two targets separately. Hybridization (in 2.5M TMAC, 50 mM Tris-HCl pH 8.00, 0.05% Tween-20) was measured by fluorescence resonance energy transfer (FRET) between the 6FAM and Cy3 dyes in a real-time PCR instrument (7900HT, Applied Biosystems). Probe sequences and melting points are reported in Table 2.

[0109] Sample preparation. 4 μg genomic DNA (Bacteriophage λ from New England Biolabs, Ipswich, MA; E. coli K12 strain MG1655 from LGC Promochem, Boras, Sweden) was fragmented enzymatically in 50 mM Tris-HCl pH 7.5, 50 μg/ml BSA, 10 mM MnCl₂ and 0.04 U DNaseI (NEB) in a total volume of 120 μL. Two reactions were incubated at 25°C for 10 and 15 min, then stopped with 4.2 μL 0.5 M EDTA and purified on silica spin columns (PCR Cleanup, Qiagen). Fragmented samples were blunt-ended by Klenow enzyme treatment (55 μL eluted DNA, 30 μM dNTP, 0.03 U/μL NEB Klenow enzyme in 70 μL NEB2 buffer), then purified on silica spin columns (PCR Cleanup, Qiagen) and recovered in 55 μL elution buffer. 10 μL of each reaction was separated on a 2% E-gel (Invitrogen) for 25 minutes to visualize the size distribution. Based on the gel picture, either one sample was chosen or the two samples were pooled to give a good representation of the targeted 200±10 bp range. 5 pmol fragmented DNA, 187 pmol each of left (5'-GCAGAATCCGAGGCCGCCT-3' SEQ ID NO.:3 and 5'-GACAAGGCGGCCTCGGATTCTGC-3' SEQ ID NO.:4) and right (5'-AGTGGCGTGTCTTGGATGC-3' SEQ ID NO.:5 and 5'-CGATAACGCATCCAAGACACGCCACT-3' SEQ ID NO.:6) double-stranded adaptors, 5 μL Quick Ligase and 50 μL Quick Ligation buffer (NEB) were incubated at 25°C for 15 minutes in a total volume of 105 μL, then purified on silica spin columns. To produce blunt-ended fragments, 20 μL 5x Phusion buffer HF (Finnzyme) and 2 μL 10 mM dNTP were added to the sample, which was heated to 72°C before 2 units Phusion polymerase (Finnzyme) was added and incubation continued for 5 minutes. The sample was cleaned up on a silica spin column and eluted in 30 μL H₂O. Samples were separated on an 8% non-denaturing PAGE gel run at 250V overnight. The SYBR Gold (Molecular Probes)-stained gel was scanned on a Typhoon 9200 (Amersham Biosciences). A gel piece including the 250±10 bp range was excised using a scalpel, collected in 50 μL of 10 mM Tris pH 8 and incubated for 3 hours at 37°C. To maximize yield and minimize PCR errors, eight amplification reactions were set up from each eluted sample. 0.2 mM dNTP, 400 nM biotinylated primer (5'-biotin-GACAAGGCGGCCTCGGATTCTG-3' SEQ ID NO.:7), 400 nM phosphorylated primer (5'-phosphate-CGATAACGCATCCAAGACACGC-3' SEQ ID NO.:8), 1 μL eluted template and 1 μL Phusion Taq polymerase (Finnzyme) in a total volume of 100 μL in Phusion buffer HF was thermocycled (98°C 10 sec, 72°C 20 sec) for 25 cycles. The reactions were sequentially purified over a single silica spin column (Qiagen PCR Cleanup). To remove primer dimer artefacts, the concentrated eluate was purified from a 2% agarose gel (Qiagen Gel Extraction Kit). From this point on, all procedures were carried out in polyallomer tubes (Beckman) to minimize loss of material due to adsorption. The phosphorylated DNA strand was isolated as follows. 100 μL paramagnetic streptavidin-coated beads (M280, Dynal, Norway) were washed twice in 200 μL B&W buffer (Dynal), then left in 100 μL B&W. 100 μL purified PCR product (having one biotinylated and one phosphorylated strand) was added and left for 20 minutes at room temperature. After two washes in 200 μL B&W and two in 200 μL 10 mM Tris pH 8.0, the phosphorylated strand was eluted in 100 μL 0.1 M NaOH for 3 minutes. The supernatant was transferred to a fresh tube, 25 μL 1.0 M Tris pH 7.5 was added and the sample was cleaned up on a silica spin column. The single-stranded linear DNA was annealed at 0.03 μM to 0.06 μM biotinylated linker (5'-biotin-TGCGTTATCGGACAAGGCGG-3' SEQ ID NO.:9) in 30 μL ligation buffer (Fermentas) by incubation for 2 minutes at 65°C followed by cooling to 25°C over 15 minutes. 70 μL ice-cold DNA ligase in ligase buffer (Fermentas) was added and the mix was incubated at 25°C for one hour. Circular template was purified on 25 μL Dynabeads (M280, Dynal). The beads were first washed twice in 100 μL B&W buffer, then 100 μL B&W buffer and 100 μL ligation product were added, and after 20 minutes the beads were washed twice in 100 μL B&W. Circular DNA was eluted in three fractions (30 μL H₂O, 30 μL 40 mM NaOH, 30 μL H₂O), the fractions were pooled and 5 μL of 1 M Tris-HCl pH 8.0 was added. The final circular DNA library was stored at -20°C.

[0110] Array synthesis. Activated microarray slides (Genorama SAL-I Ultra, Asper Biotechnology) were coated with aminated primer (5'-NH-AAAAAAAAAAGCGTGTCTTGGATGCGTTATCG-3' SEQ ID NO.:10) at 1.0 μM in 100 mM carbonate buffer pH > 9.0, 15% DMSO, 0.01% Triton X-100 by incubation for 50 minutes at 30°C, then blocked in 1% NH₄OH twice for two minutes. Before hybridization, the slide was incubated in SSB (2 x SSC, 0.1% SDS) 2 min at 65°C, 3 min at 50°C, 5 min at 30°C, then rinsed in TWB (2 x SSC, 0.1% Tween-20) followed by MGB (1.5 mM Tris pH 8.0, 10 mM Mg²⁺). The circular template DNA library was then annealed, typically at 1:200 dilution, in SSB 2 min at 65°C, 3 min at 50°C, 10 min at 30°C, followed by wash in SSB 5 min at 30°C, then rinsed in TWB followed by two rinses in MGB. Amplification buffer (1 mM dNTP, 0.1 x BSA, 0.1 u/μL Phi29 polymerase in Phi29 DNA Polymerase Reaction Buffer, both from NEB) was added to the slide, which was incubated at 30°C for 3 hours. The slide was then rinsed in MGB and washed in SSB 2 min at 65°C, 3 min at 50°C, 2 min at 30°C, then rinsed in TWB followed by two rinses in MGB. The slide was finally dried at 30°C for two minutes and was then ready for mounting on the instrument.

[0111] Instrument. An integrated and automated instrument was built as follows. A Nikon TE2000PFS motorized inverted microscope was fitted with a Scan IM motorized stage (Marzhauser, Germany), a Cy3 filter cube (FF562, Semrock, Rochester, NY), a 120W metal halide illumination system (X-Cite 120 PC, EXFO, Canada), an electro-mechanical shutter (Uniblitz VS35 with VCM-D1 controller, Vincent Associates, Rochester, NY) and a monochrome 4 megapixel cooled CCD camera (Spot Xplorer, Diagnostic Instruments, Sterling Heights, MI). All images were acquired with a 20x magnification Nikon PlanFluor LD objective through 1 mm glass slides. A custom flat rectangular flow cell capable of holding two slides was machined in aluminum, black anodized and coated with 4 μm Parylene (Plasma Parylene Coating Service, Germany). The flow cell was permanently fixed on a Peltier module (MPA 250-12, Melcor, Edmonton, Canada) in place of the hot plate. A plastic adapter ring was used to mount the flow cell assembly onto the microscope stage. When a standard 25x75 mm glass slide was held onto the flow cell by vacuum suction (Vacuum Pump System VCS-I, C&L Instruments, Inc, Hershey, PA) between two o-rings, an interior 10x50x0.15 mm chamber was formed with inlet and outlet at either end, inducing laminar flow across the glass surface. The flow cell was connected by tubing to a Tecan MSP9250 autosampler, from which reagents could be aspirated through the flow cell. All parts of the instrument were controlled by a custom software application.

Overview of a sequencing run. Each run was performed with a full set of 582 probes in 96-well plates. Between probe plates, two universal probes (the all-DNA 5'-Cy3-GAATCCGAGGCCGCCTTG-3' SEQ ID NO.:11 and the mixed LNA/DNA 5'-Cy3-NCGAGGN-3') targeting the adaptor sequence and two buffer-only negative controls were hybridized. Universal probes are henceforth denoted 'UNIP'. The whole run was fully automated except that buffers had to be replenished daily.

Hybridization. Each hybridization cycle was performed as follows. 450 μL probe in TMAC buffer (3M TMAC, 50 mM Tris-HCl pH 8.0, 0.4% β-mercapto-ethanol and 0.05% Tween-20) was aspirated from a 96-well microtiter plate into the flow cell held at 45°C. The temperature was briefly raised to 65°C, then adjusted to the desired hybridization temperature (Tm - 33°C), and excess probe was removed by two washes in 450 μL TMAC buffer. After image acquisition, the temperature was raised to 45°C in preparation for the next cycle.

Image acquisition. Before the first imaging cycle, an autofocus routine was performed as follows. A stack of images bracketing the expected focal plane was acquired and the best focus was determined by maximizing the focus criterion [equation not reproduced], where P_ij denotes the pixel value at (i, j). This ensured that the CCD sensor was perfectly in focus at the start of the experiment. It was then kept in focus indefinitely by the Nikon opto-mechanical Perfect Focus System. Images were acquired in a grid with 1.25 mm spacing at 1 second exposure.

Feature extraction. All local maxima (in a 7x7 neighborhood with chipped corners) were extracted in the first UNIP image and a threshold was applied to remove weak features. The threshold was set once per experiment and was verified by visual examination. Only the features extracted from the first UNIP image were then analyzed in subsequent images. Subsequent images were registered onto the first UNIP image by scanning through a range of translations systematically, maximizing the sum-of-products of pixel values for all detected features.
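The local-maximum step described above can be sketched as follows. This is a minimal illustration and not the instrument software: it assumes that "chipped corners" means the four corner pixels of the 7x7 neighborhood are excluded, that a feature must be strictly brighter than every neighbor, and that pixels too close to the image border are skipped. The names chipped_offsets and find_features are hypothetical.

```python
def chipped_offsets(size=7):
    """Offsets of a size x size neighborhood without its four corner pixels
    and without the center (assumed meaning of 'chipped corners')."""
    r = size // 2
    return [(dy, dx)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if (dy, dx) != (0, 0) and not (abs(dy) == r and abs(dx) == r)]


def find_features(image, threshold, size=7):
    """Return (row, col) of pixels that reach the threshold and are strictly
    brighter than all pixels in the chipped neighborhood.

    image is a list of rows of pixel values; border pixels are skipped
    (an assumption made to keep the sketch short).
    """
    offsets = chipped_offsets(size)
    r = size // 2
    height, width = len(image), len(image[0])
    features = []
    for y in range(r, height - r):
        for x in range(r, width - r):
            value = image[y][x]
            if value < threshold:
                continue  # weak features are removed by the threshold
            if all(value > image[y + dy][x + dx] for dy, dx in offsets):
                features.append((y, x))
    return features
```

In the procedure described above this step is applied only to the first UNIP image, and the resulting feature list is reused for all later images.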

Feature quantification and normalization. To allow for a small local image offset, the local maximum pixel value in a 3x3 neighborhood of each feature in each image was taken as its raw value for the corresponding probe. A background value was calculated for each feature and image by taking the second lowest pixel value in the corners of a 15x15 square. To monitor the reduction in maximum signal with time (number of hybridizations), each set of 96 probes was flanked by UNIP and blank hybridizations. The intensity value of each feature in each image was normalized by first removing the background value, then dividing by the interpolated signal of the two flanking UNIP hybridizations.
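The quantification and normalization steps can be sketched as below. Several details are assumptions rather than statements from the text: the background is taken from the four corner pixels of the 15x15 square, the UNIP signal is interpolated linearly between the two flanking UNIP hybridizations, and the UNIP values passed in are already background-subtracted. All function names are hypothetical.

```python
def raw_value(image, y, x):
    """Maximum pixel value in the 3x3 neighborhood centered on (y, x)."""
    return max(image[y + dy][x + dx]
               for dy in (-1, 0, 1)
               for dx in (-1, 0, 1))


def background_value(image, y, x, size=15):
    """Second-lowest of the four corner pixels of a size x size square
    centered on (y, x) (assumed reading of 'corners of a 15x15 square')."""
    r = size // 2
    corners = [image[y - r][x - r], image[y - r][x + r],
               image[y + r][x - r], image[y + r][x + r]]
    return sorted(corners)[1]


def normalized_intensity(raw, background, unip_before, unip_after,
                         position, block_length):
    """Subtract the background, then divide by the UNIP signal interpolated
    linearly between the flanking UNIP hybridizations.

    position runs from 0 (just after the first UNIP) to block_length
    (at the second UNIP); linear interpolation is an assumption.
    """
    fraction = position / block_length
    unip = unip_before + fraction * (unip_after - unip_before)
    return (raw - background) / unip
```

For example, a feature with raw value 60 and background 2, hybridized halfway through a block whose flanking UNIP signals are 100 and 80, would be normalized to 58/90.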

Spectrum alignment. Each extracted feature corresponded to a DNA fragment from the original sample library. The vector of normalized intensity values of each feature across the full set of probes, i.e. the 'hybridization spectrum' of the fragment, was aligned to the repeat-masked reference genome as follows. A window of width equal to the expected fragment length was scanned across the reference sequence. For each window position, the presence or absence of each probe sequence in the window was recorded. An alignment score was calculated as follows:

(where I denotes the normalized intensity of a probe minus the median intensity of the probe across all spectra, n_unique is the total number of distinct probes in the window, the first sum is over the probes present in the window and the second sum is over the probes absent from the window). Thus a probe with a relatively large normalized value would contribute a high score to positions where it was present, and vice versa. The position with the maximum score was reported. The score is admittedly heuristic, but yielded accurate alignments on simulated data with similar sources of errors to those we think are present in real data. To eliminate low-quality reads due to e.g. unresolved overlapping image features or chimeric inserts, only reads whose maximum alignment score was more than 6 standard deviations greater than the mean score along the genome were used. In this way, typically fewer than 10% of all reads were removed.
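The window scan and the 6-standard-deviation filter can be illustrated with the sketch below. The scoring formula itself is not legible in this text, so the form used here, (sum over present probes minus sum over absent probes) divided by n_unique, is an assumption that is merely consistent with the verbal description. The sketch also ignores the reverse strand and repeat masking, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def probes_in_window(window, probe_set, k=5):
    """Distinct probe sequences from probe_set that occur in the window."""
    kmers = {window[i:i + k] for i in range(len(window) - k + 1)}
    return kmers & probe_set


def window_score(centered, present):
    """ASSUMED score: (sum of I over present probes - sum of I over absent
    probes) / n_unique, where centered maps probe -> I (normalized intensity
    minus the probe's median across all spectra)."""
    if not present:
        return 0.0  # assumption: an empty window scores zero
    sum_present = sum(centered[p] for p in present)
    sum_absent = sum(v for p, v in centered.items() if p not in present)
    return (sum_present - sum_absent) / len(present)


def align(centered, reference, fragment_length, k=5):
    """Scan a window across the reference; return (best start, all scores)."""
    probe_set = set(centered)
    scores = []
    for start in range(len(reference) - fragment_length + 1):
        window = reference[start:start + fragment_length]
        present = probes_in_window(window, probe_set, k)
        scores.append(window_score(centered, present))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def passes_filter(scores, n_sd=6.0):
    """Keep a read only if its best score exceeds mean + n_sd * sd."""
    return max(scores) > mean(scores) + n_sd * pstdev(scores)
```

A real implementation would precompute probe presence for every window once per reference rather than once per spectrum.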

Calculating hybridization probabilities. Basecalling was designed to operate on a probabilistic representation of hybridization. For each aligned fragment and probe, the probability of the observed normalized intensity value conditional on the presence or absence of the probe was needed. For each probe, these probability densities are functions of the observed intensity, given by all fragments where the probe did or did not in fact occur. To obtain these distributions, it was assumed that the experimental genome was almost identical to the reference genome, i.e. that the divergence was low. The number of occurrences of a probe in each fragment was taken from the corresponding window in the reference sequence as given by spectrum alignment. Fragments predicted to have more than one occurrence of the probe, or where the number of occurrences could not be determined with confidence, were disregarded. The intensities of the remaining fragments were collected into separate histograms for fragments with and without the probe. After normalization to unit area these histograms were used to directly calculate the required probabilities. To illustrate this, histograms for probe CGCAT are shown in Figure 4a. As was typical of other probes, there was a significant overlap between the distributions, indicating that base calls could not be confidently based on single probes. For the E. coli experiment, the differential melting points for each probe, which depend on the presence of GC ('strong') or AT ('weak') base pairs in the target at positions corresponding to the two degenerate nucleotides, were taken into account by generating separate histograms for the four cases (weak-weak, weak-strong, strong-weak and strong-strong) of flanking nucleotides. In subsequent computations it was more convenient to work with log odds scores, a measure of the odds in favor of the presence of the probe over its absence. The log odds as a function of normalized intensity was taken as the base-10 logarithm of the ratio between positive and negative probabilities; this is again illustrated for probe CGCAT in Figure 4b. These curves were capped at their extremes to minimize errors due to the low number of cases in the tails of the histograms. Note that the zero-crossing of the log-odds curve corresponds to the crossing of the positive and negative histograms in Figure 4a.
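The conversion from match and mismatch histograms to a capped log-odds curve can be sketched as follows. The bin count, the intensity range, the cap of ±2.0 and the small pseudocount that avoids division by zero are all illustrative assumptions; the text does not state the values actually used. The function names are hypothetical.

```python
import math


def unit_area_histogram(values, n_bins, lo, hi):
    """Histogram of values on [lo, hi], normalized to unit area.
    Values outside the range are clamped into the end bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts]


def log_odds_curve(present, absent, n_bins=10, lo=0.0, hi=1.0,
                   cap=2.0, pseudo=1e-6):
    """Base-10 log of the ratio between the 'probe present' and 'probe
    absent' densities per intensity bin, capped at +/- cap.

    cap and pseudo are assumed values chosen for illustration only.
    """
    positive = unit_area_histogram(present, n_bins, lo, hi)
    negative = unit_area_histogram(absent, n_bins, lo, hi)
    curve = []
    for p, n in zip(positive, negative):
        score = math.log10((p + pseudo) / (n + pseudo))
        curve.append(max(-cap, min(cap, score)))
    return curve
```

With equal bin widths the unit-area normalization cancels in the ratio, so the curve depends only on the relative frequencies in each bin.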

Basecalling. To compute the final sequence, a Bayesian model was constructed. Given the reference sequence and a number of aligned fragments, the goal of sequence reconstruction was to find the most likely modification of the reference sequence as indicated by the probe hybridization probabilities. The current algorithm was designed to consider single-nucleotide changes only, but the extension to small indels should be straightforward. Basecalling proceeded nucleotide by nucleotide across the entire reference genome. At each position, the four possible substitutions were considered (one of which would be identical to the reference sequence). For each substitution, typically five overlapping probes would change from present to absent and five from absent to present. There could be more than five in the cases where both strands were probed, as a result of having more than 512 probes. For each probe, we calculated the average of the measured probe intensity for all fragments containing that probe and overlapping the position by at least 20 bases (to guard against slight misalignments). This average intensity was used to calculate the posterior log-odds of a substitution, by taking the sum of log-odds of each probe given by the log-odds distribution for that probe (Fig. 4b), subtracting the log-odds for probes that would disappear as a result of the substitution. This is illustrated for one position and substitution in Figure 4c. Finally a prior probability term P was added to account for the prior expectation of a substitution. To assess accuracy, two kinds of mock substitutions were introduced: SNPs at a rate of 10⁻³ and private mutations at a rate of 10⁻⁴. The only difference between them was that the two common SNP alleles were treated as a priori equally probable (P = 0.0) and more probable than the two rare alleles (P = -1.2), whereas private mutations were treated as completely unknown and thus received the standard bias (P = -1.2). The scheme was designed to mimic the ultimate target application, human genome resequencing. A call was made as 'N' if the quality q (defined as the difference between the maximum posterior odds and the second highest) was less than a predefined q_min, and as 'R' if in a repeat; otherwise the base with maximum posterior odds was called. Coverage was reported as the fraction of non-N non-R bases; similarly, accuracy was reported as the fraction of accurate non-N non-R bases. The free parameters (q_min and P) were adjusted to balance false positive and false negative calls and the overall coverage. In all experiments reported here q_min = 0.7 and P = -1.2, i.e. biased slightly against substitutions.
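The calling rule can be sketched as below, using the reported parameters q_min = 0.7 and P = -1.2. The sketch assumes that the reference base carries a prior of 0.0 and every other base the prior P (the private-mutation case), and it leaves out the 'R' call for repeat positions. The names substitution_log_odds and call_base are hypothetical.

```python
def substitution_log_odds(gained, lost):
    """Evidence for a substitution: summed log-odds of the probes that would
    appear minus the summed log-odds of the probes that would disappear."""
    return sum(gained) - sum(lost)


def call_base(evidence, reference_base, prior=-1.2, q_min=0.7):
    """Return (call, q) for one position.

    evidence maps each base to its log-odds relative to the reference
    (0.0 for the reference base itself). Non-reference bases receive the
    prior term (assumption: the reference base has prior 0.0).
    """
    posterior = {base: score + (0.0 if base == reference_base else prior)
                 for base, score in evidence.items()}
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    q = posterior[ranked[0]] - posterior[ranked[1]]
    call = ranked[0] if q >= q_min else "N"
    return call, q
```

For instance, evidence of 3.0 for C at a position whose reference base is A gives a posterior of 1.8 and a confident C call, whereas evidence of 1.5 gives a posterior of only 0.3 and the position is reported as N.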

EXAMPLE 1

Sequencing Bacteriophage λ

[00112] The 48,502 bp Bacteriophage λ genome was sequenced. An RCA array, prepared from a single DNA sample fragmented to 200 bp, was subjected to serial hybridization with the 582 probes and images were acquired at a single location on the slide, thus forming an image stack. A total of 14,237 features, representing 2.8 Mbp of raw sequence and 60-fold nominal genome coverage, were identified by thresholding in the first image (universal probe) and these were quantified in subsequent images (specific probes). Normalized intensity values for each probe were collected for each feature; this was called the 'hybridization spectrum' of the feature.

[00113] Hybridization spectra were aligned using an algorithm to a composite reference genome comprising yeast chromosome 5 with the lambda genome spliced in at position 7000. The yeast chromosome served to control for and quantify alignment errors. As shown in Fig. 3, 95% of all spectra aligned to the lambda genome, and with higher average alignment scores. Assuming the same rate of false hits to the lambda genome as to the 13 times larger yeast chromosome, 99.9% of alignments to lambda were placed there correctly.

[00114] Given that the number of misaligned fragments and the mutation rate are small, the match/mismatch status for each probe and aligned fragment can be inferred with high confidence from the reference sequence. This was used to obtain match and mismatch intensity histograms for each probe (Fig 4a), which were converted to log-odds curves giving the logarithm of the odds in favor of a probe being 'match' as a function of the observed intensity (Fig. 4b). Thus, an observed intensity can be converted into a probability. In particular, every position in the reference genome can be examined and the probability of the observed intensities of probes corresponding to each possible call at that position can be calculated.

[00115] Typically, a substitution would cause ten probes to change relative to the reference sequence. For example, five probes (AGCTG, GCTGG, CTGGA, TGGAA and GGAAT) would detect the central position in AGCTGGAAT, and these would be replaced by five others (AGCTC, GCTCG, CTCGA, TCGAA and CGAAT) if that position were replaced with a C. For each probe, the odds in favor of its hybridizing can be calculated. In this way, a consensus sequence is computed by calculating a Bayesian posterior probability for each possible call at each position along the genome, based on the log-odds of each overlapping probe (see Methods and Fig. 4c). A quality score q is calculated as the difference between the best and second best calls, and a threshold q_min = 0.7 is applied. 2395 positions (5% of the genome) with q < q_min were called as 'N'.

[00116] In order to estimate the accuracy of the called sequence, mock substitutions (9) were introduced in the reference genome, and the ability of the basecaller to revert these to the original allele was assessed. In this context, the overall basecalling accuracy was 99.94% and 28 of 31 mock substitutions were correctly called. The remaining three were false negative calls; no miscalls were observed. In all, 30 false positive errors were made. Examining the contexts of these errors, we observed several GC-runs. For example, an A was erroneously called at position five of GCGGCGGCGGGG SEQ ID NO.: 12. This may indicate that local strong secondary structures in the target molecule prevented probe hybridization in some cases.
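The ten-probe example given for AGCTGGAAT can be reproduced with a few lines of code. The sketch below enumerates the 5-mers that cover a position before and after a substitution; it considers one strand only and uses 0-based positions, and the function names are hypothetical.

```python
def probes_covering(sequence, position, k=5):
    """All k-mers of sequence that include the base at the 0-based position."""
    first = max(0, position - k + 1)
    last = min(position, len(sequence) - k)
    return [sequence[s:s + k] for s in range(first, last + 1)]


def probe_changes(sequence, position, new_base, k=5):
    """Return (lost, gained): probes matching before and after the
    substitution of new_base at the given position."""
    mutated = sequence[:position] + new_base + sequence[position + 1:]
    lost = probes_covering(sequence, position, k)
    gained = probes_covering(mutated, position, k)
    return lost, gained


lost, gained = probe_changes("AGCTGGAAT", 4, "C")
print(lost)    # ['AGCTG', 'GCTGG', 'CTGGA', 'TGGAA', 'GGAAT']
print(gained)  # ['AGCTC', 'GCTCG', 'CTCGA', 'TCGAA', 'CGAAT']
```

Positions near the ends of a sequence are covered by fewer than five probes, which the bounds in probes_covering account for.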

EXAMPLE 2

Sequencing E. coli

[00117] We then proceeded to resequence the 4.6 Mbp genome of Escherichia coli. A total of 3.3 million image features corresponding to 660 Mbp of raw sequence and 143-fold nominal coverage were collected.
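The raw-sequence and nominal-coverage figures quoted for both genomes follow from simple arithmetic, sketched below. The 200 bp read length is the fragment size stated for the Bacteriophage λ library; applying the same length to the E. coli features is an assumption that happens to reproduce the 660 Mbp figure. The function name is hypothetical.

```python
def nominal_coverage(n_features, fragment_length_bp, genome_size_bp):
    """Return (raw sequence in bp, nominal fold coverage)."""
    raw_bp = n_features * fragment_length_bp
    return raw_bp, raw_bp / genome_size_bp


# Bacteriophage lambda: 14,237 features on a 48,502 bp genome
lambda_raw, lambda_fold = nominal_coverage(14_237, 200, 48_502)

# E. coli: 3.3 million features on a 4.6 Mbp genome
ecoli_raw, ecoli_fold = nominal_coverage(3_300_000, 200, 4_600_000)

print(lambda_raw, round(lambda_fold, 1))  # 2847400 58.7
print(ecoli_raw, round(ecoli_fold))       # 660000000 143
```

The λ result of about 59-fold is what the text rounds to 60-fold nominal coverage.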

[00118] First, the distribution of coverage depth across the genome was examined. The most salient feature of the coverage distribution was a pronounced wave, across the genome, with maximum near the origin of replication and minimum near the terminus (Fig. 5). This probably reflects true intra-genomic differences in DNA content in rapidly growing E. coli cultures, where multiple nested replication forks co-exist inside any given cell, resulting in double or quadruple DNA content near the origin, but single-copy near the terminus. It also demonstrates our ability to detect and quantify copy-number changes in this range (i.e. 2-3 copies), even in the absence of a haploid control genome. Intriguingly, there was a clear difference in coverage between the leading and lagging strands for both replichores. The leading strand of replichore 1 - going clockwise from the origin - showed higher coverage than the lagging strand, which presumably is synthesized more slowly and may have numerous nicks and single-stranded regions at any one instant. The same pattern was evident for replichore 2 - going counter-clockwise from the origin - where the leading strand corresponds to the 'reverse' strand.

[00119] Next, basecalling was performed as above. The resulting assembly covering 82% of the genome was 99.93% accurate overall, and mock substitutions were called with 97% accuracy (3205 out of 3295). Again, the remainder were all false negatives; no miscalls were observed. This demonstrates the utility of the method both for SNP discovery and for calling known SNPs with high accuracy.

[00120] In order to determine the limits of oversampling, excessive coverage was generated. Examining the accuracy as a function of raw sequence depth (Fig. 6a) revealed saturation after about 30-fold nominal coverage. Remaining errors at high coverage were presumably due to systematic sources, such as occasional strong secondary structures or unfavorable combinations of poorly performing probes.

[00121] To provide an easily interpreted quality measure, a phred-equivalent quality score termed Q_phred was constructed. First, an interim score q was calculated as above by taking the difference between the log-odds for the base called and the second most probable base call at each position. This measure should be roughly proportional to the logarithm of the error rate, i.e., q ∝ log P_e, where P_e is the error probability, as confirmed by the scatterplot in Fig. 6b. Next, the constant of proportionality was determined using a linear fit (R² = 0.95), and this was used to convert q to Q_phred (since Q_phred ≡ -10 log₁₀ P_e). The resulting phred-equivalent accuracy scores were then used to summarize the assembly quality (Fig. 6c), showing that 82% of the assembly was Q20 or better, 58% was Q40 or better and 40% was Q50 or better. The average phred score was Q_phred = 47, corresponding to one error in approximately 50,000 bases.
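The relation between phred scores and error rates, and the conversion of the interim score q, can be sketched as follows. The standard definition Q = -10 log10(P) is used; the through-origin least-squares fit is an assumption about how the constant of proportionality might be obtained, since the text only states that a linear fit was used. The function names are hypothetical.

```python
import math


def phred_from_error(p_error):
    """Standard phred definition: Q = -10 * log10(P_error)."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q_phred):
    """Inverse of the phred definition."""
    return 10.0 ** (-q_phred / 10.0)


def fit_proportionality(q_values, log10_errors):
    """Least-squares slope c through the origin for log10(P_error) = c * q
    (assumed fitting procedure)."""
    numerator = sum(q * e for q, e in zip(q_values, log10_errors))
    denominator = sum(q * q for q in q_values)
    return numerator / denominator


def q_to_phred(q, slope):
    """Convert the interim score q to a phred-equivalent score."""
    return -10.0 * slope * q


# Q47 corresponds to roughly one error in 50,000 bases.
print(round(1.0 / error_from_phred(47)))  # 50119
```

The printed value agrees with the statement that an average score of 47 corresponds to about one error in 50,000 bases.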

[00122] The present invention thus provides a rapid and inexpensive genome resequencing method. The invention provides a single molecule display platform based on in situ rolling-circle amplification, and provides, adapted to this platform, a hierarchical genome tiling approach to reveal sequence differences in the context of a reference genome. The method of the invention is suitable for viral and bacterial genomes, and is scalable to larger genomes, such as, ultimately, to whole human genomes.

[00123] The ultimate goal of human genome resequencing requires that throughput and cost be carefully optimized. In developing shotgun SBH, the present inventors also maximized throughput and minimized cost by using simple reagents and instrumentation and a very high degree of multiplexing. Current maximal throughput of the prototype instrument was achieved when using the full imaging surface of 405 images. Cycle times were 11.5 minutes (divided approximately equally between imaging and the fluidic cycle), typically yielding 1.6 Gbp of raw sequence in less than five days, for an overall sequencing speed of 3,800 bp/s. The sequencing chemistry consumed only simple oligonucleotide probes and buffer, and as a consequence, costs were dominated by equipment and plasticware. The reagent cost was $0.32/megabase, which would translate to $960 per human genome at single fold coverage. Including the amortized cost of equipment, the overall cost was $0.5/megabase. By comparison, Shendure (9) reported a speed of 140 bp/s and a cost of $110/megabase in an assembly covering 70% of the E. coli genome, while Margulies (8) achieved a throughput of 1,700 bp/s at a reported (28) cost of $200/megabase of raw sequence when sequencing Mycoplasma genitalium.
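The throughput and cost figures in this paragraph can be cross-checked with a short calculation. The sketch reads the stated cycle time as 11.5 minutes and assumes a human genome of 3,000 megabases, which is the size implied by the $960 figure; the number of hybridization cycles is derived here and is not stated in the text. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def run_days(raw_bp, bp_per_second):
    """Duration of a run, in days, at a given overall sequencing speed."""
    return raw_bp / bp_per_second / SECONDS_PER_DAY


def daily_throughput_bp(bp_per_second):
    """Raw bases produced per day at a given overall sequencing speed."""
    return bp_per_second * SECONDS_PER_DAY


def implied_cycles(raw_bp, bp_per_second, cycle_minutes):
    """Number of hybridization cycles implied by run length and cycle time."""
    return raw_bp / bp_per_second / (cycle_minutes * 60)


def genome_cost(cost_per_megabase, genome_megabases, fold_coverage=1.0):
    """Reagent cost of sequencing a genome at the given fold coverage."""
    return cost_per_megabase * genome_megabases * fold_coverage


print(round(run_days(1.6e9, 3_800), 2))            # 4.87 days
print(daily_throughput_bp(3_800))                  # 328320000 bp per day
print(round(implied_cycles(1.6e9, 3_800, 11.5)))   # 610 cycles
print(round(genome_cost(0.32, 3_000)))             # 960 dollars
```

The derived count of about 610 cycles is in line with 582 probe hybridizations plus the universal-probe and blank hybridizations run between plates.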

EXAMPLE 3

Exemplary Protocol

Input

[0124] Double stranded DNA template.

Template fractionation:

[0125] The restriction enzyme CviJ I* (EURx, Poland) was used, which recognizes 5'-GC-3' and cuts blunt in between. The restriction reactions were prepared as follows:

Table 4

[0126] The cleaved DNA was purified with PCR cleanup kit (Qiagen) according to manufacturer's protocol.

[0127] A fraction was analyzed on a 2% agarose gel to identify the optimal reaction conditions for the specific batch of template and enzyme (see Figure 8, lanes 4 - 8).

[0128] The optimal cleavage reaction was repeated to get a total of 5 μg DNA (Figure 8, lane 1).

Template size selection:

[0129] The DNA was purified on an 8% non-denaturing PAGE gel (40 cm high, 1 mm thick). Each well was loaded with no more than 1 μg of DNA, and a 95-105 bp ladder was included, indicating the region of interest. The ladder consisted of 3 PCR fragments, at 95, 100 and 105 base pairs.

[0130] The gel was stained with SYBR Gold, the results analyzed on a scanner, and the region of interest (95-105 bp) excised and electro-eluted with ElutaTube™ (Fermentas) according to the manufacturer's protocol.

Adaptor ligation:

[0131] One adaptor was used for ligation.

5' GCAGAATGCGCGGCCGCCTTAG 3' SEQ ID NO.:13
3' CGTCTTACGCGCCGGCGGAATC 5' SEQ ID NO.:14

It contained 5' phosphates and an internal Not I site.

[0132] The following ligation mixture was prepared:

Table 5

The mixture was incubated at 25°C for 15 minutes. The reaction was purified using PCR cleanup (Qiagen) according to the manufacturer's protocol. See Figure 9.

Restriction digest Not I:

[00133] The following reaction was prepared:

The reaction was incubated at 37°C for 4 hours or overnight. Samples were purified using PCR cleanup (Qiagen) according to the manufacturer's protocol.

[00134] The purification was repeated with PCR cleanup to remove as much excess adaptor as possible.

Circularization of templates:

[00135] Single stranded circles were prepared by denaturing the samples in the presence of a linker oligonucleotide

5'-CGTCTTACGCGCCGGCGGAATCCGTCTTACGCGCCGGCGGAATC-3' SEQ ID NO.:15.

[00136] Specifically, the reaction was prepared as follows:

The reaction was heated to 93°C for 3 minutes, placed on ice until cold, and spun briefly. 50 μL of 2x Quick ligation buffer (NEB) and 1 μL of Quick ligase (NEB) were added, mixed briefly, and incubated at 25°C for 15 minutes.

[00137] At this stage the circles are formed and the samples can be used for RCA. See Figure 10.

Immobilization:

[0138] μM RCA primer (identical to the circularization linker with an additional 5'-AAAAAAAAAA-C6-NH-3' tail SEQ ID NO.:16, where C6 is a six-carbon linker and NH is an amine group) was immobilized on SAL-I slides (Asper Biotech, Estonia) in 100 mM carbonate buffer pH 9.0 with 15% DMSO, and incubated at 23°C for 10 hours.

[0139] Remaining active sites on the slide surface were blocked by first soaking in 15 mM glutamic acid in carbonate buffer (as above, but 40 mM) at 30⁰C for 40 minutes, lhen soaking in 2 mg/ml polyacrylic acid, pH 8.0 in room temperature for 10 minutes.

[0140] Circular templates were annealed at 3O⁰C in buffer 1 (2xSSC, 0.1%SDS) for 2 hours, then washed in buffer 1 for 20 minutes, then washed in buffer 2 (2xSSC, 0.1% Tween) for 30 minutes, then rinsed in 0.IxSSC, then rinsed in 1.5 mM MgCl₂.

Amplification:

[0141] Rolling-circle amplification was performed for 2 hours in Phi29 buffer, 1 mM dNTP, 0.05 mg/mL BSA and 0.16 u/μL Phi29 enzyme (all from NEB, USA) at 30°C.

[0142] Reporter oligonucleotide complementary to the circularization linker and labeled with 6-FAM was annealed as above, followed by soaking in buffer 3 (5 mM Tris pH 8.0, 3.5 mM MgCl₂, 1.5 mM (NH₄)₂SO₄, 0.01 mM CTAB). Figure 11 shows a small portion of a slide with individual RCA products clearly visible.

Probe panel hybridization:

[0143] Each probe was designed according to the following scheme: (GCAT) (GC/AT) (GC/AT) (G/C/A/T) (GC/AT) (G/C/A/T) (GC/AT), each with locked nucleic acid (Exiqon, Denmark) at positions 2, 4 and 6 and with Eclipse dark quencher (Epoch Biosciences, USA) at the 3' end.

[0144] Probes were hybridized in buffer 3 at 100 nM. A temperature ramp was used for each probe to discover the optimal temperature for match/mismatch discrimination. Figure 5 shows the result of hybridization of two match/mismatch pairs.

EXAMPLE 4

Preparation of Candidate Region Enrichment Fragments

Step 1: Selection of regions for enrichment and probe preparation

[0145] The average candidate region size, based on genome wide association studies in diseases or complex genetic traits, such as Crohn's disease and psoriasis, is about half a megabase (0.5 Mb). All candidate regions associated with the disease can be selected, but in this example, 3 distinct regions from different chromosomes (region H: 453.5 kb, region R: 285.5 kb and region E: 193.6 kb) were selected, which together cover a total of 932.6 kb. In addition, in a separate example, only region E (193.6 kb) was selected to verify the effect of size on the enrichment method of the invention.

[0146] A probe set in this method refers to specific DNA molecules that cover an entire chromosomal region, namely candidate regions resulting from Genizon GWS studies. The source of probes could be either YACs, BACs, cosmids or phages alone or in combination. In this example, BAC molecules are used.

[0147] Candidate regions are scanned for the availability of commercial BAC clones specific to the regions of interest and are ordered as the source material for probe preparation.

[0148] For probe preparation the following steps are performed:

a) BACs are stored at -80°C in LB-Glycerol. With sterile pipette tips or an inoculating loop, the top of the vial is scraped.

b) The inoculum is then streaked on an LB agar plate (Chloramphenicol 12.5 μg/mL) to obtain single colonies.

c) The plate is then placed inverted at 37°C overnight.

d) A single colony is selected from the freshly streaked selective plate and used to inoculate a starter culture of 5 ml LB (Chloramphenicol 12.5 μg/mL).

e) The culture is incubated for 8 h with vigorous shaking (300 rpm) at 37°C.

f) A dilution is performed by taking 0.5-1.0 ml of the starter culture and adding it to 500 ml of selective LB medium (resulting in a 1/500 to 1/1000 dilution).

g) The diluted culture is then incubated at 37°C for 12-16 h with vigorous shaking (~300 rpm). A flask or vessel with a volume of at least 4 times the volume of the culture is preferably used. The culture should reach a cell density of approximately 3-4 x 10⁹ cells per ml.

h) From the 500 ml overnight culture, the BAC-DNA is isolated using a QIAGEN® Large-Construct Kit as described by the manufacturer. Up to 150 μg of BAC-DNA free of bacterial genomic DNA is typically obtained.
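The culture volumes in steps f) and g) imply some simple arithmetic, which the following sketch works through. It is illustrative only; the function names are hypothetical and the calculation assumes the inoculum volume is added on top of the 500 ml of medium.

```python
def dilution_factor(inoculum_ml: float, medium_ml: float) -> float:
    """Final volume divided by inoculum volume (501 means a 1/501 dilution)."""
    if inoculum_ml <= 0 or medium_ml < 0:
        raise ValueError("inoculum must be positive and medium non-negative")
    return (inoculum_ml + medium_ml) / inoculum_ml


def minimum_vessel_ml(culture_ml: float, ratio: float = 4.0) -> float:
    """Vessel volume implied by 'at least 4 times the volume of the culture'."""
    return culture_ml * ratio


def total_cells(culture_ml: float, cells_per_ml: float) -> float:
    """Total cell count at a given density."""
    return culture_ml * cells_per_ml


for inoculum in (0.5, 1.0):
    factor = dilution_factor(inoculum, 500.0)
    print(f"{inoculum} ml into 500 ml: about 1/{factor:.0f}")

print("minimum vessel volume (ml):", minimum_vessel_ml(500.0))
print("cells at 3e9 to 4e9 per ml:",
      total_cells(500.0, 3e9), "to", total_cells(500.0, 4e9))
```

The computed factors (about 1/1000 for 0.5 ml and about 1/500 for 1.0 ml) agree with the range stated in step f).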

Step 2: Genomic DNA preparation

[0149] DNA samples are selected from individuals affected by a particular disease (disease samples) or from unaffected individuals, which are used as controls (control samples). Disease samples represent specific combinations of haplotypes, including risk, neutral, protective and rare haplotypes, and cover all candidate regions of interest.

[0150] In this example, 3 different human genomic DNAs from healthy individuals were used. After standard preparation and purification of genomic DNA, the samples were treated consecutively by bovine pancreatic DNase I and mung bean nuclease. The first enzymatic reaction was used to cause double strand breaks in the DNA in the presence of Mg²⁺, and the second enzymatic reaction produced blunt ended DNA fragments. The average fragment length (~200bp) and genomic DNA concentration were estimated by gel electrophoresis. The resulting fragments were then ready for adaptor ligation. The two different adaptors used in this example are described below and have no base modifications present in their sequence:

Adaptor- 1

5'- GCAGAATCCGAGGCCGCCT-3' oligo name: UA-ADP1-512 SEQ ID NO.:3

5'- GACAAGGCGGCCTCGGATTCTGC-3' oligo name: LA-ADP1-512 SEQ ID NO.:4

Adaptor-2

5'- AGTGGCGTGTCTTGGATGC-3' oligo name: UA-ADP2-512 SEQ ID NO.:5

5'- CGATAACGCATCCAAGACACGCCACT-3' oligo name: LA-ADP2-512 SEQ ID NO.:6

[0151] The adaptors were designed to only ligate at the blunt end of the genomic DNA fragments.

[0152] a) The two adaptors were mixed and added to the ligation reaction in 75 fold excess (37.5 times each) in relation to the template genomic DNA fragments.

[0153] b) After the ligation reaction, the two strands were melted (72°C) and Phusion polymerase (NEB, proofreading polymerase) was used to create blunt and double stranded ends.

[0154] c) The fragments ligated to the adaptors were then separated by electrophoresis on 3.5% Metaphor agarose (Cambrex, Baltimore, MD). The region of interest was excised (fragment target size was ranging from 200 bp to 400 bp) and the DNA was purified using a GFX column (GE Healthcare).

[0155] d) The resulting purified genomic DNA fragments with adaptors (linkered-512 genomic DNA) were quantified by PicoGreen dye and adjusted to a 200 ng/ul concentration.

[0156] The resulting linkered-512 genomic DNA was concentrated by ethanol precipitation and kept for Step 4 (enrichment step).
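The statement that the adaptors ligate only at a blunt end can be illustrated from the oligonucleotide sequences listed above. The sketch below is an illustration rather than part of the protocol: it anneals each upper oligo (UA) to its lower oligo (LA) in silico and reports the unpaired 5' extension of the lower strand. Under this reading, one end of each annealed adaptor is blunt and the other carries a single-stranded 5' tail. The helper names are hypothetical.

```python
# Oligo sequences copied from the text (SEQ ID NOs: 3 to 6), written 5'->3'.
ADAPTORS = {
    "Adaptor-1": ("GCAGAATCCGAGGCCGCCT", "GACAAGGCGGCCTCGGATTCTGC"),
    "Adaptor-2": ("AGTGGCGTGTCTTGGATGC", "CGATAACGCATCCAAGACACGCCACT"),
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def lower_strand_overhang(upper: str, lower: str) -> str:
    """Return the 5' portion of the lower oligo left unpaired by the upper oligo.

    Raises ValueError if the upper oligo does not pair with the 3' end
    of the lower oligo.
    """
    paired_region = reverse_complement(upper)
    if not lower.endswith(paired_region):
        raise ValueError("upper oligo does not pair with the 3' end of the lower oligo")
    return lower[: len(lower) - len(paired_region)]


for name, (upper, lower) in ADAPTORS.items():
    tail = lower_strand_overhang(upper, lower)
    print(f"{name}: {len(upper)} bp duplex, 5' tail on lower strand = {tail}")
```

The tails are later filled in by the polymerase step described in paragraph [0153], which is consistent with the text's description of creating blunt, double stranded ends.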

Step 3: BAC-DNA probe preparation

[0157] The BAC-DNA from step 1 was fragmented by DNase I and biotinylated using a Biotin-Nick translation reaction mix (Roche) using 40 uM Biotin-16-dUTP. An isotope was included in the Nick translation reaction as a tracer to confirm that the biotinylation reaction had proceeded efficiently and to confirm binding of the BAC-DNA to the streptavidin-coated magnetic beads.

[0158] As described for the genomic DNA in step 2, repetitive sequences in the BAC-DNA were removed by blocking with Cot-1 DNA (Invitrogen) resulting in Cot-1-blocked BAC-DNA, which was kept for Step 4 (enrichment step).

Step 4: Enrichment step

[0159] This step comprises two rounds of enrichment. Briefly, the first round enriches target DNA fragments from whole genomic DNA, while the second round enriches for target DNA fragments from the first round by reducing the amount of contaminating fragments. In both enrichment steps, the end products were DNA fragments of ~250 bp. To quantify this enrichment, the resulting fragments were cloned into plasmids and transformed into bacteria. The resulting bacteria were streaked on appropriate LB plates. Independent clones were picked at random and probed for sequences specific to enriched regions. The formula used to calculate enrichment was:

(Size HG / Size CR) x % SS = Level of enrichment

Size HG: size of human genome (kb)

Size CR: size of the candidate region of interest (kb)

% SS: % of sequence specific to enriched region
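The enrichment formula above can be written as a short function. This is a sketch under stated assumptions: the fraction of specific clones is passed as a value between 0 and 1 rather than as a percentage, the human genome size of 3.0 x 10⁶ kb is an approximate figure assumed here (it is not given in the text), and the clone fraction used in the example call is hypothetical. Only the region sizes are taken from the text.

```python
HUMAN_GENOME_KB = 3.0e6  # assumption: approximate haploid human genome size in kb

# Region sizes in kb, from paragraph [0145].
REGIONS_KB = {"H": 453.5, "R": 285.5, "E": 193.6}


def enrichment_level(size_hg_kb: float, size_cr_kb: float,
                     fraction_specific: float) -> float:
    """(Size HG / Size CR) x fraction of clones specific to the enriched region."""
    if size_hg_kb <= 0 or size_cr_kb <= 0:
        raise ValueError("sizes must be positive")
    if not 0.0 <= fraction_specific <= 1.0:
        raise ValueError("fraction_specific must be between 0 and 1")
    return size_hg_kb / size_cr_kb * fraction_specific


total_cr_kb = sum(REGIONS_KB.values())

# Hypothetical input: 25% of randomly picked clones carry target sequence.
example = enrichment_level(HUMAN_GENOME_KB, total_cr_kb, 0.25)
print(f"total candidate region: {total_cr_kb:.1f} kb")
print(f"enrichment for a hypothetical 25% specific clones: {example:.0f}-fold")
```

The ratio Size HG / Size CR is the enrichment that would correspond to 100% of clones being on target, so smaller candidate regions have a higher attainable maximum.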

The table below summarizes the enrichment determination performed in this example:

Table 7

[0160] In experiment B, the conclusion is that 1 in 3 clones will have the target sequence from one of the 3 CR and the features (linkers) necessary for sequencing with the Cantaloupe technology.

First enrichment

[0161] Hybridization of linkered-512-genomic DNA (from step 2) to Cot-1-blocked BAC-DNA (from step 3).

[0162] The linkered 512-genomic DNA (1 ug) was transferred to a 200 ul PCR tube and overlaid with mineral oil.

[0163] The sample was denatured by heating at 95°C for 5 min and incubated at 65°C for 15 min.

[0164] Cot-1-blocked BAC-DNA was added and the hybridization reaction was performed at 65°C for 70 hours.

Binding of the hybridization reaction to streptavidin-coated magnetic beads

[0165] The hybridization mixture was then added to streptavidin-coated magnetic beads (100 ul) at 15-25°C for 30 min.

[0166] The beads were removed using a magnetic separator and the supernatant was discarded.

[0167] The beads were washed at room temperature for 15 minutes in 1 ml of 1X SSC, 0.1% SDS.

[0168] The beads were washed 3 times, each at 65°C for 15 minutes in 1 ml of 0.1X SSC, 0.1% SDS.

[0169] The hybridized linkered 512-genomic DNA-Cot-1-blocked BAC-DNA was eluted from the magnetic beads by the addition of 100 ul of 0.1 M NaOH and incubated at room temperature for 10 minutes.

[0170] The beads were removed using a magnetic separator. The beads contained the Cot-1-blocked BAC-DNA which was biotinylated and remained on the magnetic beads. The supernatant was neutralized with an equal volume of 1 M Tris pH8, and then desalted with Centricon YM-30 columns (Millipore).

[0171] The resulting DNA (linkered 512-genomic DNA) was used as template for the first enrichment and amplification step described below.

First round of amplification

[0172] The amplification reaction contains the Template DNA (linkered 512-genomic DNA) from above.

[0173] The primers used (10 uM each) were:

Forward: 5'-GACAAGGCGGCCTCGGATTCTG-3' SEQ ID NO.:7

Reverse: 5'-CGATAACGCATCCAAGACACGC-3' SEQ ID NO.:8

[0174] The other reagents used:

25mM each dNTPs

5X Phusion reaction Buffer

Phusion Polymerase 1 U

Water up to 50ul

[0175] The amplification program was one denaturing cycle at 98°C (30 sec) followed by 30 cycles of: 10 seconds denaturation at 98°C, 10 seconds of annealing at the primer melting temperature and 20 sec elongation at 72°C.

[0176] The amplification products were purified using a QIAquick PCR purification kit (QIAGEN) and kept as input DNA for a second enrichment step.
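As a cross-check on the first round of amplification, the sketch below compares the two primers with the lower adaptor oligos used in Step 2 and totals the programmed hold times of the cycling program. It is an illustration only; the variable and function names are hypothetical, and the time total excludes temperature ramping, which depends on the instrument.

```python
# Lower adaptor oligos (SEQ ID NOs: 4 and 6) and primers (SEQ ID NOs: 7 and 8),
# all written 5'->3' as in the text.
LA_ADP1 = "GACAAGGCGGCCTCGGATTCTGC"
LA_ADP2 = "CGATAACGCATCCAAGACACGCCACT"
FORWARD_PRIMER = "GACAAGGCGGCCTCGGATTCTG"
REVERSE_PRIMER = "CGATAACGCATCCAAGACACGC"


def is_five_prime_portion(primer: str, oligo: str) -> bool:
    """True if the primer is identical to the 5' end of the given oligo."""
    return oligo.startswith(primer)


def total_hold_seconds(initial_s: int, cycles: int, steps_s: tuple) -> int:
    """Sum of programmed hold times: initial step plus cycles x per-cycle steps."""
    return initial_s + cycles * sum(steps_s)


forward_ok = is_five_prime_portion(FORWARD_PRIMER, LA_ADP1)
reverse_ok = is_five_prime_portion(REVERSE_PRIMER, LA_ADP2)

# 30 s at 98 C, then 30 cycles of 10 s / 10 s / 20 s, as in paragraph [0175].
hold_s = total_hold_seconds(30, 30, (10, 10, 20))

print("forward primer matches 5' end of LA-ADP1-512:", forward_ok)
print("reverse primer matches 5' end of LA-ADP2-512:", reverse_ok)
print("programmed hold time (s):", hold_s)
```

Each primer corresponds to the first 22 nucleotides of one lower adaptor oligo, so only fragments carrying Adaptor-1 at one end and Adaptor-2 at the other would be expected to amplify exponentially.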

Second enrichment

[0177] The second enrichment was performed as described in the first enrichment step with the input DNA being the amplification products from the first enrichment. The second amplification was similar to the first amplification, described in the first enrichment above, with the difference being in the primers used (primers were identical in sequence but with modifications on the 5 '-end):

Forward: 5'-BIOTIN-GACAAGGCGGCCTCGGATTCTG-3' SEQ ID NO.:7

Reverse: 5'-PHO-CGATAACGCATCCAAGACACGC-3' SEQ ID NO.:8

[0178] These modifications (biotinylation and phosphorylation) in the primers were included so as to ensure that the resulting DNA fragments were ready for the preparation (circularization) of input DNA.

EXAMPLE 5

Preparing DNA Templates For Sequencing

Step 1: Single strand production and circularization

[0179] The purpose of this step is to retain only the phosphorylated single strand of the input double stranded target DNA generated in the second amplification step.

[0180] The Dynabeads retained the input double stranded biotinylated and phosphorylated fragments. Incubation with 0.1 M NaOH facilitated the release and isolation of the single stranded fragments of DNA containing the 5'-phosphate group necessary for the circularization step. The biotinylated strand is retained on the Dynabeads and the complementary strand is released in solution and used as input for the circularization step.

[0181] Single stranded circular molecules were formed by denaturing the samples in the presence of the following biotinylated linker oligonucleotide:

5'-BIOTIN-CGTCTTACGCGCCGGCGGAATCCGTCTTACGCGCCGGCGGAATC-3' SEQ ID NO.:15

[0182] The reaction mixture consisted of: Single stranded linear fragments produced in step a (0.3 uM), 0.6 uM of the linker described above, and water up to 50 ul. The reaction mixture was heated to 65° C for 2 minutes, and then cooled down to room temperature (the step took ~15 minutes). Ice cold ligation mix (DNA ligase, 5 U in 1X ligation buffer, Fermentas) was then added to the reaction mixture. The purpose of the addition of the ligase was to join the 3' and 5' ends of the single stranded fragments to permit the formation of circular molecules. For purposes of clarity, the circular molecules were hybridized to the biotinylated linkers to permit the juxtaposition of the 3' and 5' ends of the single stranded fragments. The biotinylated linkers were removed subsequently to obtain purified circular molecules, which were the input template DNA.
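The concentrations in the circularization mixture translate into molar amounts as follows. This is a small worked calculation, not part of the protocol, using the identity that 1 uM equals 1 pmol per ul; the function name is hypothetical.

```python
def picomoles(concentration_um: float, volume_ul: float) -> float:
    """Amount in pmol, using 1 uM = 1 pmol/ul."""
    if concentration_um < 0 or volume_ul < 0:
        raise ValueError("concentration and volume must be non-negative")
    return concentration_um * volume_ul


REACTION_VOLUME_UL = 50.0
fragment_pmol = picomoles(0.3, REACTION_VOLUME_UL)  # single stranded fragments
linker_pmol = picomoles(0.6, REACTION_VOLUME_UL)    # biotinylated linker
linker_to_fragment = linker_pmol / fragment_pmol

print(f"fragments: {fragment_pmol:.1f} pmol")
print(f"linker:    {linker_pmol:.1f} pmol")
print(f"linker-to-fragment molar ratio: {linker_to_fragment:.1f}")
```

The linker is therefore present at a two-fold molar excess over the linear fragments in the 50 ul mixture.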

Step 2: Purification of circularized molecules

[0183] The circularized molecules (annealed to the biotinylated linker from step 2) were then added to Dynabeads.

[0184] The beads were washed and left to dry after the final wash (as described in the manufacturer's instructions).

[0185] The circular molecules were eluted from the beads using 40 mM NaOH.

[0186] The molecules were quantified by real time PCR.

[0187] The pure circular molecules are the template used for the rolling circle amplification steps.

Step 3: Immobilization of Circularized molecules on glass slides

[0188] Asper Biotech Genorama™ SAL, 0.15 or 1 mm slides were used in accordance with the manufacturer's instructions for handling and storage.

Immobilization

[0189] 5 uM RCA primer (identical to the circularization linker with an additional 5'-AAAPAAAAAA-C6-NH-3' tail SEQ ID NO.: 17, where C6 is a six-carbon linker and NH is an amine group) was immobilized on SAL-I slides (Asper Biotech; see oligo used in Diagram A: 5' XAAAAAAAAAAGCGTGTCTTGGATGCGTTATCG 3' RCA-G-RTNG X=NH2-(CH2)6-PO4-Oligo SEQ ID NO.:18) in 100 mM carbonate buffer pH 9.0 with 15% DMSO.

[0190] Samples were incubated at 30°C for 1 hour.

[0191] The remaining active sites on the slide surface were blocked by first soaking in 15 mM glutamic acid in carbonate buffer (as above, but 40 mM) at 30°C for 40 minutes, and then soaking in 2 mg/ml polyacrylic acid, pH 8.0 at room temperature for 10 minutes.

[0192] Circular templates were annealed at 30°C in buffer 1 (2 x SSC, 0.1% SDS) for 2 hours, then washed in buffer 1 for 20 minutes, then washed in buffer 2 (2 x SSC, 0.1% Tween) for 30 minutes, then rinsed in 0.1 x SSC, then rinsed in 1.5 mM MgCl₂.

Diagram A

REFERENCES

The following references are incorporated by reference in their entirety.

1. Sanger, F., Nicklen, S. & Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74, 5463-5467 (1977).

2. Prober, J.M. et al. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238, 336-341 (1987).

3. Luckey, J.A. et al. High speed DNA sequencing by capillary electrophoresis. Nucleic Acids Res. 18, 4417-4421 (1990).

4. Venter, J.C. et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66-74 (2004).

5. A haplotype map of the human genome. Nature 437, 1299-1320 (2005).

6. Klein, R.J. et al. Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308, 385-389 (2005).

7. Maraganore, D.M. et al. High-resolution whole-genome association study of Parkinson disease. Am. J. Hum. Genet. 77, 685-693 (2005).

8. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005).

9. Shendure, J. et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728-1732 (2005).

10. Blazej, R.G., Kumaresan, P. & Mathies, R.A. Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc Natl Acad Sci U S A 103, 7240-7245 (2006).

11. Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18, 630-634 (2000).

12. Ghadessy, F.J., Ong, J.L. & Holliger, P. Directed evolution of polymerase function by compartmentalized self-replication. Proc Natl Acad Sci U S A 98, 4552-4557 (2001).

13. Mitra, R.D. & Church, G.M. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res. 27, e34 (1999).

14. Bing, D.H. et al. in Genetic Identity Conference Proceedings, Seventh International Symposium on Human Identification (1996).

15. Braslavsky, I., Hebert, B., Kartalov, E. & Quake, S.R. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A 100, 3960-3964 (2003).

16. Hyman, E.D. A new method of sequencing DNA. Anal Biochem 174, 423-436 (1988).

17. Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. & Nyren, P. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242, 84-89 (1996).

18. Metzker, M.L. et al. Termination of DNA synthesis by novel 3'-modified-deoxyribonucleoside 5'-triphosphates. Nucleic Acids Res. 22, 4259-4267 (1994).

19. Canard, B. & Sarfati, R.S. DNA polymerase fluorescent substrates with reversible 3'-tags. Gene 148, 1-6 (1994).

20. Hinds, D.A. et al. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 307, 1072-1079 (2005).

21. Drmanac, R., Petrovic, N., Glisin, V. & Crkvenjakov, R. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics 4, 114-128 (1989).

22. Drmanac, S. et al. Accurate sequencing by hybridization for DNA diagnostics and individual genomics. Nat. Biotechnol. 16, 54-58 (1998).

23. Bains, W. & Smith, G.C. A novel method for nucleic acid sequence determination. J. Theor. Biol. 135, 303-307 (1988).

24. Lysov, Y.P., Florent'ev, V.L., Khorlin, A.A., Khrapko, K.R. & Shik, V.V. Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides. A new method. Dokl. Akad. Nauk SSSR 303, 1508-1511 (1988).

25. Lizardi, P.M. et al. Mutation detection and single-molecule counting using isothermal rolling-circle amplification. Nat Genet 19, 225-232 (1998).

26. Koshkin, A.A. et al. LNA (Locked Nucleic Acids): Synthesis of the adenine, cytosine, guanine, 5-methylcytosine, thymine and uracil bicyclonucleoside monomers, oligomerisation, and unprecedented nucleic acid recognition. Tetrahedron 54, 3607-3630 (1998).

27. Ewing, B. & Green, P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 8, 186-194 (1998).

28. Church, G., Shendure, J. & Porreca, G. Sequencing thoroughbreds. Nat. Biotechnol. 24, 139 (2006).

29. Whiteford, N. et al. An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 33, e171 (2005).

Claims

CLAIMS:

1. A nucleic acid sequencing method comprising: hybridizing a panel of labeled probes to an array of DNA molecules, wherein each DNA molecule in the array comprises a fragment of a target sequence to be determined; imaging at least one location of the array during hybridization with each labeled probe in the panel, and identifying hybridization complexes in each image; quantifying the fluorescent intensity of the detected hybridization complexes in each image, to thereby generate a hybridization spectrum for each fragment of the target sequence; determining the positions of the fragments in a reference target sequence based on the fragment's hybridization spectrum; and determining the probable nucleotide at each position in the target sequence based on the hybridization spectra of fragments that overlap each nucleotide position.

2. The method of claim 1, wherein the fragments of the target sequence are from about 20 to about 500 nucleotides in length.

3. The method of claim 1, wherein the fragments of the target sequence are from about 50 to about 250 nucleotides in length.

4. The method of claim 1, wherein the fragments of the target sequence are at least 1 kb in length.

5. The method of any one of claims 1 to 4, wherein the fragments have a substantially uniform size.

6. The method of any one of claims 1 to 5, wherein the array of DNA molecules is prepared by fragmenting the target sequence, converting the fragments into single-stranded circular molecules having the fragment as an insert for sequencing, and amplifying the single-stranded molecules by rolling circle amplification.

7. The method of any one of claims 1 to 6, wherein the array contains from about 100,000 to about 10 million DNA molecules per cm².

8. The method of any one of claims 1 to 7, wherein the imaged location(s) of the array each contain from about 500 to about 20,000 DNA molecules.

9. The method of any one of claims 1 to 8, wherein from about 200 to about 500 locations on the array are imaged during hybridization with each labeled probe in the panel.

10. The method of any one of claims 1 to 9, wherein each imaged location of the array has a surface area of from about 100 μm² to about 10 mm².

11. The method of any one of claims 1 to 10, wherein the array of DNA molecules is randomly immobilized on a solid support.

12. The method of any one of claims 1 to 11, wherein said panel of labeled probes comprises a universal panel of tiling probes and a reporter probe.

13. The method of claim 12, wherein said reporter probe hybridizes to a sequence in all of said DNA molecules.

14. The method of any one of claims 1 to 13, wherein the probes are fluorescently labeled.

15. The method of any one of claims 12 to 14, wherein the tiling probes have an effective specificity of from 3 to 10 bps.

16. The method of any one of claims 12 to 14, wherein the tiling probes have an effective specificity of from 4 to 6 base pairs.

17. The method of any one of claims 12 to 14, wherein the tiling probes have an effective specificity of 5 base pairs.

18. The method of any one of claims 15 to 18, wherein each position in a target sequence or its reverse complement at the position is hybridized by at least one tiling probe in the panel.

19. The method of claim 18, wherein each position in a target sequence or its reverse complement at the position is hybridized by one tiling probe in the panel.

20. The method of claim 17, wherein each tiling probe in the panel contains a pentamer probe sequence with one or two flanking degenerate nucleotides.

21. The method of claim 17, wherein the tiling probes contain either an oligonucleotide hexamer or heptamer, with potentially dimerizing probes prepared as hexamers.

22. The method of claim 20 or 21, wherein the proportion of A and T is increased relative to G and C at degenerate positions to balance the T_m values across the panel.

23. The method of any one of claims 1 to 22, wherein each labeled probe contains locked nucleic acid.

24. The method of claim 22, wherein locked nucleic acid (LNA) is at nucleotide positions 1, 2, 4, 6 and 7 of the heptamer probes, and at positions 1, 2, 4, and 6 or at positions 1, 3, 5 and 6 of the hexamer probes.

25. The method of any one of claims 1 to 24, wherein the average probe T_m of the panel is from about 40 to about 55° C.

26. The method of any one of claims 1 to 24, wherein the average probe T_m of the panel is about 49° C.

27. The method of any one of claims 1 to 26, wherein the average single nucleotide match/mismatch discrimination (ΔT_m) of the panel is at least 10° C.

28. The method of any one of claims 1 to 26, wherein the average single nucleotide match/mismatch discrimination (ΔT_m) of the panel is at least 20° C.

29. The method of claim 27 or 28, wherein fewer than 5% of the probes in the panel have a T_m of less than 20° C or a single nucleotide match/mismatch discrimination (ΔT_m) of less than 10° C.

30. The method of any one of claims 1 to 29, wherein the panel contains fewer than 900 tiling probes.

31. The method of any one of claims 1 to 29, wherein the panel contains fewer than 600 tiling probes.

32. The method of any one of claims 17 to 31, wherein the tiling probes have the formula 5'- Label -NXXXXXN-3', wherein X is a specified base and N is a degenerate position, with the proviso that heptamer probes having a propensity to dimerize are constructed as hexamers having a single degenerate position.

33. The method of any one of claims 1 to 32, wherein the tiling probes are constructed from the set of oligonucleotides shown in Table 2 or Table 3.

34. The method of any one of claims 1 to 33, wherein hybridization is conducted in the presence of a high salt buffer.

35. The method of any one of claims 1 to 34, wherein hybridization is conducted in the presence of tetramethylammonium chloride (TMAC).

36. The method of any one of claims 1 to 35, further comprising, imaging the at least one location of the array in the presence of hybridization buffer only as a blank.

37. The method of any one of claims 1 to 36, wherein signal intensity values for the hybridization complexes are normalized against signal intensity values upon hybridization of a reporter probe, said reporter probe hybridizing to a sequence in all of said DNA samples.

38. The method of any one of claims 1 to 37, wherein signal intensity is quantified using a maximum pixel value for each detected hybridization complex as a raw value, and subtracting a background value.

39. The method of any one of claims 1 to 38, wherein hybridization complexes are identified in images taken during hybridization with a reporter probe, hybridization complexes at corresponding positions being examined in subsequent images taken during hybridization with each tiling probe in the panel.

40. The method of any one of claims 1 to 39, wherein fragments are positioned in the reference target sequence by calculating an alignment score at each position in the reference sequence, said position having a size equal to the predicted size of said fragment, and selecting the position with the maximum score.

41. The method of claim 40, wherein the alignment score is calculated according to the formula:

wherein I is the normalized intensity of a probe minus the median intensity of the probe across all spectra, n_unique is the total number of distinct probes in a window, and wherein the first sum is over the probes present in the window and the second sum is over the probes absent from the window.

42. The method of any one of claims 1 to 41, wherein the hybridization spectrum for each fragment contains a probability of hybridization for each probe.

43. The method of claim 42, wherein the probability is a log-odds score.

44. The method of any one of claims 1 to 43, wherein determining the probable nucleotide at each position in the target sequence comprises calculating average probe hybridization intensities for the fragments expected to contain each probe sequence.

45. The method of claim 44, further comprising, calculating the posterior log-odds of a substitution.

46. The method of any one of claims 1 to 45, wherein the target sequence is from about 5,000 to about 10 million base pairs in length.

47. The method of any one of claims 1 to 45, wherein the target sequence is from about 20,000 to about 1 million base pairs in length.

48. The method of any one of claims 1 to 47, wherein the target sequence is a viral or bacterial genome.

49. The method of any one of claims 1 to 47, wherein the target sequence is a mammalian genome.

50. The method of any one of claims 1 to 49, wherein the target sequence is a human genome.

51. The method of any one of claims 1 to 47, wherein the target sequence is a non-human animal genome.

52. The method of any one of claims 1 to 51, wherein the target sequence is enriched from genomic DNA.

53. The method of any one of claims 1 to 52, wherein the target sequence is haploid.

54. A panel of labeled probes comprising a set of tiling probes having an effective specificity of from 3 to 10 bps, and a reporter probe, wherein the set of tiling probes hybridizes once to each position in a target sequence or its reverse complement.

55. The panel of claim 54, wherein the probes are fluorescently labeled.

56. The panel of claim 54 or 55, wherein the tiling probes have an effective specificity of from 4 to 6 base pairs.

57. The panel of claim 54 or 55, wherein the tiling probes have an effective specificity of 5 base pairs.

58. The panel of claim 57, wherein each tiling probe in the panel contains a pentamer probe sequence with one or two flanking degenerate nucleotides.

59. The panel of claim 57, wherein the tiling probes contain either an oligonucleotide hexamer or heptamer, with potentially dimerizing probes prepared as hexamers.

60. The panel of claim 58 or 59, wherein the proportion of A and T is increased relative to G and C at degenerate positions to balance the T_m values across the panel.

61. The panel of any one of claims 54 to 60, wherein each labeled probe contains locked nucleic acid.

62. The panel of claim 61, wherein locked nucleic acid (LNA) is at nucleotide positions 1, 2, 4, 6 and 7 of the heptamer probes, and at positions 1, 2, 4, and 6 or at positions 1, 3, 5 and 6 of the hexamer probes.

63. The panel of any one of claims 54 to 62, wherein the average probe T_m of the panel is from about 40 to about 55° C.

64. The panel of any one of claims 54 to 62, wherein the average probe T_m of the panel is about 49° C.

65. The panel of any one of claims 54 to 64, wherein the average single nucleotide match/mismatch discrimination (ΔT_m) of the panel is at least 10° C.

66. The panel of any one of claims 54 to 64, wherein the average single nucleotide match/mismatch discrimination (ΔT_m) of the panel is at least 20° C.

67. The panel of claim 65 or 66, wherein fewer than 5% of the probes in the panel have a T_m of less than 20° C or a single nucleotide match/mismatch discrimination (ΔT_m) of less than 10° C.

68. The panel of any one of claims 54 to 67, wherein the panel contains fewer than 600 tiling probes.

69. The panel of any one of claims 54 to 68, wherein the tiling probes have the formula 5'- Label -NXXXXXN-3', wherein X is a specified base and N is a degenerate position, with the proviso that heptamer probes having a propensity to dimerize are constructed as hexamers having a single degenerate position.

70. The panel of any one of claims 54 to 68, wherein the tiling probes are constructed from the set of oligonucleotides shown in Table 2 or Table 3.

71. A kit comprising the panel of any one of claims 54 to 70, and at least one reagent for performing the method of claim 1.

72. The kit of claim 71, wherein each probe in the panel is contained in a separate vial in a concentrated, diluted, or lyophilized form.

73. The kit of claim 71 or 72, wherein said at least one reagent includes a polymerase suitable for rolling circle amplification.

74. The kit of claim 73, wherein the polymerase is Phi29.

75. The kit of any one of claims 71 to 74, wherein said at least one reagent comprises hybridization buffer.

76. The kit of claim 75, wherein the hybridization buffer comprises TMAC.

77. The kit of any one of claims 71 to 76, further comprising a solid support for arraying DNA molecules.

78. The kit of any one of claims 71 to 77, further comprising a primer for rolling circle amplification.

79. The kit of claim 78, wherein the primer is arrayed to a solid support at a high density.

80. An apparatus for performing the method of any one of claims 1 to 53.

81. The apparatus of claim 80, wherein the apparatus comprises a reaction chamber, a reagent distribution system, a scanner, and a detector.

82. The apparatus of claim 80 or 81, wherein the reaction chamber has thermal control.

83. The apparatus of any one of claims 80 to 82, wherein the apparatus is operationally coupled to a work station programmed to perform sequence analysis.

84. The apparatus of claim 83, wherein the sequence analysis comprises one or more of calculating probabilities of probe hybridization, calculating spectral alignment of fragments with one or more positions in a reference sequence, and calculating posterior log-odds of substitutions.

85. The apparatus of any of claims 81 to 84, wherein the detector is a CCD imager.

86. The apparatus of claim 81, wherein the reaction chamber provides easy access for optics, an inlet for injecting and removing reagents from the reaction chamber, and an outlet to allow air and reagents to enter and exit the chamber.

87. The apparatus of any of claims 81 to 86, wherein the reagent distribution system comprises a pump and reagent flasks to supply reagents according to a fixed protocol.

88. The apparatus of claim 87, wherein a computer controls both the pump and the scanner, alternating between reaction and scanning.

89. The apparatus of any one of claims 81 to 87, wherein the reaction chamber is placed on a positioning stage to permit imaging of multiple locations on the chamber.

90. The apparatus of any one of claims 81 to 89, wherein the reagent distribution system comprises a dispenser unit connected to a motorized valve to direct the flow of reagents.

91. The apparatus of any of claims 81 to 89, wherein the detector comprises an imaging component able to resolve templates at a density of at least 1000/cm².

92. The apparatus of claim 91, wherein the imaging component comprises a system or device selected from the group consisting of photomultiplier tubes, photodiodes, charge-coupled devices, CMOS imaging chips, near-field scanning microscopes, far-field confocal microscopes, wide-field epi-illumination microscopes and total internal reflection microscopes.

93. The apparatus of any one of claims 81 to 92, wherein the reaction chamber is a closed structure comprising a transparent surface, a lid, and ports for attaching the reaction chamber to the reagent distribution system.

94. The apparatus of claim 93, wherein the transparent surface holds template molecules on its inner surface and the imaging component is able to image through the transparent surface.

95. The apparatus of any one of claims 81 to 94, comprising at least two reaction chambers, alternating between imaging and reacting.