WO1997013868A1

WO1997013868A1 - Large scale DNA sequencing by position sensitive hybridization

Info

Publication number: WO1997013868A1
Application number: PCT/US1996/016269
Authority: WO
Inventors: Leonard Adleman
Original assignee: Leonard Adleman
Priority date: 1995-10-11
Filing date: 1996-10-11
Publication date: 1997-04-17
Also published as: AU7437996A

Abstract

One aspect of the invention relates to a method for determining the nucleic acid sequence of a polynucleotide. The invented method includes steps for cleaving end-labeled polynucleotides substantially randomly and then hybridizing the cleaved fragments to a plurality of immobilized oligonucleotide probes. Positional information obtained from the hybridization procedure is then used to reconstruct the sequence of the polynucleotide.

Description

LARGE SCALE DNA SEQUENCING BY POSITION SENSITIVE HYBRIDIZATION

Field of the Invention

The present invention relates generally to methods of sequencing DNA. More specifically, the invention relates to methods of sequencing DNA by position sensitive hybridization of oligonucleotides having known sequences.

Background of the Invention

The human genome project - to sequence the approximately 3 billion base pair human genome - has focused attention on the need for efficient methods of sequencing long strands of DNA. Among the methods currently available are:

(1) Sanger method (Proc. Natl. Acad. Sci. U.S.A. 74:5463 (1977)). With this method multiple copies of the target DNA are used as templates for a polymerization reaction that employs deoxynucleotides and dideoxynucleotides (chain-terminating-nucleotides). The resulting strands of various lengths are electrophoretically separated and then visualized to reveal the sequence. A single such procedure is commonly used to sequence target strands of several hundred nucleotides. In some cases target strands of over a kilobase have been sequenced successfully (Slighton et al. Anal. Biochem. 192:4441 (1992)).

(2) Maxam-Gilbert method (Maxam et al. Proc. Natl. Acad. Sci. U.S.A. 74:560 (1977)). Multiple copies of the labeled target DNA are randomly cut by chemical means. The resulting fragments are run on a gel to reveal the sequence. A single such procedure can be used to sequence a target strand of several hundred nucleotides in length.

(3) Sequencing by Hybridization (SBH) (Lysov et al. Dokl. Akad. Nauk SSSR 303:1508 (1988); Bains et al. J. Theor. Biol. 135:303 (1988); Drmanac et al. Genomics 4:114 (1989); Khrapko et al. FEBS Letters 256:118 (1989); Pevzner, J. Biomol. Struct. Dyn. 7:63 (1989); Lipshutz, J. Biomol. Struct. Dyn. 11:637 (1993); Fodor et al. Science 251:767 (1991); Pease et al. Proc. Natl. Acad. Sci. U.S.A. 91:5022 (1994); Southern et al. Genomics 13:1008 (1992); Drmanac et al. Science 260:1649 (1993)). According to the SBH procedure, copies of the target DNA are simultaneously hybridized with a large library of short oligonucleotides to reveal which short subsequences occur in the target. The information so obtained is then processed on a computer to reconstruct the sequence of the original target. A single such procedure can be used to sequence a target strand of several hundred nucleotides in length (Drmanac et al. Science 260:1649 (1993)).

Summary of the Invention

According to one aspect of the invention there is provided a method of determining the nucleotide sequence of a polynucleotide molecule. Essential steps used in the practice of the invention involve first obtaining polynucleotide molecules to be sequenced as a substantially homogeneous population, each of said molecules having a 5' end and a 3' end, said polynucleotide molecules having a label at at least one of said ends subject to the provision that if both ends are labeled then each of said ends has a different label. Next, the polynucleotide molecules are cleaved at substantially random positions to produce cleaved fragments, and the cleaved fragments are then contacted with a plurality of oligonucleotide probes in a hybridization reaction, where each of the plurality of oligonucleotide probes has a nucleotide sequence that is known. After the hybridization reaction, the intensity of hybridization between the cleaved fragments and each of the plurality of oligonucleotide probes is measured. The intensity of the hybridization provides positional information about the occurrence of a plurality of sequences along a length of the polynucleotide. Finally, the invented method involves constructing with a computer a contiguous nucleotide stretch represented by the plurality of oligonucleotide probes hybridized to the cleaved fragments in the contacting step and constrained by the positional information obtained in the measuring step, thereby determining the nucleotide sequence of the polynucleotide. In a preferred embodiment the label at at least one of the ends of the polynucleotide is covalently attached to the polynucleotide. In another preferred embodiment the labeled oligonucleotide hybridized to the polynucleotide indirectly end-labels the polynucleotide. Optimally, the polynucleotide molecules that are to be sequenced are cleaved an average of once per molecule in the cleaving step. 
In yet another preferred embodiment, the plurality of oligonucleotide probes in the contacting step are immobilized to a solid support, and each of the plurality of oligonucleotide probes contacted in the contacting step has a length of from 6 to 20 nucleotides. The oligonucleotide probes used in the practice of the invented method may be DNA probes or PNA probes. In another preferred embodiment of the invented method the measuring step comprises detecting the label. A higher intensity of hybridization measured in the measuring step indicates that one of the plurality of oligonucleotide probes is nearer to the label on the polynucleotide compared to a different one of the plurality of oligonucleotide probes having a lower intensity of hybridization. In yet another preferred embodiment of the invented method the label used to end label the polynucleotide is optically detectable, and can be a fluorescent label. In yet another embodiment of the invented method, the obtaining step may also involve cleaving a circular DNA molecule with a restriction endonuclease to produce a linear DNA molecule. This cleaving step may involve an enzyme. This enzyme may, for example, be selected from the group consisting of S1 nuclease, mung bean nuclease and deoxyribonuclease. In different embodiments, the cleaving step may involve mechanically shearing the molecules or contacting the polynucleotide molecules with a chemical agent. In still another embodiment, the constructing step further comprises the use of computer programs for minimizing errors of the type selected from the group consisting of false negative errors, false positive errors, non-uniqueness errors and false position errors.

According to another aspect of the invention there is provided a method of constructing a polynucleotide sequence of a nucleic acid molecule from information about the approximate positional occurrence of a plurality of subsequences of the molecule having known sequences along a length of the molecule. A first step in the method involves processing with a computer the approximate positional occurrence information for the plurality of subsequences along the length of the molecule to create a representation of the approximate position of the subsequences. A next step involves refining the representation repeatedly to combine pairs of partially overlapping subsequences in the polynucleotide molecule to result in a plurality of combined subsequences. After the representation has been refined, positional information for the plurality of combined subsequences is established from positional information about said partially overlapping subsequences. Finally, the refining step and the establishing step are repeated until no more subsequences are combined, thereby constructing the polynucleotide sequence along the length of the nucleic acid molecule. In a preferred embodiment, the processing step comprises associating each subsequence among said plurality of subsequences with a node in a graph. The node can be identified by a nucleotide sequence, an approximate position and a standard deviation of a calculatable error in the approximate position. The standard deviation of the calculatable error in the approximate position can be determined experimentally. In another preferred embodiment the calculatable error in the approximate position is derived from a binomial distribution, but can also be derived from a non-uniform probability distribution or a probability distribution having unbounded range. In still another embodiment, a plurality of the nodes are connected by directed edges in the refining step. 
In still another embodiment the establishing step comprises calculating a mean value for the positions of the partially overlapping subsequences.
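The refining and establishing steps summarized above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the `Node` class, the `overlaps` and `combine` helpers, the choice of an overlap of all but one letter, and the rule for the combined standard deviation (which assumes independent errors) are all hypothetical. Only the idea of a node carrying a sequence, an approximate position and a standard deviation, and of taking a mean of the positions, comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Node:
    seq: str      # known subsequence
    pos: float    # approximate 1-based start position on the molecule
    sigma: float  # standard deviation of the error in pos

def overlaps(a, b):
    """True when the tail of a.seq matches the head of b.seq (all but one letter)."""
    m = min(len(a.seq), len(b.seq)) - 1
    return m > 0 and a.seq[-m:] == b.seq[:m]

def combine(a, b):
    """Merge two partially overlapping nodes into one combined node."""
    m = min(len(a.seq), len(b.seq)) - 1
    shift = len(a.seq) - m                       # offset of b's start relative to a's start
    # Mean of the two independent estimates of where the combined sequence starts.
    merged_pos = (a.pos + (b.pos - shift)) / 2
    # Assumption: independent errors, so the error of the mean shrinks accordingly.
    merged_sigma = (a.sigma ** 2 + b.sigma ** 2) ** 0.5 / 2
    return Node(a.seq + b.seq[m:], merged_pos, merged_sigma)

a = Node("ACGTACGT", 10.0, 2.0)
b = Node("CGTACGTT", 12.0, 2.0)
merged = combine(a, b) if overlaps(a, b) else None
```

In this sketch the two 8-mers share seven letters, so the combined node holds a 9-mer whose position estimate is the mean of the two shifted estimates.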

Detailed Description of the Invention

Overview of the Invention

As disclosed below, an improved method of DNA sequencing, referred to herein as Position Sensitive Hybridization (PSH), has been developed. This method promises to be orders of magnitude faster and less expensive than other sequencing methods that are currently available.

The concept underlying the invented method is easily described with a simplified example. Consider that we have a 1000-mer strand of DNA that is to be sequenced. Assume that PCR primers for the strand are known so that multiple copies of the (non-duplex) strand can be obtained. Next, assume that each strand is end-labeled so that it has a fluorescent red 5' end and a fluorescent blue 3' end. Further assume that the sequence AAAAAAAA occurs once along the length of the strand. Now cut each strand once at a random position (each position with equal probability) to produce two fragments, each having one labeled terminus. Each such cut will produce one 'red fragment' (the 5' end) and one 'blue fragment' (the 3'-end). Since by assumption the cuts are in random positions, red fragments and blue fragments of all possible lengths will form with equal probability. Next, hybridize all fragments with TTTTTTTT probes immobilized to a solid support such as a bead or a silicon chip. Then wash away non-hybridized labeled fragments and measure the intensity of the red and blue fluorescence on the solid support at a position corresponding to the immobilized probe having the known polynucleotide sequence. Consider three possibilities:

(1) If AAAAAAAA is near the 5' end of the strand, then a random cut is unlikely to disconnect it from the red 5' end. So a red fragment containing the AAAAAAAA will likely be formed together with a blue fragment not containing the AAAAAAAA sequence (AAAAAAAA can't be on both fragments, since by assumption there is only one occurrence on the strand - for now ignore the possibility that the cut actually occurs in the AAAAAAAA region). Hence, after hybridization many red fragments and few blue fragments will anneal to the TTTTTTTT probe immobilized to the solid support. Consequently, a high red intensity and low blue intensity will be detected.

(2) If AAAAAAAA is near the 3' end of the strand, then the final result will be a high blue intensity and low red intensity annealed to the TTTTTTTT probe.

(3) If AAAAAAAA is near the center of the strand, then after a random cut the AAAAAAAA is about equally likely to be attached to the red fragment as to the blue fragment. Hence, the result will be that the blue intensity and red intensity will be about equal.

In fact, the relative intensities of the red and blue signals determine the exact position of the AAAAAAAA in the target strand. Further, the condition that the subsequence AAAAAAAA occurs exactly once in the strand is not necessary. If the subsequence occurs exactly twice, then knowing the intensity of the blue and red signals will still be sufficient to determine the exact position of both occurrences. If the subsequence occurs more than twice, then the blue and red signal intensities will be sufficient to determine the exact positions of the first and last occurrences - but will provide no information about the positions of any 'nested' occurrences located between the first and the last.

Of course, the process can be performed not only for AAAAAAAA, but for all possible 8-mer oligonucleotides at once. Indeed, all 65,536 8-mer oligonucleotides reportedly have been synthesized on a single 1.28 x 1.28 cm silicon chip (Lipshutz et al. Current Opinion in Structural Biology 4:376 (1994)). Accordingly, when used in connection with the invented method, a single procedure could be used to ascertain the positions of virtually all 8-mer subsequences along the length of a long target DNA molecule. The mathematical analysis and experimental data presented below suggest that strands of tens of thousands of nucleotides in length could be sequenced during a single such procedure.
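The three cases above can be checked numerically. The following Python sketch is only an illustration under the idealized assumptions of the example (exactly one cut per strand, every cut position equally likely, a single occurrence of the subsequence); the function name and its parameters are hypothetical. Cuts that fall inside the subsequence are counted for neither color.

```python
def red_blue_fractions(strand_len, start, probe_len=8):
    """Fractions of single cuts that leave the subsequence intact on the
    red (5') fragment or on the blue (3') fragment.

    Cut c (1 <= c <= strand_len - 1) falls between nucleotides c and c + 1.
    `start` is the 1-based position of the first nucleotide of the subsequence.
    """
    end = start + probe_len - 1
    cuts = strand_len - 1
    red = sum(1 for c in range(1, strand_len) if c >= end)     # subsequence stays on 5' fragment
    blue = sum(1 for c in range(1, strand_len) if c < start)   # subsequence stays on 3' fragment
    return red / cuts, blue / cuts

near_5_prime = red_blue_fractions(1000, 1)    # mostly red, no blue
near_3_prime = red_blue_fractions(1000, 993)  # no red, mostly blue
centered = red_blue_fractions(1000, 497)      # red and blue equal
```

For a 1000-mer, a subsequence starting at position 1 is retained only on red fragments, one starting at position 993 only on blue fragments, and one starting at position 497 on both in equal proportion.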

Detailed Description of the Preferred Embodiment

Position Sensitive Hybridization

Although alternative embodiments will suggest themselves to those having ordinary skill in the art, a preferred method for practicing PSH is now presented. For definiteness, one possible protocol is described below. It will be assumed that the 20-mer 3' and 5'-end sequences of the target strand are known.

(1) Obtain multiple non-duplex copies of the target molecule to be sequenced. For example, perform PCR with one biotinylated primer and one non-biotinylated primer, incubate the resulting duplex DNA with streptavidin coated beads, heat to dissociate the duplexed strands and retain the liquid phase containing the desired strands. Alternatively, duplexed DNA can be employed as a template for PCR using only a single primer. It is also possible to run PSH with duplexed DNA. However, this latter approach would likely be subject to greater error.

(2) Prepare an excess of ('red') labeled 20-mer oligonucleotides complementary to the 5' end of the target strand and an excess of ('blue') labeled 20-mer oligonucleotides complementary to the 3'-end of the target strand. The labels (e.g. fluorescent or radioactive) on these strands should be sufficiently distinct to allow for independent detection of the resulting hybridization signals. Anneal these oligonucleotides to the target molecules in order to indirectly end-label the polynucleotide that is being sequenced. Alternatively, the single strands produced in (1) can be chemically labeled or created with labeled primers so that the label is covalently attached to the polynucleotide that is being sequenced. It is also possible to use a single label and run the procedure twice; once with the positive strand and once with the negative strand of a double stranded DNA molecule.

(3) Using a nonspecific method of cleaving the DNA strand, such as an endonuclease, a chemical or a mechanical method, make random cuts in the target molecules. In practice, such procedures would not be expected to produce only single cuts in each strand. Instead, the number of cuts per strand would likely follow a Poisson distribution. However, as long as the mean number of cuts per strand is near 1, the population of red fragments, blue fragments and colorless fragments should be adequate for the final analysis.

(4) As in SBH, hybridize the fragments to a large number of oligonucleotides having known sequences. The lengths of the oligonucleotides useful in the hybridization step are expected to be in the range of from 6 to 20 nucleotides. The hybridization step may be done, for example, by hybridization of the fragments with oligonucleotides immobilized to a solid support (e.g. a silicon chip) having separate loci for each of 65,536 8-mer oligonucleotides. The temperature, solvent and salt conditions required to obtain optimal results will vary from system to system, but are readily ascertainable by routine experimentation.

(5) For each oligonucleotide locus, measure (e.g. with a fluorometer or phosphorimager) and record the red fluorescence intensity and the blue fluorescence intensity representing the two labels.

(6) Using a computer, reconstruct the sequence of the target molecule from the data obtained. This computation is tractable for strands of considerable length.
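Step (3) notes that the number of cuts per strand would likely be Poisson distributed. As a small illustration of what a mean of one cut per strand implies, the following Python sketch computes the share of strands left uncut, cut exactly once, and cut two or more times. The function names are hypothetical, and the Poisson model itself is the idealization suggested in the text rather than a measured result.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k cuts when the number of cuts is Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def cut_distribution(mean_cuts=1.0):
    """Shares of strands that are uncut, cut once, or cut two or more times."""
    p0 = poisson_pmf(0, mean_cuts)
    p1 = poisson_pmf(1, mean_cuts)
    return {"uncut": p0, "one_cut": p1, "two_or_more": 1.0 - p0 - p1}

shares = cut_distribution(1.0)
```

With a mean of 1, about 37% of strands are uncut, about 37% are cut once, and about 26% are cut two or more times; the latter group is the source of the colorless internal fragments mentioned in step (3).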

In addition to methods for sequencing linear targets, the PSH technique can also be used to sequence circular targets that are single stranded. In this latter embodiment of the invented method, single stranded DNA targets produced using M13-based cloning vectors are isolated and then annealed to a linear labeled oligonucleotide that is complementary to a region of the vector genome. Preferably, the oligonucleotide is labeled at one or both ends, subject to the provision that if both ends are labeled, then the two labels must be different so as to be independently detectable. The annealing step results in a target that is substantially single stranded, but that has a short double-stranded region corresponding to the target-oligonucleotide duplex. The oligonucleotide should contain a polynucleotide sequence recognized by a restriction endonuclease that preferentially cleaves long DNA strands rarely. One example of an infrequently cleaving restriction endonuclease is NotI, an enzyme having an eight base pair recognition sequence. Additionally, the oligonucleotide should be sufficiently long that, following annealing to the target, and restriction endonuclease cleavage, a linear molecule is formed which is "indirectly labeled" at each end by the cleaved halves of the oligonucleotide that remain annealed to the target strand. The indirectly end labeled target can then be processed by random cleavage and hybridization to immobilized oligonucleotide probes according to the invented method.

Closed circular molecules that are double stranded, such as plasmids or cosmids, can also be used as starting targets for the invented PSH sequencing technique. In the practice of the invented method, the circular molecules must be linearized and end-labeled before hybridization to immobilized oligonucleotide probes. Both plasmid and cosmid vectors will typically contain restriction endonuclease cleavage sites that can be used to linearize the circular targets. Restriction enzyme cleavage sites that occur rarely, including those consisting of eight or more nucleotides, are particularly useful for linearizing cosmids or plasmids because those restriction sites are unlikely to occur more than once per molecule. Linearized plasmids or cosmids can then be end labeled, directly or indirectly, using conventional techniques and hybridized to immobilized oligonucleotide probes.

Analysis

For the purposes of mathematical analysis, PSH will first be considered in an error free setting. Error handling will be addressed subsequently. Though PSH can be performed with probes of various lengths, it will be convenient to consider PSH using all 65,536 possible 8-mer oligonucleotides as probes. However, as will be apparent from the following disclosure, the method is easily adaptable to probes of other lengths.

To begin we will indicate why PSH will determine the positions of the first and last occurrences of 8-mer subsequences occurring in the target strand. For the sake of clarity we will ignore the fact that our subsequences are of length 8 and treat the problem abstractly. Let the target strand $T = s_1 s_2 \ldots s_n$ be a string over an alphabet with 65,536 symbols (i.e., one for each 8-mer). For $i = 1, 2, \ldots, n$, $T$ may be cut after the $i$th symbol to yield a red fragment $s_1 s_2 \ldots s_i$ and a blue fragment $s_{i+1} s_{i+2} \ldots s_n$. Assume that many copies of $T$ are made and each is cut once, with cuts at each position $i = 1, 2, \ldots, n$ occurring an equal number of times. For each symbol $s$, an aliquot of the fragments is presented to locus $L_s$. Hence there exists a positive integer $w$ such that each locus receives $w$ red fragments and $w$ blue fragments. After washing, $L_s$ retains only those fragments containing the symbol $s$.

Now, let a symbol $s$ be fixed and assume that the first occurrence of $s$ in $T$ is at the $i$th position and the last occurrence is at the $j$th position (i.e., $s = s_i = s_j$ and for all $k = 1, 2, \ldots, n$ with $k < i$ or $k > j$, $s \neq s_k$). Let $d_f = n - i + 1$ denote the distance from the symbol $s_i$ to the 'blue end' of $T$ and let $d_l = j - 1$ denote the distance from the symbol $s_j$ to the 'red end' of $T$. A cut in $T$ occurring at one of the $d_f$ positions $i, i+1, \ldots, n$ will leave a red fragment containing $s$, and a cut in $T$ occurring at one of the $d_l$ positions $1, 2, \ldots, j-1$ will leave a blue fragment containing $s$. Hence, the number of red fragments retained on $L_s$ is $w d_f / n$ and the number of blue fragments retained on $L_s$ is $w d_l / n$. The color intensity read at $L_s$ is proportional to the number of fragments bearing the appropriate color. Let $c$ be the proportionality constant. Then the red intensity of $L_s$ is $c w d_f / n$ and the blue intensity is $c w d_l / n$. Since $c w / n$ is independent of $s$, we have, up to a scaling factor, determined the distance of the first occurrence of $s$ to the blue end of $T$ and the distance of the last occurrence of $s$ to the red end of $T$. Notably, since we are only concerned with the order of positions this gives the desired result. Notice that this argument also shows that PSH can be performed using a single color. Multiple cuts and multiple labels may also be used.
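The argument above can be traced in code. The following Python sketch is a minimal illustration of the error-free abstract model only: it computes the idealized red and blue intensities at one locus from the first and last occurrence of a symbol, then inverts them to recover both positions. The function names are hypothetical, the two distances are called `d_f` and `d_l` here, and the scaling constants `w` and `c` are assumed to be known, which the text does not require.

```python
def locus_intensities(target, symbol, w=1.0, c=1.0):
    """Idealized (red, blue) intensities at the locus for `symbol`.

    `target` is any sequence of symbols (a string or a list) containing `symbol`.
    """
    n = len(target)
    i = target.index(symbol) + 1           # first occurrence, 1-based
    j = n - target[::-1].index(symbol)     # last occurrence, 1-based
    d_f = n - i + 1                        # cuts that keep the first occurrence on a red fragment
    d_l = j - 1                            # cuts that keep the last occurrence on a blue fragment
    return c * w * d_f / n, c * w * d_l / n

def positions_from_intensities(red, blue, n, w=1.0, c=1.0):
    """Invert the intensities to the 1-based first and last positions."""
    d_f = red * n / (c * w)
    d_l = blue * n / (c * w)
    return round(n - d_f + 1), round(d_l + 1)

red, blue = locus_intensities("abcabd", "a")
first, last = positions_from_intensities(red, blue, 6)
```

For the six-symbol string in the example, the symbol `a` first occurs at position 1 and last occurs at position 4, and both positions are recovered from the two intensities alone.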

While a detailed mathematical analysis of PSH will not be presented here, it is worthwhile to indicate why one might expect PSH to perform efficiently in determining the sequence of long target molecules. Assume that all 65,536 8-mers are associated with different loci and that we are attempting to sequence a target strand of length 65,536. As a first approximation, model this situation as follows:

Let the loci set $L$ be of cardinality 65,536 (one for each 8-mer). Let the sample space $S$ be the set of all strings over $L$ of length 65,536. Let the probability distribution on $S$ be uniform. For each $l \in L$ and each $s \in S$, let $X_l(s)$ be the number of times $l$ occurs in $s$. Then the probability density function for $X_l$ is approximately Poisson:

$$\Pr\{X_l = k\} \approx \frac{1}{e \, k!}$$

If the target strand has no occurrences of the 8-mer associated with a given locus, then that locus gives no information about the target strand. If the target strand has one occurrence of the 8-mer, then that locus gives the exact position, thereby identifying the nucleotides at 8 positions on the target strand. If the target strand has k ≥ 2 occurrences of the 8-mer, then that locus gives the exact positions of the first and the last of the occurrences and, hence, the exact nucleotide sequence at 16 positions on the target strand - unless the occurrences overlap. However, when k ≥ 3 the locus gives no information about any of the k-2 'nested occurrences' of the 8-mer on the target strand. Hence, we can approximate the number of nested occurrences of 8-mer subsequences as:

$$65536 \sum_{k \geq 3} \frac{k - 2}{e \, k!}$$

which is approximately 6792 in this case.
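The figure of approximately 6792 can be checked numerically. The following Python sketch assumes the Poisson model described above, with each of the 65,536 loci seeing k occurrences with probability 1/(e·k!) and contributing k-2 nested occurrences whenever k ≥ 3. The function name and the truncation of the sum at 40 terms are choices made for this illustration.

```python
import math

def expected_nested(num_loci=65536, terms=40):
    """Expected number of nested occurrences summed over all loci."""
    per_locus = sum((k - 2) / (math.e * math.factorial(k)) for k in range(3, terms))
    return num_loci * per_locus

estimate = expected_nested()
closed_form = 65536 * (3 / math.e - 1)   # the series sums to 3/e - 1 per locus
```

The truncated sum and the closed form agree, and both round to 6792.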

For PSH to fail to determine the nucleotide at some position p, it must be the case that p has the following properties:

(1): p is the start of a nested occurrence.

(2): p-i is the start of a nested occurrence for i= 1,2,...,7.

The second condition follows since, if there exists an i ≤ 7 such that p-i is not the start of a nested occurrence, then the 8 nucleotides beginning at position p-i are known and one of these nucleotides is the nucleotide at position p. It is important to note that conditions (1) and (2) are necessary for PSH to be unable to determine the nucleotide at position p; however, they are not sufficient. It is possible that in some cases even if position p satisfies properties (1) and (2), knowledge of the strand obtained from PSH will still allow the nucleotide at position p to be deduced. This is addressed more fully below. We will refer to positions satisfying properties (1) and (2) as 'unknowns of type 1'.

The probability of being a nested occurrence is approximately 6792/65536 or roughly 1/10. Hence with our assumptions, a first very rough approximation to the probability that the target strand would have the eight straight nested occurrences necessary to yield an unknown of type 1 at position p would be (1/10)⁸. Consequently we would expect PSH to determine a target sequence of length 65536 with few or no unknowns of type 1.
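As a rough order-of-magnitude check of this estimate, and assuming (as the approximation above implicitly does) that eight consecutive start positions are nested independently, each with probability of about 1/10, the expected number of unknowns of type 1 over all 65,536 positions is roughly:

```latex
65536 \times \left(\frac{1}{10}\right)^{8} = 65536 \times 10^{-8} \approx 6.6 \times 10^{-4}
```

This is far below one, which is consistent with the expectation of few or no unknowns of type 1 for a random target of this length.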

As indicated above, it is possible in some cases to determine the nucleotide occurring at a position despite the fact that it is an unknown of type 1 (i.e., satisfies properties (1) and (2) above). To give an example for why this may be possible, consider the case of a target sequence consisting of one thousand A residues. In this case only the TTTTTTTT locus associated with AAAAAAAA would provide information about the sequence. The TTTTTTTT locus would provide the positions of the first and last occurrences of AAAAAAAA at 1 and 993, respectively. All other positions would be the beginning of nested occurrences of AAAAAAAA. Hence there would be hundreds of unknowns of type 1. Nonetheless, since, in an idealized error-free setting, only the TTTTTTTT locus detected anything, it must follow that all of these positions contain A. Hence all unknowns of type 1 are resolved.
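Properties (1) and (2) can be checked mechanically for any known test sequence. The following Python sketch counts unknowns of type 1 in the error-free setting; it is an illustration only and does not implement the deductive resolution step. The function name is hypothetical. Positions near either end are handled by requiring every 8-mer start that covers a position to be nested, which has the same effect as properties (1) and (2) because the first and last 8-mers of a strand are never nested.

```python
from collections import defaultdict

def type1_unknowns(target, k=8):
    """Return the 1-based positions satisfying both properties of a type 1 unknown."""
    n = len(target)
    starts = defaultdict(list)
    for s in range(1, n - k + 2):                 # 1-based start of every k-mer
        starts[target[s - 1:s - 1 + k]].append(s)
    nested = set()
    for occurrences in starts.values():
        nested.update(occurrences[1:-1])          # neither the first nor the last occurrence
    unknown = []
    for p in range(1, n + 1):
        # All k-mer starts whose k-mer covers position p.
        covering = range(max(1, p - k + 1), min(p, n - k + 1) + 1)
        if all(s in nested for s in covering):
            unknown.append(p)
    return unknown

poly_a = type1_unknowns("A" * 1000)
```

For the target of one thousand A residues, positions 9 through 992 are unknowns of type 1, which matches the statement that there would be hundreds of them before deduction is applied.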

A computer program was written which used this deductive approach for resolving unknowns of type 1.

Several remarks are in order. First, such deductive resolutions, when possible, have zero probability of introducing an incorrect nucleotide. Second, even when the deductive approach does not resolve the nucleotide at a given position, it often provides partial information (e.g. at that position the nucleotide is either an A or a T). An unknown of type 1 which remained unresolved following this program will be called an unknown of type 2. Computer experiments were run in order to verify the range of the PSH technique. Random DNA sequences of various lengths were generated and the number of unknowns of type 1 and type 2 were calculated. The results of these experiments are shown in Table I. These results indicated that PSH could be used to sequence some target strands greater than one hundred thousand nucleotides in length without error.

It is of interest to understand how PSH might perform when used to determine the sequences of naturally occurring polynucleotides. To give some insight into this, all sequences greater than 100,000 nucleotides in length from Release 40 of the European Bioinformatics Institute EMBL Nucleotide Sequence Database were obtained. Each sequence was cut into successive blocks of length 50000 (the final partial block was discarded). Within each block, initial segments of length 10000, 20000, 30000, 40000 and 50000 were analyzed by computer to determine the number of unknowns of type 1 and type 2. Statistics were calculated and are displayed in Table II. Since Epstein-Barr virus (EBV) and Epstein-Barr virus (HEHS4b95) are essentially the same, only Epstein-Barr virus (EBV) was analyzed. Since Epstein-Barr virus (EBV) had very large numbers of unknowns, only initial segments of lengths 10000 and 20000 were analyzed.

As indicated by the results shown in Table II, natural sequences vary considerably in the number of unknowns they produce. Target sequences from bacteria (E. coli and B. subtilis) produced very few unknowns even for target sequences 50,000 nucleotides in length. However, sequences derived from Epstein-Barr virus yield far greater numbers of unknowns. Large numbers of unknowns often resulted from the presence of multiple tandem repeats of long subsequences. For example, in Human (HSRETBLAS) (retinoblastoma susceptibility gene), nucleotide 123912 is the beginning of 30 repeats of a subsequence 53 nucleotides long. This resulted in 1473 straight unknowns of type 1 and 1465 straight unknowns of type 2. The unknowns stem from the fact that the 28 nested occurrences of the subsequence are hidden from PSH by the first and last occurrences. One might expect to see 28 × 53 = 1484 unknowns of type 1; however, a small number of nucleotides at the borders of this hidden range are resolved because they meet, at most, one of the properties necessary for an unknown of type 1, but not both. Fortunately, in practice, one might expect such tandem repeats to be resolved easily. Indeed, since the first and last occurrences are not hidden, they are detected and correctly resolved by PSH. Essentially, one is then presented with identical 53 nucleotide first and last occurrences separated by 28 × 53 = 1484 positions. A simple heuristic can be programmed to handle such multiple long tandem repeats. If this heuristic is used, then for Human (HSRETBLAS) the results for 50,000 nucleotides become 408.67 ± 67.89 unknowns (less than 1%). For EBV, position 12024 is the beginning of 11 tandem repeats of very slight variants of a subsequence 3072 nucleotides long (it was for this reason that the analysis of EBV was restricted to 29,000 nucleotides). It seems likely that heuristic analysis could successfully resolve most or all of this 33,792 nucleotide long stretch.

Error Handling

The PSH polynucleotide sequencing method was found to be advantageously resistant to error propagation. As indicated above, enormous numbers of polynucleotides having different sequences can be produced as segregated loci using, for example, silicon chip technology. Indeed, it has been suggested that chips having greater than 4¹² distinct loci can be fabricated. The disclosure presented above suggests that PSH employing such chips could be used to determine the sequence of super-megabase target polynucleotides. In the processing of the large volume of information obtained in such a procedure, it is essential to consider methods of correcting for errors and ambiguities that will arise in the experimentally-obtained data.

Errors of various sorts will arise when PSH is practiced in the laboratory. Like SBH, PSH will be subject to the kinds of errors identified by Lysov et al. in Dokl. Akad. Nauk SSSR 303:1508 (1988). These errors include false negatives, false positives and errors related to non-uniqueness. Additionally, PSH is also subject to a new kind of error, referred to as false position error.

As used herein, false negatives refer to errors in which, despite its presence, loci fail to detect the occurrence of an associated 8-mer in the target strand. False negatives may occur when experimental conditions inhibit hybridization and result in a locus annealing to an inadequate number of fragments to be detected. For example, this may occur when stretches of nucleotides are hidden by secondary structure in the target molecule. PSH may have advantages over SBH in this regard, since the random cutting used in PSH may result in fragments which do not possess the secondary structure that characterized the intact target strand. In addition, the use of probes having higher binding efficiency may facilitate hybridization under conditions which eliminate the secondary structure of the target molecule while still permitting probe hybridization. Included among such high binding efficiency probes are the PNA probes which have been described, for example, by Egholm et al. in Nature 365:556 (1993). Further, previously described experimental and computational methods for handling false positives and false negatives in SBH are also applicable in the analysis of positional sequence information obtained using the PSH technique.

As used herein, false positives refer to errors in which loci detect the occurrence of an associated 8-mer in the target strand, despite the absence of that 8-mer. For example, false positives may occur when a locus detects an 8-mer on the target strand which mismatches the probe associated with the locus. For example, hybridization between a target polynucleotide and an immobilized probe oligonucleotide, where the heteroduplex included a single base pair mismatch, would give rise to a false positive.

As used herein, errors of non-uniqueness refer to cases in which information from the loci may not be adequate to uniquely determine the polynucleotide sequence of the target strand. Non-uniqueness, even in the absence of error, becomes a serious problem in SBH for target strands of a few hundred bases in length. However, the previous sections have disclosed that PSH is far more resilient in this regard. For example, previously published computer simulations of SBH have shown that, in the absence of errors and with the use of all 1,048,576 10-mers as probes, only approximately 95% of target strands are fully sequenced. The analysis presented above suggests that, under the same assumptions, PSH would be capable of sequencing megabase target strands. Notably, presently available technology enables the placement of all 1,048,576 10-mer probes on a single chip. Accordingly, the PSH method practiced in connection with a single chip having loci representing all possible 10-mers would enable the sequencing of a megabase polynucleotide.

As used herein, false position errors refer to cases in which loci detect the occurrence of an associated 8-mer in the target strand at position P' despite the fact it actually occurs at position P. Advantageously, PSH is substantially resistant to such errors in position. The positional information obtainable by PSH, even with errors, establishes a limited range of possible positions for 8-mers on the target strand. Hence, the reconstruction of a long target strand substantially reduces to a series of smaller problems of reconstructing sequences within these defined ranges.

To test the resistance of PSH to positional errors, a suite of programs was written and computer experiments were run. The reconstruction was viewed in a graph theoretic context as conventionally applied in SBH analyses. The basic principles used in the approach, contraction and position refinement, are generally described as follows. We associate each locus with 2 nodes in a graph (one for the first and one for the last occurrence of the associated 8-mer in the target strand). Each node is labeled with a triple < M,P,σ >, where M is the 8-mer associated with the locus, P is its putative position as determined by the red/blue intensity and σ is the standard deviation of the error in the position. This standard deviation is determined experimentally and will vary between different PSH systems. It may also vary within one system, and from locus to locus in a target strand dependent manner. Nodes are connected by various kinds of directed edges. For example, a 'weak' edge is inserted between a node labeled < M,P,σ > and one labeled < M',P',σ' > if and only if the 3' 7-mer of M and the 5' 7-mer of M' are identical. However, a 'strong' edge is inserted only when an additional condition from the calculus of error is met. Roughly, this condition is that the putative distance d = abs(P'-P) between the positions is less than a bound involving c, where c is a constant dependent on the length of the target strand. The constant c is chosen using probability theory to ensure that two nodes NOT connected by a strong edge are unlikely to be associated with adjacent subsequences on the target strand.

One then proceeds to reduce the graph through a series of contractions and position refinements. For example, if a node < M₁,P₁,σ₁ > is the origin of exactly one strong edge and the destination of that edge is < M₂,P₂,σ₂ >, then these two nodes may be contracted into a single node < M₃,P₃,σ₃ > where M₃ is the sequence with 5' portion M₁ and 3' portion M₂, P₃ is the mean of the original positions adjusted for their relative offset and σ₃ = √(σ₁² + σ₂²)/2. A similar method can be used when a sequence of more than 2 nodes is contracted. For example, in the case of the contraction of a sequence of 16 nodes each associated with a σ of 100, the position associated with the contracted node would be the adjusted average of the original 16 positions and the associated standard deviation would be 100/√16 = 25. Hence, contraction leads to refinement in the position of subsequences and reduction in the standard deviation of the error in position. As a consequence of position refinement, nodes that were originally connected by a 'strong' edge may lose that edge after a contraction. This loss of 'strong' edges may in turn enable the contraction of nodes which failed to meet the criterion for contraction on previous passes. These techniques are first used to resolve, for most loci, whether in fact one or two occurrences were detected. Following this, a new graph is built and successive passes of contraction and position refinement are made until the graph is shrunk to a single node or to a graph small enough that a Hamiltonian Path can be established. At this point the original target sequence has been recreated.
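The weak-edge test and the contraction step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the suite of programs referred to in the text: the names `Node`, `weak_edge`, `strong_edge` and `contract_chain` are hypothetical, the strong-edge bound is supplied by the caller rather than derived from c, and the errors in position are assumed to be independent, so that averaging n positions gives a standard deviation of √(σ₁² + ... + σₙ²)/n.

```python
import math
from dataclasses import dataclass
from typing import List

K = 8  # probe length used in the text (8-mers)


@dataclass
class Node:
    seq: str      # M: subsequence represented by the node
    pos: float    # P: putative position of the first base of M
    sigma: float  # standard deviation of the error in P


def weak_edge(a: Node, b: Node) -> bool:
    """True when the 3' (K-1)-mer of a equals the 5' (K-1)-mer of b."""
    return a.seq[-(K - 1):] == b.seq[:K - 1]


def strong_edge(a: Node, b: Node, bound: float) -> bool:
    """Weak edge plus a distance condition; `bound` is an assumed input."""
    return weak_edge(a, b) and abs(b.pos - a.pos) < bound


def contract_chain(nodes: List[Node]) -> Node:
    """Contract a path of overlapping nodes into a single node."""
    seq = nodes[0].seq
    offset = 0                     # start of the current node relative to node 0
    adjusted = [nodes[0].pos]      # positions shifted back to the start of node 0
    for prev, nxt in zip(nodes, nodes[1:]):
        if not weak_edge(prev, nxt):
            raise ValueError("nodes do not overlap by K-1 bases")
        offset += len(prev.seq) - (K - 1)
        adjusted.append(nxt.pos - offset)
        seq += nxt.seq[K - 1:]     # append only the non-overlapping bases
    n = len(nodes)
    pos = sum(adjusted) / n        # adjusted average of the original positions
    sigma = math.sqrt(sum(x.sigma ** 2 for x in nodes)) / n
    return Node(seq, pos, sigma)


if __name__ == "__main__":
    target = "ACGTTGCAAGCTTGGCACTGGCC"  # 23 bases, i.e. 16 overlapping 8-mers
    chain = [Node(target[i:i + K], 500.0 + i, 100.0) for i in range(16)]
    merged = contract_chain(chain)
    print(merged.seq == target, merged.pos, merged.sigma)
```

With 16 nodes that each carry a σ of 100, the contracted node in this sketch carries a σ of 25, which agrees with the 100/√16 figure given above.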

A suite of programs has been written to carry out this "contraction/position refinement" approach. These programs have successfully reconstructed target sequences of several thousand nucleotides in the presence of errors in position chosen from a binomial distribution with standard deviation 100. Among the programs written were:

EPSH1.C: takes as input the triples < M, P, σ > provided by PSH. Since each locus is associated with two triples < M, P, σ > and < M', P', σ' > (one associated with the first occurrence of M in the target sequence and one associated with the last occurrence), EPSH1.C attempts to resolve whether in fact the first and last occurrences are the same or different. For this purpose, it first calculates the set of triples which are connected from/to < M, P, σ > by a strong edge, then the set connected by a sequence of 2 strong edges, etc. It continues this process so long as the contraction along all paths of maximal length in the resulting subgraph leads to the same subsequence. The subsequence that results is then a subsequence of the target strand which contains M. This is then done for < M', P', σ' >. If the resulting subsequences are consistent (i.e., there is a single subsequence which contains both and aligns the occurrence of M) and of sufficient length, which is determined using probability theory and is dependent on the length of the target strand, then it is concluded that the first and last occurrences of M in the target strand are the same and so the pair < M, P, σ > and < M', P', σ' > is replaced by a single triple < M", P", σ" > where P" and σ" are calculated in a manner similar to those described above. If the subsequences are inconsistent, then it is concluded that the first and last occurrences are distinct and the original triples are left unchanged. In other cases, < M, P, σ > and < M', P', σ' > are marked as "ambiguous." Since some position refinement takes place in this process, it is repeated (using similar programs to EPSH1.C) several times until the number of ambiguous triples can no longer be reduced.

EPSH3.C: takes the triples produced by EPSH1.C and its associated programs and undertakes a sequence of contractions and position refinements as described above. When no more contractions and position refinements are possible, it outputs the resulting contracted and refined graph. This graph is then searched for a Hamiltonian path and contracted along that path; the resulting single node < M*, P*, σ* > is such that M* is the original target strand.

Additional programs were written for very long target sequences where the above process is unable to resolve the target strand in its entirety, but rather produces a strand with unknowns at some positions. These programs use a "deductive approach" to resolving these unknowns. Since an unknown at position p can only occur when p, p-1, ..., p-7 are all the starting points of "nested occurrences," we may consider all 4 (A,T,C,G) possible choices for the unknown at position p and determine if there is a unique one which is consistent with this nesting requirement. A given choice may be rejected if the data from PSH shows that the oligonucleotide beginning at p-7, p-6, ... or p could not be hidden (for example, it was not detected by the corresponding locus). If all but one choice are rejected, the remaining choice must be the value of the unknown. The subroutines npshfl and npshfr perform these and related deductions.

A program was also written to simulate PSH on a random target strand. For each locus l, Mₗ denoted its associated 8-mer. For each locus l for which Mₗ occurred in the target strand, pairs (Mₗ,P₁) and (Mₗ,P₂) were generated, where P₁ is the position of the first occurrence of Mₗ in the target strand and P₂ is the position of the last occurrence of Mₗ in the target strand. If Mₗ occurred just once, then (Mₗ,P₁) and (Mₗ,P₂) would still be generated even though P₁ would equal P₂. Next, a second program was run which introduced errors into the positional information. For each pair (M,P) a triple (M,P',100) was generated with P' = P + ε, where the error term ε was the value of the random variable X - μ, with X a random variable with binomial distribution Bin(40000, 1/2) and μ = 20000 its mean. Hence the mean error was 0 and the standard deviation of the error was 100. These triples were then given as input to a suite of programs which attempted to reconstruct the sequence of the original target strand.
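The simulation described above can be sketched as follows. This is a minimal illustration and not the program that was actually run: the function names are hypothetical, positions are 0-based for convenience, and the binomial variate is drawn by counting the 1 bits among 40000 random bits, which is one simple way to sample Bin(40000, 1/2) with the Python standard library.

```python
import random
import statistics

N_TRIALS = 40000  # Bin(40000, 1/2) has mean 20000 and standard deviation 100


def positional_error(rng: random.Random) -> int:
    """Draw X - mean(X) with X ~ Bin(40000, 1/2)."""
    # Each of the 40000 random bits is an independent fair coin flip,
    # so the number of 1 bits is binomially distributed.
    x = bin(rng.getrandbits(N_TRIALS)).count("1")
    return x - N_TRIALS // 2


def first_last_positions(target: str, k: int = 8) -> dict:
    """Map each k-mer present in target to (first position, last position)."""
    found = {}
    for p in range(len(target) - k + 1):
        mer = target[p:p + k]
        first, _ = found.get(mer, (p, p))
        found[mer] = (first, p)
    return found


def simulate_psh(target: str, rng: random.Random, k: int = 8) -> list:
    """Return triples (M, P', 100), two per k-mer that occurs in target."""
    triples = []
    for mer, (first, last) in first_last_positions(target, k).items():
        for p in (first, last):
            triples.append((mer, p + positional_error(rng), 100))
    return triples


if __name__ == "__main__":
    rng = random.Random(1)
    target = "".join(rng.choice("ACGT") for _ in range(1000))
    triples = simulate_psh(target, rng)
    errors = [positional_error(rng) for _ in range(2000)]
    print(len(triples), statistics.mean(errors), statistics.pstdev(errors))
```

A k-mer that occurs only once still yields two triples in this sketch, each with its own independently drawn error, mirroring the statement that both pairs are generated even when P₁ equals P₂.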

A sample computer program for implementing the present invention is provided in Exhibit A.

The results of an experimental test confirmed the utility of the PSH method. First, the suite of programs reconstructed perfectly a randomly generated 1000 nucleotide target strand. Next, a random 2000 nucleotide target sequence was generated. In this case, it was expected that over 1200 positions had errors of absolute value greater than 100, and over 10 positions had errors of absolute value over 300. Despite this, the suite of programs reconstructed the sequence of the target strand perfectly.
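The expected numbers of large errors quoted for the 2000 nucleotide test can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that the counts refer to the positional triples (two per locus, with roughly one locus per 8-mer start position), and that the binomial error is well approximated by a normal distribution with standard deviation 100.

```python
import math


def two_sided_tail(k_sigma: float) -> float:
    """P(|Z| > k_sigma) for a standard normal variable Z."""
    return math.erfc(k_sigma / math.sqrt(2))


length, k = 2000, 8
# Assumption: about one locus per start position, two triples per locus.
n_triples = 2 * (length - k + 1)

over_100 = n_triples * two_sided_tail(1.0)  # errors beyond one sigma (100)
over_300 = n_triples * two_sided_tail(3.0)  # errors beyond three sigma (300)

print(n_triples, round(over_100), round(over_300, 1))
```

Under these assumptions the expected counts come out at roughly 1265 and roughly 10.8, which is consistent with the statement that over 1200 positions were expected to have errors above 100 and over 10 positions errors above 300.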

Given the high sensitivity and range of phosphor-, fluoro- and other forms of imagers, the ever improving methods for immobilizing multiple probe loci to solid supports, the improving methods for cloning long segments of DNA, and the inherent error resistance and range of PSH, long strands of DNA and other polynucleotides can be sequenced using the PSH method. An essential component of this method is the ability of PSH to provide positional sequence information.

The following specific protocol has been developed for implementation of PSH.

STEP 1. Using standard PCR, amplify the target strand and gel purify the amplified products.

STEP 2. Using the duplexed product of STEP 1 as template, perform 50 cycles of PCR using only a positive strand primer. This produces multiple copies of the positive strand which can then be gel purified.

STEP 3. Using standard Maxam-Gilbert chemistry, introduce random cuts into the product of STEP 2.

STEP 4. Obtain a primer complementary to the 5' end of the positive strand and which is labeled with fluorescein on its 5'-end. Obtain a primer complementary to the 3' end of the positive strand and which is labeled with Texas red on its 5' end. Anneal these oligonucleotides to the product of STEP 3.

STEP 5. Obtain multiple oligonucleotides to serve as probes. Each oligonucleotide should be biotinylated at its 5' end. Immobilize each oligonucleotide to streptavidin coated beads.

STEP 6. For each set of beads prepared in STEP 5, add an aliquot of the product of STEP 4. Incubate and wash the beads to isolate only polynucleotides captured through the streptavidin-biotin linkage. Elute from beads by heating and preserve the liquid phase containing fragments of positive strand and oligonucleotides with fluorescein and oligonucleotides with Texas red.

STEP 7. For each product obtained in STEP 6, employ a measuring means such as a fluorometer to measure the signal intensity from fluorescein and separately from Texas red. Pass the data so obtained to a suite of computer programs to reconstruct the sequence of the positive strand.

The following illustrates a protocol for large scale sequencing:

STEP 1. Using standard PCR, amplify the target strand and gel purify the amplified products.

STEP 2. Using the duplexed product of STEP 1 as template, perform 50 cycles of PCR using only a positive strand primer. This produces multiple copies of the positive strand which can then be gel purified.

STEP 3. Using standard Maxam-Gilbert chemistry, make random cuts into the product of STEP 2.

STEP 4. Obtain a primer complementary to the 5' end of the positive strand and which is fluorescein labeled on its 5'-end. Obtain a primer which is complementary to the 3' end of the positive strand and which is labeled with Texas red on its 5' end. Anneal these oligonucleotides to the product of STEP 3.

STEP 5. Obtain from vendor (e.g., Affymetrix) silicon chips containing loci for all 65536 8-mer DNA probes. Incubate the chip with the product of STEP 4 according to a hybridization protocol.

STEP 6. Employ a measuring means such as a fluorimager (e.g., Molecular Dynamics), in a single operation to measure the signal intensity from fluorescein and separately from Texas red. Pass the data so obtained to a suite of computer programs to reconstruct the sequence of the positive strand.
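The probe set named in STEP 5 of the large scale protocol, all 65536 8-mers, can be enumerated with a few lines of Python. This is a sketch only: the function name `all_kmers` and the mapping `locus_of` are hypothetical, and the lexicographic ordering is an arbitrary choice that says nothing about how any vendor actually lays out loci on a chip.

```python
from itertools import product


def all_kmers(k: int, alphabet: str = "ACGT"):
    """Yield every k-mer over the alphabet in lexicographic order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


probes = list(all_kmers(8))

# Hypothetical locus index for each probe; used only to look up a locus
# from a probe sequence when processing intensity data.
locus_of = {mer: i for i, mer in enumerate(probes)}

print(len(probes), 4 ** 8, 4 ** 10)
```

The same function with k = 10 gives the 1,048,576 10-mers discussed in connection with errors of non-uniqueness.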

The exemplary procedure described below provides a working example for how the PSH technique can provide positional sequence information that is essential to the invented method of DNA sequencing. As described below, this procedure was performed using hybridization substrates having known sequences. More specifically, this experiment employed a synthetic single-stranded 100-mer target DNA having a polynucleotide sequence derived from the sequence of pUC-18 plasmid DNA (GenBank accession number A02710). Immobilized oligonucleotide probes employed in this hybridization procedure also represented segments derived from the pUC-18 genome.

Although other materials and methods similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials useful in connection with the following exemplary demonstration are now described. General references for the various laboratory methods described below can be found in Molecular Cloning: A Laboratory Manual (Sambrook et al. eds. Cold Spring Harbor Lab Publ. 1989) and Current Protocols in Molecular Biology (Ausubel et al. eds., Greene Publishing Associates and Wiley-Interscience 1987). The disclosures of these references are hereby incorporated by reference.

Example 1 describes the hybridization substrates used to illustrate the operational basis of the PSH method.

Example 1

Preparation of Target DNA and Oligonucleotides

A 100-mer target DNA having the following sequence (5' to 3') was synthesized:

CTGCAGGCATGCAAGCTTGGCACTGGCCGTCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGGGGTTACATCGAACTGGATCTCAACAGCGG (SEQ ID NO:1)

The 100-mer target DNA of SEQ ID NO:1 represented a contiguous stretch of subsequences derived from different parts of the pUC-18 genome. These subsequences are presented below. Numbers in parentheses correspond to the number of guanosine (g) residues in each stretch of DNA. Lengths of the various stretches and corresponding positions in the pUC-18 genome are also indicated for each subsequence.

Three 16-mer probes complementary to the underlined subsequences shown above were also synthesized. The oligonucleotide probe complementary to the sequence between positions 274-279 was named pL (the probe to the left side of the target DNA). The oligonucleotide probe complementary to the sequence between positions 656-671 was named pM (the probe in the middle portion of the target DNA). The oligonucleotide probe complementary to the sequence between positions 1019-1034 was named pR (the probe to the right side of the target DNA). As a negative control for the hybridization procedure, a probe complementary to DNA encoding the late antigen of cytomegalovirus (LA6) was also synthesized. All four probes (pL, pM, pR and LA6) were biotinylated at their 5' ends according to standard laboratory procedures.

Example 2 describes the procedure that was used to label one of the two ends of the target DNA strand.

Example 2

5' End-Labeling of Target DNA

The 100-mer target DNA was 5' end-labeled with [γ-³²P]ATP using T4 polynucleotide kinase. The reaction mixture contained 30 μg of the target DNA (1 nmole); 50 mM Tris HCl, pH 8.0 at 25°C; 11.25 mM MgCl₂; 5 mM dithiothreitol; 30 units of T4 polynucleotide kinase (GibcoBRL, cat. #18004 010); 0.8 mCi [γ-³²P]ATP (700 Ci/mmole, ICN, cat #3502005) in a volume of 50 μl. The reaction was carried out at 37°C for 30 minutes, after which time the reaction mixture was passed through a Sephadex G-25 Superfine (DNA grade, Pharmacia, cat. #17-0572-02) column. Radiolabeled DNA was recovered in a total volume of 270 μl. The sample was ethanol precipitated and resuspended in 40 μl of H₂O.
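The molar quantities in this reaction can be cross-checked with a short calculation. The sketch below assumes an average mass of about 330 g/mol per nucleotide of single-stranded DNA, a common rule of thumb that is not taken from the text; the function names are hypothetical.

```python
AVG_NT_MASS = 330.0  # g/mol per nucleotide of ssDNA (assumed approximation)


def nmol_from_mass(micrograms: float, length_nt: int) -> float:
    """Approximate nanomoles of an oligonucleotide from its mass."""
    grams = micrograms * 1e-6
    moles = grams / (AVG_NT_MASS * length_nt)
    return moles * 1e9


def nmol_from_activity(millicuries: float, ci_per_mmol: float) -> float:
    """Nanomoles of labeled ATP from total activity and specific activity."""
    mmol = millicuries * 1e-3 / ci_per_mmol
    return mmol * 1e6


target_nmol = nmol_from_mass(30, 100)      # 30 ug of the 100-mer
atp_nmol = nmol_from_activity(0.8, 700)    # 0.8 mCi at 700 Ci/mmole

print(round(target_nmol, 2), round(atp_nmol, 2))
```

Under this approximation, 30 μg of a 100-mer is close to the stated 1 nmole, and the labeled ATP is present in a similar molar amount, so the DNA 5' ends and the ATP are roughly equimolar in the reaction.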

Example 3 describes the method that was used to cleave the target DNA.

Example 3

DNA Fragmentation

Cleavage of the target DNA at G residues was performed according to standard Maxam-Gilbert (M.-G.) chemistry (Sambrook et al. in Molecular Cloning: A Laboratory Manual, Second Edition, 1989, volume 2, pp. 13.78-13.95, Cold Spring Harbor Laboratory Press). Briefly, 10 μl of ³²P-labeled target DNA was mixed with 190 μl of M.-G. DMS buffer and 5 μl of 10% DMS. In the mock experiment, H₂O was substituted for DMS. After incubation and ethanol precipitation procedures, the DNA to be cleaved was dissolved in 100 μl of 1 M piperidine. The DNA sample in the mock procedure was dissolved in 100 μl of water, incubated at 90°C for 30 minutes and dried in a SpeedVac.

Example 4 describes the method that was used to coat the individual wells of a microtiter plate with avidin.

Example 4

Preparation of Avidin Coated Microtiter Plates

Wells of 96-well microtiter strip plates (E.I.A./R.I.A. Strip Plate-8, Flat Bottom, High Binding, Costar, cat. #2581) were coated with avidin (Sigma, cat. #A-9360) overnight at room temperature. 150 μl of 100 μg/ml avidin solution in 50 mM carbonate-bicarbonate buffer (Sigma, cat. #C-3041), pH 9.6 at 25°C was added to each well. Non-bound avidin was discarded and the plates blotted on paper towels. 250 μl aliquots of blocking solution were added to the wells which were then incubated at 37°C. The blocking solution contained 1% gelatin (Sigma, cat. #G-6144), 5X Denhardt's solution (5 Prime-3 Prime, Inc., cat. #5302-211000), and 25 μl/ml of sheared salmon sperm DNA (5 Prime-3 Prime, Inc., cat #2-755689). Plates with the blocking solution were kept at +4°C at least overnight before use.

Example 5 describes the method that was used to immobilize the biotinylated probe DNAs to the avidin coated microtiter plates.

Example 5

Immobilization of Biotinylated Probes to Avidin-Coated Plates

Plates having the blocking solution were incubated at 40°C for 15-20 minutes to melt the gelatin. The blocking solution was then discarded and the plates blotted on paper towels. Biotinylated probes were added (2 nmoles/well) in 80 μl of a probe diluent solution containing 7.5X Denhardt's solution and 3.5X SSPE (35 mM phosphate buffer, pH 7.4 at 25°C; 0.5215 M NaCl; 3.5 mM EDTA, Sigma, cat. #S2015). Binding was carried out at 37°C for one hour. The wells were washed five times with a wash buffer containing phosphate buffered saline (PBS, GibcoBRL, cat. #21600-010) made 0.05% in Tween 20 (Polyoxyethylenesorbitan monolaurate, Sigma, cat. #P-5927).

Example 6 describes the method used to hybridize labeled target DNA with the plate-immobilized probes.

Example 6

Hybridization of the ³²P-Labeled Target DNA with Immobilized Probes

5' labeled target DNA fragments were dissolved in water, denatured at 75°C for three minutes and then quickly chilled on ice. One-fourth of the 4X probe diluent solution (+4°C) was added to the denatured samples.

80 μl of each sample was added to the wells containing the immobilized probes. The plates were incubated at 37°C for 30 minutes and then returned to room temperature for 15 minutes. The wells were washed seven times with wash buffer and blotted on paper towels.

The extent of probe hybridization was determined by measuring the amount of radioactivity bound to the probe oligonucleotides immobilized to the wells of the microtiter plate. Radioactivity was counted in 5 ml of scintillation cocktail using a Beckman scintillation spectrometer. Wells containing the hybridized samples were broken off the strips and dropped into scintillation vials containing a standard scintillation cocktail. In addition to Maxam-Gilbert (M.-G.) processed and mock-processed oligonucleotides, target DNA that did not undergo any treatment was also used in hybridization experiments. As a control for nonspecific hybridization, an oligonucleotide probe complementary to a portion of the cytomegalovirus genome (LA6) was hybridized to fragmented target DNA in parallel. The results of these hybridization experiments are presented in Table III.

The ratios of the ³²P counts for M.-G. treated versus mock M.-G. treated (or alternatively unprocessed) samples were particularly notable, as these results provided positional sequence information. When background radioactivity was subtracted, these ratios became:

pL: 0.66

pM: 0.42

pR: 0.05

These values were compared with the ratios of the distance in nucleotides of the probe complement from the 3' end of the target DNA versus the total length of the target DNA (=100):

pL: 0.79

pM: 0.42

pR: 0.05

These results clearly indicated a good correlation between the experimentally derived positional information and the actual positions of subsequences within the target DNA, thereby validating the utility of the PSH method.
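The comparison between the measured ratios and the known probe positions can be made explicit with a short calculation. The sketch below uses only the six values reported above; reading each background-corrected ratio as a direct linear estimate of the fractional distance from the 3' end is an assumption made here for illustration, and the function names are hypothetical.

```python
TARGET_LENGTH = 100  # nucleotides in the target DNA

# Background-corrected ratios of M.-G. treated to mock-treated counts.
measured = {"pL": 0.66, "pM": 0.42, "pR": 0.05}
# Distance of the probe complement from the 3' end, divided by total length.
expected = {"pL": 0.79, "pM": 0.42, "pR": 0.05}


def estimated_distance(ratio: float, length: int = TARGET_LENGTH) -> float:
    """Distance from the 3' end implied by a ratio, in nucleotides."""
    return ratio * length


def same_order(a: dict, b: dict) -> bool:
    """True when both mappings rank their keys in the same order."""
    return sorted(a, key=a.get) == sorted(b, key=b.get)


for probe in measured:
    difference = estimated_distance(measured[probe]) - estimated_distance(expected[probe])
    print(probe, round(difference, 1))

print(same_order(measured, expected))
```

On these figures the three probes are ranked in the same order by both measures, and the largest discrepancy is for pL, at about 13 nucleotides on a 100 nucleotide target.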


Claims

WHAT IS CLAIMED IS:

1. A method of determining the nucleotide sequence of a polynucleotide molecule, comprising the steps:

(a) obtaining polynucleotide molecules to be sequenced as a substantially homogeneous population, each of said molecules having a 5' end and a 3' end, said polynucleotide molecules having a label at at least one of said ends subject to the provision that if both ends are labeled then each of said ends has a different label;

(b) cleaving the polynucleotide molecules at substantially random positions to produce cleaved fragments;

(c) contacting the cleaved fragments with a plurality of oligonucleotide probes in a hybridization reaction, each of said plurality of oligonucleotide probes having a nucleotide sequence that is known;

(d) measuring an intensity of hybridization between the cleaved fragments and each of said plurality of oligonucleotide probes, thereby obtaining positional information about the occurrence of a plurality of sequences along a length of said polynucleotide; and

(e) constructing with a computer a contiguous nucleotide stretch represented by the plurality of oligonucleotide probes hybridized to the cleaved fragments in the contacting step and constrained by the positional information obtained in the measuring step, thereby determining the nucleotide sequence of the polynucleotide.

2. The method of Claim 1, wherein said label at at least one of the ends of the polynucleotide is covalently attached to the polynucleotide.

3. The method of Claim 1, wherein a labeled oligonucleotide hybridized to the polynucleotide indirectly end-labels the polynucleotide.

4. The method of Claim 1, wherein the polynucleotide molecules are cleaved an average of once per molecule in the cleaving step.

5. The method of Claim 1, wherein the plurality of oligonucleotide probes in the contacting step are immobilized to a solid support.

6. The method of Claim 5, wherein each of the plurality of oligonucleotide probes contacted in the contacting step has a length of from 6 to 20 nucleotides.

7. The method of Claim 6, wherein the oligonucleotide probes are selected from the group consisting of DNA probes and PNA probes.

8. The method of Claim 1, wherein the measuring step comprises detecting the label.

9. The method of Claim 8, wherein a higher intensity of hybridization measured in the measuring step indicates that one of said plurality of oligonucleotide probes is nearer to the label on the polynucleotide compared to a different one of said plurality of oligonucleotide probes having a lower intensity of hybridization.

10. The method of Claim 8, wherein the label is detectable optically.

11. The method of Claim 10, wherein the label detectable optically is a fluorescent label.

12. The method of Claim 1, wherein the obtaining step further comprises cleaving a circular DNA molecule with a restriction endonuclease to produce a linear DNA molecule.

13. The method of Claim 1, wherein the cleaving step comprises contacting the polynucleotide molecules with an enzyme.

14. The method of Claim 13, wherein the enzyme is selected from the group consisting of S1 nuclease, mung bean nuclease and deoxyribonuclease.

15. The method of Claim 1, wherein the cleaving step comprises mechanically shearing the molecules.

16. The method of Claim 1, wherein the cleaving step comprises contacting the polynucleotide molecules with a chemical agent.

17. The method of Claim 1, wherein the constructing step further comprises the use of computer programs for minimizing errors of the type selected from the group consisting of false negative errors, false positive errors, non-uniqueness errors and false position errors.

18. A method of constructing a polynucleotide sequence of a nucleic acid molecule from information about the approximate positional occurrence of a plurality of subsequences of the molecule having known sequences along a length of the molecule, said method comprising the steps of:

processing with a computer the approximate positional occurrence information for the plurality of subsequences along the length of the molecule to create a representation of the approximate position of the subsequences;

refining the representation repeatedly to combine pairs of partially overlapping subsequences in said polynucleotide molecule to result in a plurality of combined subsequences;

establishing positional information for said plurality of combined subsequences from positional information about said partially overlapping subsequences; and

repeating the refining step and the establishing step until no more subsequences are combined, thereby constructing the polynucleotide sequence along the length of the nucleic acid molecule.

19. The method of Claim 18, wherein the processing step comprises associating each subsequence among said plurality of subsequences with a node in a graph.

20. The method of Claim 19, wherein the node is identified by a nucleotide sequence, an approximate position and a standard deviation of a calculatable error in the approximate position.

21. The method of Claim 20, wherein the standard deviation of the calculatable error in the approximate position is determined experimentally.

22. The method of Claim 20, wherein the calculatable error in the approximate position is derived from a binomial distribution.

23. The method of Claim 20, wherein the calculatable error in the approximate position is derived from a non-uniform probability distribution.

24. The method of Claim 20, wherein the calculatable error in the approximate position is derived from a probability distribution having unbounded range.

25. The method of Claim 20, wherein a plurality of said nodes are connected by directed edges in the refining step.

26. The method of Claim 20, wherein the establishing step comprises calculating a mean value for the positions of said partially overlapping subsequences.