WO2005021786A1

WO2005021786A1 - A method of sequencing nucleic acids by ligation of labelled oligonucleotides

Info

Publication number: WO2005021786A1
Application number: PCT/GB2004/003666
Authority: WO
Inventors: Colin Barnes
Original assignee: Solexa Limited
Priority date: 2003-08-27
Filing date: 2004-08-27
Publication date: 2005-03-10
Also published as: WO2005021786A8; GB0320059D0

Abstract

The invention is concerned with a method of sequencing nucleic acid molecules, such as DNA molecules, using ligation cassettes.

Description

A METHOD OF SEQUENCING NUCLEIC ACIDS BY LIGATION OF LABELLED OLIGONUCLEOTIDES

The present invention is concerned with a method of sequencing and in particular with a method of sequencing nucleic acid molecules, such as DNA molecules, using ligation cassettes.

Background of the Invention

Recently, the Human Genome Project determined the entire sequence of the human genome. The sequence information represents that of an average human being. Nevertheless, there is still considerable interest in sequencing operations to, for example, identify differences in the genetic sequence between different individuals, and in developing methods of ever higher throughput sequence analysis and sequencing to analyse completely the genomes of organisms.

Conventional sequence analysis typically required fragmentation of the nucleic acid and separation of the resulting fragments by denaturing gel electrophoresis, which is time consuming and labour intensive. Increased throughput methods have however been proposed that utilise nucleic acid molecules immobilised on an array which are probed by differently labelled oligonucleotides. Hybridisation of the probe to the sample identifies those sequences that are complementary to that of the probe (Drmanac, R., et al., Genomics 4: 114-128 (1989)). Thus, the methods used, whether on arrays or otherwise, usually study hybridisation events to determine the sequence of the nucleic acid. Many of these hybridisation events are detected using fluorescent labels attached to the nucleotides, the labels being detected using a sensitive fluorescent detector, e.g. a charge coupled detector (CCD). The major disadvantages associated with these methods are that it is not possible to sequence long stretches of nucleic acid and that repeat sequences can lead to ambiguity of results. These problems are recognised in Automation Technologies for Genome Characterisation, Wiley-Interscience (1997), ed. T.J. Beugelsdijk, Chapter 10: 205-225.
An alternative sequencing approach is described in WO 89/03432, which comprises hybridising a fluorescently-labelled strand of DNA to a target DNA sample suspended in a flowing sample stream, and then using an exonuclease to cleave repeatedly the end base from the hybridised DNA. The cleaved bases are detected in sequential passage through a detector, allowing reconstruction of the base sequence of the DNA. Each of the different nucleotides has a distinct fluorescent label attached, which is detected by laser induced fluorescence. This is a complex method, primarily because it is difficult to ensure that every nucleotide of the DNA strand is labelled and that this has been achieved with high fidelity to the original sequence.

A further sequencing method has been described in US 5,302,509 where the sequence of a target polynucleotide is determined by detecting the incorporation of nucleotides into the nascent strand through the detection of a fluorescent label attached to the incorporated nucleotide. The target polynucleotide is primed with a suitable primer and the nascent chain is extended in a stepwise manner by the polymerase reaction. Each of the different nucleotides (A, T, G and C) incorporates a unique fluorophore at the 3' position which acts as a blocking group to prevent uncontrolled polymerisation. The polymerase enzyme incorporates a nucleotide into the nascent chain complementary to the target and the blocking group prevents further incorporation of nucleotides. The array surface is then cleared of unincorporated nucleotides and each incorporated nucleotide is "read" optically by a charge coupled detector using laser excitation and filters. The 3'-blocking group is then removed (deprotected) to expose the nascent chain for further nucleotide incorporation.

Summary of the Invention

The present invention has advanced this technology further by utilising it in conjunction with ligation cassettes comprising oligonucleotide molecules having one or more defined nucleotides, in a novel method which provides a particularly convenient and flexible approach to nucleic acid molecule sequencing.

In a first aspect there is provided by the present invention a method of determining the sequence of a target nucleic acid molecule comprising,

(i) immobilising fragments of said target nucleic acid molecule onto the surface of a solid support to form an array of nucleic acid molecules which are capable of interrogation, each of said molecules being immobilised other than at that part of the molecule that can be interrogated;

(ii) contacting said molecules with a library of ligation cassettes each comprising an oligonucleotide having one or more defined bases and having a characteristic label thereon, under conditions that permit ligation of one of said cassettes to a primer sequence hybridised or otherwise maintained in a spatial relationship with said target nucleic acid molecules, each of said cassettes being suitably blocked to permit only a single ligation event;

(iii) identifying the label(s) attached to any ligated cassette and removing the blocking group associated therewith and optionally removing said label;

(iv) repeating steps (i) to (iii) for a sufficient number of times to generate a complementary oligonucleotide sequence to each of said target nucleic acid molecules, each of said oligonucleotides having known nucleotides spaced intermittently along their length that can be placed in the context of a reference sequence, and comparing the overlapping sequences of said oligonucleotide sequences in the context of the reference sequence to determine the sequence of the target nucleic acid molecule.

Thus, advantageously, it is not necessary to know the identity of all of the bases in the ligation cassettes, but only of one or more defined bases, whereas previous sequencing methods required the identification of every added base in a continuous sequence of incorporation of the complementary nucleotides in the nascent chain.
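By way of illustration only, the repeated cycle of steps (ii) and (iii) can be sketched as a toy Python model. The function name `read_defined_bases`, the cassette pattern notation (X for an undefined position, N for a defined and labelled position) applied in this way, and the assumption that each cassette ligates immediately adjacent to the previous one with perfect fidelity are all simplifications introduced for this sketch and are not taken from the method as claimed.

```python
# Toy model (illustrative assumptions only): each cycle ligates one cassette
# directly after the previous one, and the label reports the cassette base(s)
# at the defined "N" position(s), which are complementary to the template.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def read_defined_bases(template, cassette_pattern="XXNXX", cycles=4):
    """Return (template_position, cassette_base) pairs read over the cycles.

    template: bases of the target downstream of the primer, written in the
              order in which cassettes are added (hypothetical convention).
    cassette_pattern: "X" marks an undefined base, "N" a defined, labelled base.
    """
    length = len(cassette_pattern)
    defined_offsets = [i for i, c in enumerate(cassette_pattern) if c == "N"]
    reads = []
    for cycle in range(cycles):
        start = cycle * length
        if start + length > len(template):
            break  # not enough template left for a whole cassette
        for offset in defined_offsets:
            position = start + offset
            reads.append((position, COMPLEMENT[template[position]]))
    return reads


if __name__ == "__main__":
    # Two cassettes fit on this 10-base template; one base is read per cassette.
    print(read_defined_bases("ACGTTGCAAT"))
```

In this sketch only every fifth base is identified, which reflects the point made above that the complementary strand carries known nucleotides spaced intermittently along its length.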
In previous sequencing methodologies, for example as described in International Patent Application No WO 00/06770, approximately only the first 20 nucleotides of the target nucleic acid molecule are required to be sequenced so as to be able to place the sequences obtained in the context of the complete reference sequence for an organism. In the method of the present invention, steps (i) to (iii) can be repeated a sufficient number of times for the sequences of the defined bases to be placed in the context of a reference sequence. For example the sequence obtained from the human genome project can be used as the reference sequence when the organism is human. Once the position of each of the nucleic acid molecules (fragments of the target nucleic acid molecule) has been determined with respect to the reference sequence, the degree of overlap between the fragments of the target nucleic acid molecule will permit the sequence of the target molecule to be elucidated. The primer sequence may also therefore be a ligation cassette that has been incorporated in the immediately preceding ligation event. Therefore, the fragmentation of the target nucleic acid molecule immobilised on said solid support to form an array of polynucleotide molecules allows a massively parallel approach to sequencing using the ligation cassettes by comparing the sequence data from millions of immobilised fragments of the target nucleic acid in the context of a reference sequence, which may be the genome of any organism such as a bacteria, virus or plant, amongst others. However, preferably, the sequence is derived from a mammal, which is preferably human, thus permitting large scale sequencing operations. The ligation cassette may comprise from 3 to 20 nucleotides and preferably 4 to 15 nucleotides. Thus, the cassette can be varied in length depending on the length of the target nucleic acid molecules. 
In a preferred embodiment of the invention the nucleic acid molecule or fragments thereof are immobilised so that they are capable of being individually resolved, preferably by optical microscopy. Therefore, the invention is particularly useful in the context of single molecule sequencing, which, as described in WO 00/06770, facilitates high throughput sequencing operations. However, it is to be understood that the utility of the sequencing method provided by the invention is not limited to sequencing on single molecule arrays comprising nucleic acid molecules or fragments thereof immobilised so that they are capable of being individually resolved, preferably by optical microscopy. The method may be used for sequencing on essentially any type of array formed by immobilisation of nucleic acid molecules on a solid support. Suitable arrays may include, for example, multi-polynucleotide or clustered arrays in which distinct regions on the array comprise multiple copies of one individual polynucleotide molecule or even multiple copies of a small number of different polynucleotide molecules (e.g. multiple copies of two complementary nucleic acid strands). The method according to the invention optimally requires the ligation cassette to be blocked in such a manner as to prevent uncontrolled polymerisation of the cassettes, thus permitting the labels associated with the cassettes and the particular defined bases associated therewith to be identified. The nature of the blockage depends on the nature of the ligation procedure to be utilised. In particular, when an enzymatic ligation procedure is utilised, the ligase enzyme requires the presence of a phosphate moiety to join together the 3' and 5' hydroxyl groups on the nucleotides. Therefore, in one embodiment the 3' group of the ligation cassette may be chemically blocked to prevent the ligase incorporating any further cassettes into the polynucleotide chain.
Typically, in this embodiment the target nucleic acid will be immobilised on the surface of the solid support at its 3' end so that the extension of the complementary polynucleotide chain from the primer sequence proceeds in the 5' to 3' direction. One of the advantages associated with the present invention is that it is possible to perform an "inverted ligation" procedure whereby the target nucleic acid molecule is immobilised at its 5' end and the cassette is ligated to the primer such that the incorporation of the cassettes into the complementary chain proceeds in the 3' to 5' direction, thus permitting a greater degree of flexibility than previous sequencing methods. In one embodiment of this aspect of the invention the target nucleic acid molecule comprises a hairpin loop molecule, in which case the primer sequence and target nucleic acid are linked by a linker moiety. In this embodiment the ligation of the cassettes occurs on the 3' end of the hairpin so that the incorporation of the cassettes proceeds in the 5' to 3' direction. Thus, the hairpin molecule comprises a 5' overhang relative to the 3' end. The inverted ligation occurs when the hairpin includes a 3' overhang relative to the 5' end so that the ligation cassettes are ligated at the 5' end of the molecule, which thus extends in the 3' to 5' direction. The block may advantageously be either a chemically blocked 3' OH group on the cassette or alternatively the block may be defined by the absence of a phosphate moiety on the ligation cassette, which prevents further incorporation of any additional cassettes by the ligase. Once the label has been read and subsequently removed, enzymatic addition of a phosphate group to the 5' end of the cassette can be performed to enable the addition of a further cassette to the polynucleotide chain.
When a 3' blocking group is utilised, preferably the group is an azidomethyl group, although other suitable blocking groups are available and would be known to the skilled practitioner. Typical protecting groups are described in "Protective Groups in Organic Synthesis", T.W. Greene and P.G.M. Wuts, third edition, Wiley Interscience, 605 Third Avenue, New York. Thus, the ligation step according to the invention may be effected using any known chemical or enzymatic means. Preferably, as identified in the aforementioned embodiment, the ligation may be effected by an appropriate ligase enzyme, which may be any of T4 DNA ligase, E. coli DNA ligase or Taq DNA ligase but is preferably a T4 DNA ligase enzyme, although, as would be well known to those of skill in the art, many more ligase enzymes are available that could be utilised. Alternatively, a chemical ligation procedure may be employed to ligate the cassette to either the 5' or 3' end of the primer sequence depending on the orientation of the target nucleic acid molecule immobilised on the solid support. Chemical ligation reactions that join two nucleic acid molecules are known. The chemical ligation may be performed between a 5'-leaving group, such as iodide or tosylate, and a 3'-thiophosphate or selenophosphate nucleophile. However, other chemical ligation methods may be used, for example those that utilise EDC [1-(3-Dimethylaminopropyl)-3-ethylcarbodiimide hydrochloride] as set out in Figure 4 and which are known in the art. The ligation cassette may be labelled at any position along its length by a suitable detectable label, provided the label does not interfere with the ligation of the cassette to the primer sequence. In one embodiment the detectable label, which may be a fluorophore, for example, may be attached to the base of said defined nucleotide by, for example, a cleavable linker. The linker may be acid labile, photolabile or contain a disulfide linkage.
Other appropriate linkers known in the art may be used. Preferred linkages and labels include those disclosed in WO 03/048387. In the method described herein it is preferred that the label and any blocking group are removable in a single treatment step. Thus, in a preferred embodiment the blocking group may be cleaved simultaneously with the label. The target nucleic acid molecules may be immobilised either directly or indirectly onto the solid support. Preferably, the nucleic acid molecules comprise a self-priming hairpin loop molecule.

The invention may be more clearly understood with reference to the accompanying drawings wherein:

Figures 1(a) and (b) are schematic representations of the method according to the invention.

Figure 2 is a representation of the structure of the ligation cassettes that may be used in the enzymatic ligation or inverted ligation procedure.

Figure 3 is a representation of the structure of the ligation cassettes that may be utilised in the chemical ligation or inverted chemical ligation procedure according to the invention.

Figure 4 is a representation of the chemical ligation steps that may be utilised in accordance with the method of the invention.

Figure 5 is an illustration of the compounds utilised to synthesise the ligation cassettes utilised in Example 1.

Figure 6 is an illustration of the compounds and reagents utilised in Example 2.

Description of the Invention

The present invention relates to a method of determining the sequence of a target nucleic acid molecule using a ligation procedure as opposed to the more frequently used polymerase reaction. The ligation cassettes used in accordance with the method of the invention comprise a sequence of nucleotides (polynucleotide or oligonucleotide) where the identity of only some of the bases therein is known.
In this manner following the step wise incorporation of the ligation cassettes into a complementary polynucleotide chain, it is possible to obtain in any sequencing procedure the sequence information of only those bases that are defined in each of the ligation cassettes incorporated along the length of the complementary polynucleotide chain. Thus, instead of obtaining sequence information about each successive nucleotide in the target nucleic acid molecule, as is the case in conventional sequencing methodologies, the identity of only those few nucleotides that are known or defined by way of an appropriate label will be identified. The incorporation of the ligation cassettes therefore results in a complementary polynucleotide sequence where only intermittent nucleotides along its length are known by virtue of the relevant label attached to the cassette. The number of defined or known bases in any given ligation cassette will determine how many incorporation steps will be required to place the sequence in the context of the reference sequence. For example, identifying from 15 to 20 of the known or defined nucleotides should be sufficient to identify the appropriate region of the reference sequence and thus place the fragment into the context of the reference sequence as described in the invention using bioinformatics tools. When the method is performed on an array by generating fragments of the target nucleic acid molecule, it will be possible to determine the full sequence of the target nucleic acid molecule by virtue of the massive degree of overlap between the fragments using the bioinformatics tools. Therefore, the reference sequence is utilised as a point of reference to locate the appropriate position for the fragments in the context of the reference sequence while the overlap between the sequences of the oligonucleotides enables the sequence of the target nucleic acid molecule to be determined. 
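The placement of a fragment in the context of a reference sequence from intermittently known bases can be pictured with the following minimal Python sketch. It is not the bioinformatics tooling referred to above; the function name `place_fragment`, the representation of the reads as a mapping from offset to base, and the requirement for exact matches on the same strand as the reference are assumptions made purely for illustration. A practical tool would also need to tolerate mismatches, since differences from the reference are what the method seeks to detect.

```python
def place_fragment(reference, known_bases):
    """Return every start position in `reference` consistent with the reads.

    known_bases: dict mapping an offset within the fragment to the base known
                 at that offset (all other offsets are unknown and ignored).
                 Bases are assumed to be given on the reference strand.
    """
    span = max(known_bases) + 1  # fragment length covered by the reads
    hits = []
    for start in range(len(reference) - span + 1):
        if all(reference[start + offset] == base
               for offset, base in known_bases.items()):
            hits.append(start)
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCAATGGCTA"
    # One known base leaves several candidate positions ...
    print(place_fragment(reference, {2: "G"}))
    # ... while a second known base narrows the placement.
    print(place_fragment(reference, {2: "G", 7: "A"}))
```

The example shows why the number of defined bases accumulated over the incorporation steps matters: each additional known base removes candidate positions until the fragment can be placed uniquely.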
The target nucleic acid molecule preferably comprises the complete genome of an individual. The fragments may be prepared by techniques known in the art such as mechanical shearing or restriction digestion by an appropriate enzyme, which in the case of DNA may be an appropriate DNase, such as DNase I. The ligation cassettes are preferably from 4 to 15 nucleotides in length and carry one or more defined bases thereon. The cassette may therefore be of the structure XXNXX, XNXXXX, NXXX, XNNXX, etc., where X is an unknown nucleotide and therefore represents any one of the four nucleotide bases and N is a defined nucleotide represented by a characteristic label present on the cassette. In the case where the cassette is of the structure XNNXX, 16 unique labels would be required to identify each of the 16 possible combinations of the four bases at the positions NN in the cassette. A library of cassettes is therefore prepared and applied to the target nucleic acid molecule to be sequenced. Such cassettes can be prepared on DNA synthesis machines using mixtures of bases as set out in more detail in the examples provided. In the case shown as XXNXX-label, four oligonucleotide mixtures are prepared, XXAXX-L1, XXTXX-L2, XXGXX-L3, and XXCXX-L4, each mixture containing 256 sequences, the combined mixture containing 1024 oligonucleotides labelled with four different labels. The label may be positioned anywhere on the cassette provided its presence does not interfere with the incorporation of the cassette into the growing polynucleotide by ligation. As is known in the art, a "nucleotide" consists of a nitrogenous base, a sugar, and one or more phosphate groups. Nucleotides are the monomeric units of a nucleic acid sequence. In RNA, the sugar is a ribose, and in DNA a deoxyribose, i.e. a sugar lacking a hydroxyl group that is present in ribose. The nitrogenous base is a derivative of purine or pyrimidine.
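The library sizes quoted above can be checked with a short enumeration. The following Python sketch is illustrative only; the function name `build_library` and the label names L1, L2, ... assigned in the order A, T, G, C are assumptions chosen to mirror the XXAXX-L1 to XXCXX-L4 example, and the pattern is assumed to contain only the characters X and N.

```python
from itertools import product

BASES = "ATGC"  # order chosen so that labels follow the A, T, G, C example


def build_library(pattern):
    """Enumerate a cassette library for a pattern such as "XXNXX".

    Returns a dict mapping a label name to the list of sequences carrying it.
    One label is assigned per combination of bases at the "N" positions; the
    "X" positions take every possible base.
    """
    n_defined = pattern.count("N")
    n_unknown = pattern.count("X")
    library = {}
    for index, defined in enumerate(product(BASES, repeat=n_defined), start=1):
        members = []
        for unknown in product(BASES, repeat=n_unknown):
            d, u = iter(defined), iter(unknown)
            members.append(
                "".join(next(d) if c == "N" else next(u) for c in pattern)
            )
        library[f"L{index}"] = members
    return library


if __name__ == "__main__":
    single = build_library("XXNXX")
    print(len(single), len(single["L1"]), sum(len(m) for m in single.values()))
    double = build_library("XNNXX")
    print(len(double))  # number of labels needed for two defined positions
```

For XXNXX this gives four labelled mixtures of 256 sequences each (1024 in total), and for XNNXX it gives 16 labels, in agreement with the figures stated in the text.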
The purines are adenine (A) and guanine (G), and the pyrimidines are cytosine (C) and thymine (T) (or in the context of RNA, uracil (U)). The C-1 atom of deoxyribose is bonded to N-1 of a pyrimidine or N-9 of a purine. A nucleotide is also a phosphate ester of a nucleoside, with esterification occurring on the hydroxyl group attached to C-5 of the sugar. Nucleotides are usually mono-, di- or triphosphates. In the context of the present invention, the term "incorporating" means becoming part of a nucleic acid (e.g. DNA) molecule, polynucleotide, oligonucleotide or primer. An oligonucleotide or polynucleotide refers to a synthetic or natural molecule comprising a covalently linked sequence of nucleotides which are joined by a phosphodiester or modified phosphodiester bond between the 3' position of the pentose on one nucleotide and the 5' position of the pentose on an adjacent nucleotide. In the context of the present invention, the terms "polynucleotide" and "oligonucleotide" are used interchangeably. The present invention can make use of conventional detectable labels. Detection can be carried out by any suitable method, including fluorescence spectroscopy or by other optical means. The preferred label is a fluorophore, which, after absorption of energy, emits radiation at a defined wavelength. Many suitable fluorescent labels are known; see, for example, Buschmann et al., Bioconjugate Chemistry, 2003, 14, 195-204; Panchuk-Valoshina et al., Journal of Histochemistry & Cytochemistry, 1999, 47, 9, pp 1179-1188; Henegariu et al., Nature Biotechnology, 2000, 18, 345-348. Other commercially available fluorescent labels include, but are not limited to, fluorescein, rhodamine (including TMR, Texas Red and Rox), Alexa, BODIPY, acridine, coumarin, pyrene, benzanthracene and the cyanines. Multiple labels can also be used in the invention. For example, bi-fluorophore FRET cassettes (Tet. Let. 46:8867-8871, 2000) are well known in the art and can be utilised in the present invention.
Multi-fluor dendrimeric systems (J. Amer. Chem. Soc. 123:8101-8108, 2001) can also be used. Although fluorescent labels are preferred, other forms of detectable labels will be apparent as useful to those of ordinary skill. For example, microparticles, including quantum dots (Empodocles et al., Nature 399:126-130, 1999), gold nanoparticles (Reichert et al., Anal. Chem. 72:6025-6029, 2000) and microbeads (Lacoste et al, Proc. Natl. Acad. Sci USA 97(17): 9461-9466, 2000) can all be used. Multi-component labels can also be used in the invention. A multi-component label is one which is dependent on the interaction with a further compound for detection. The most common multi-component label used in biology is the biotin- streptavidin system. Biotin is used as the label attached to the nucleotide base. Streptavidin is then added separately to enable detection to occur. Other multi- component systems are available. For example, dinitrophenol has a commercially available fluorescent antibody that can be used for detection. The ligation cassettes used in the method of the invention may use a cleavable linker to attach the label to the nucleotide. The use of a cleavable linker ensures that the label can, if required, be removed after detection, avoiding any interfering signal with any ligation cassette incorporated subsequently. Generally, the use of cleavable linkers is preferable, particularly in the methods of the invention hereinbefore described. Cleavable linkers are known in the art, and conventional chemistry can be applied to attach a linker to a nucleotide base and a label. The linker can be cleaved by any suitable method, including exposure to acids, bases, nucleophiles, electrophiles, radicals, metals, reducing or oxidising agents, light, temperature, enzymes etc. The linkers as discussed herein may also be cleaved with the same catalyst used to cleave the 3'O-blocking group bond. 
Suitable linkers can be adapted from standard chemical blocking groups, as disclosed in Greene & Wuts, Protective Groups in Organic Synthesis, John Wiley & Sons. Further suitable cleavable linkers used in solid-phase synthesis are disclosed in Guillier et al. (Chem. Rev. 100:2092-2157, 2000). The use of the term "cleavable linker" is not meant to imply that the whole linker is required to be removed from e.g., the nucleotide base. Where the detectable label is attached to the base, the nucleoside cleavage site can be located at a position on the linker that ensures that part of the linker remains attached to the nucleotide base after cleavage. Where the detectable label is attached to the base, the linker can be attached at any position on the nucleotide base provided that Watson-Crick base pairing can still be carried out. In the context of purine bases, it is preferred if the linker is attached via the 7-position of the purine or the preferred deazapurine analogue, via an 8-modified purine, via an N-6 modified adenosine or an N-2 modified guanine. For pyrimidines, attachment is preferably via the 5-position on cytosine, thymidine or uracil and the N-4 position on cytosine. Suitable linkers include, but are not limited to, disulfide linkers, acid labile linkers (including dialkoxybenzyl linkers, Sieber linkers, indole linkers and t-butyl Sieber linkers), electrophilically cleavable linkers, nucleophilically cleavable linkers, photocleavable linkers, and linkers cleavable under reductive conditions, under oxidative conditions, via safety-catch mechanisms and by elimination mechanisms.

A. Electrophilically cleaved linkers. Electrophilically cleaved linkers are typically cleaved by protons and include cleavages sensitive to acids. Suitable linkers include the modified benzylic systems such as trityl, p-alkoxybenzyl esters and p-alkoxybenzyl amides. Other suitable linkers include tert-butyloxycarbonyl (Boc) groups and the acetal system. The use of thiophilic metals, such as nickel, silver or mercury, in the cleavage of thioacetal or other sulfur-containing protecting groups can also be considered for the preparation of suitable linker molecules.

B. Nucleophilically cleaved linkers. Nucleophilic cleavage is also a well recognised method in the preparation of linker molecules. Groups such as esters that are labile in water (i.e., can be cleaved simply at basic pH) and groups that are labile to non-aqueous nucleophiles, can be used. Fluoride ions can be used to cleave silicon-oxygen bonds in groups such as triisopropyl silane (TIPS) or t-butyldimethyl silane (TBDMS).

C. Photocleavable linkers. Photocleavable linkers have been used widely in carbohydrate chemistry. It is preferable that the light required to activate cleavage does not affect the other components of the modified nucleotides. For example, if a fluorophore is used as the label, it is preferable if this absorbs light of a different wavelength to that required to cleave the linker molecule. Suitable linkers include those based on O-nitrobenzyl compounds and nitroveratryl compounds. Linkers based on benzoin chemistry can also be used (Lee et al, J. Org. Chem. 64:3454-3460, 1999).

D. Cleavage under reductive conditions. There are many linkers known that are susceptible to reductive cleavage. Catalytic hydrogenation using palladium-based catalysts has been used to cleave benzyl and benzyloxycarbonyl groups. Disulfide bond reduction is also known in the art.

E. Cleavage under oxidative conditions. Oxidation-based approaches are well known in the art. These include oxidation of p-alkoxybenzyl groups and the oxidation of sulfur and selenium linkers. The use of aqueous iodine to cleave disulfides and other sulfur or selenium-based linkers is also within the scope of the invention.

F. Safety-catch linkers. Safety-catch linkers are those that cleave in two steps. In a preferred system the first step is the generation of a reactive nucleophilic center, followed by a second step involving an intramolecular cyclization that results in cleavage. For example, levulinic ester linkages can be treated with hydrazine or photochemistry to release an active amine, which can then be cyclised to cleave an ester elsewhere in the molecule (Burgess et al, J. Org. Chem. 62:5165-5168, 1997).

G. Cleavage by elimination mechanisms. Elimination reactions can also be used. For example, the base-catalysed elimination of groups such as Fmoc and cyanoethyl, and palladium-catalysed reductive elimination of allylic systems, can be used.

As well as the cleavage site, the linker can comprise a spacer unit. The spacer distances e.g., the nucleotide base from the cleavage site or label. The length of the linker is unimportant provided that the label is held a sufficient distance from the nucleotide so as not to interfere with any interaction between the nucleotide and an enzyme. In a preferred embodiment the linker may consist of the same functionality as the block. This will make the deprotection and deblocking process more efficient, as only a single treatment will be required to remove both the label and the block.

The sequencing methods of the present invention are carried out with the target polynucleotide arrayed on a solid support. Multiple target polynucleotides can be immobilised on the solid support through linker molecules, or can be attached to particles, e.g., microspheres, which can also be attached to a solid support material. The polynucleotides can be attached to the solid support by a number of means, including the use of biotin-avidin interactions. Methods for immobilizing polynucleotides on a solid support are well known in the art, and include lithographic techniques and "spotting" individual polynucleotides in defined positions on a solid support. Suitable solid supports are known in the art, and include glass slides and beads, ceramic and silicon surfaces and plastic materials. The support is usually a flat surface although microscopic beads (microspheres) can also be used and can in turn be attached to another solid support by known means. The microspheres can be of any suitable size, typically in the range of from 10 nm to 100 nm in diameter.
In a preferred embodiment, the polynucleotides are attached directly onto a planar surface, preferably a planar glass surface. Attachment will preferably be by means of a covalent linkage. Preferably, the arrays that are used are single molecule arrays that comprise polynucleotides in distinct optically resolvable areas, e.g., as disclosed in International Application No. WO00/06770. To carry out the ligation reaction it will usually be necessary to first anneal a primer sequence to the target polynucleotide, the primer sequence and cassette to be incorporated being recognised by the ligase enzyme or those chemicals necessary to perform the chemical ligation reaction and which primer acts as an initiation site for the subsequent extension of the complementary strand. The primer sequence may be added as a separate component with respect to the target polynucleotide. Alternatively, the primer and the target polynucleotide may each be part of one single stranded molecule, with the primer portion forming an intramolecular duplex with a part of the target, i.e., a hairpin loop structure. This structure may be immobilised to the solid support at any point on the molecule. Other conditions necessary for carrying out the ligation reaction, including temperature, pH, buffer compositions etc., will be apparent to those skilled in the art. The term "hairpin loop structure" refers to a molecular stem and loop structure formed from the hybridization of complementary polynucleotides that are covalently linked. The stem comprises the hybridized polynucleotides and the loop is the region that covalently links the two complementary polynucleotides. Anything from a 5 to 20 (or more) base pair double strand nucleic acid may be used to form the stem. In one embodiment, the structure may be formed from single-stranded polynucleotide complementary regions. The loop in this embodiment may be anything from 2 or more non-hybridised nucleotides. 
In a second embodiment, the structure is formed from two separate polynucleotides with complementary regions, the two polynucleotides being linked and the loop being at least partially formed from a linker moiety. The linker moiety forms a covalent attachment between the ends of the two polynucleotides. Linker moieties suitable for use in this embodiment will be apparent to the skilled practitioner. For example, the linker moiety may be polyethylene glycol (PEG). Cassettes that are not incorporated into the nascent polynucleotide chain are removed, for example, by subjecting the array to a washing step, and detection of the incorporated labels may then be carried out. The sequencing method can be carried out on both single polynucleotide molecule and multi-polynucleotide molecule arrays, i.e., arrays of distinct individual polynucleotide molecules and arrays of distinct regions comprising multiple copies of one individual polynucleotide molecule, including for example clustered arrays. Single molecule arrays allow each individual polynucleotide to be resolved separately. The use of single molecule arrays is preferred. Sequencing single molecule arrays non-destructively allows a spatially addressable array to be formed. However, the density of the arrays is not critical. Thus, the present invention can make use of a high density of immobilised molecules, and these are preferable. For example, arrays with a density of 10 to 10 molecules per cm may be used. Preferably, the density is at least 10⁷/cm² and typically up to 10⁸/cm². These high density arrays are in contrast to other arrays which may be described in the art as "high density" but which are not necessarily as high and/or which do not allow single molecule resolution. On a given array, it is the number of single oligonucleotides, rather than the number of features, that is important. 
The concentration of nucleic acid molecules applied to the support can be adjusted in order to achieve the highest density of addressable single oligonucleotide molecules. At lower application concentrations, the resulting array will have a high proportion of addressable single oligonucleotide molecules at a relatively low density per unit area. As the concentration of nucleic acid molecules is increased, the density of addressable single oligonucleotide molecules will increase, but the proportion of single oligonucleotide molecules capable of being addressed will actually decrease. One skilled in the art will therefore recognize that the highest density of addressable single oligonucleotide molecules can be achieved on an array with a lower proportion or percentage of single oligonucleotide molecules relative to an array with a high proportion of single oligonucleotide molecules but a lower physical density of those molecules. Using the methods and apparatus of the present invention, it may be possible to image at least 10⁷ or 10⁸ molecules. Fast sequential imaging may be achieved using a scanning apparatus; shifting and transfer between images may allow higher numbers of hairpin oligonucleotide molecules to be imaged. The extent of separation between the individual oligonucleotide molecules on the array will be determined, in part, by the particular technique used for resolution. Apparatus used to image molecular arrays are known to those skilled in the art. For example, a confocal scanning microscope may be used to scan the surface of the array with a laser to image directly a fluorophore incorporated on the individual molecule by fluorescence. Alternatively, a sensitive 2-D detector, such as a charge-coupled detector, can be used to provide a 2-D image representing the individual oligonucleotide molecules on the array.
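The trade-off between total density and the proportion of addressable molecules can be illustrated with a simple numerical sketch. The model below is an assumption made for illustration and is not taken from the specification: molecules are taken to land at random (Poisson statistics), and a molecule counts as addressable only if no other molecule lies within the minimum resolvable separation. The function names are hypothetical, and the 250 nm value is the minimum separation quoted in the description for a 2-D detector at 100x magnification.

```python
import math
from typing import Tuple


def addressable_density(total_density_per_cm2: float,
                        min_separation_nm: float) -> Tuple[float, float]:
    """Return (fraction addressable, addressable density per cm^2).

    Assumed model: random (Poisson) placement; a molecule is addressable
    when no neighbour lies within min_separation_nm of it.
    """
    d_cm = min_separation_nm * 1e-7  # 1 nm = 1e-7 cm
    # Probability that a disc of radius d around a molecule is empty.
    fraction = math.exp(-math.pi * total_density_per_cm2 * d_cm ** 2)
    return fraction, total_density_per_cm2 * fraction


def peak_total_density(min_separation_nm: float) -> float:
    """Total density that maximises the addressable density in this model."""
    d_cm = min_separation_nm * 1e-7
    return 1.0 / (math.pi * d_cm ** 2)


if __name__ == "__main__":
    peak = peak_total_density(250.0)
    for scale in (0.1, 1.0, 10.0):
        frac, usable = addressable_density(scale * peak, 250.0)
        print(f"total={scale * peak:.2e}/cm2  "
              f"addressable fraction={frac:.3f}  "
              f"addressable density={usable:.2e}/cm2")
```

In this model the addressable density peaks where only about 37% (1/e) of the molecules are individually addressable, which is consistent with the statement that the highest addressable density is reached at a lower proportion of single molecules.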
Resolving single molecules on the array with a 2-D detector can be done if, at 100x magnification, adjacent oligonucleotide molecules are separated by a distance of at least approximately 250 nm, preferably at least 300 nm and more preferably at least 350 nm. It will be appreciated that these distances are dependent on magnification, and that other values can be determined accordingly, by one of ordinary skill in the art. Other techniques such as scanning near-field optical microscopy (SNOM) are available which are capable of greater optical resolution, thereby permitting more dense arrays to be used. For example, using SNOM, adjacent oligonucleotide molecules may be separated by a distance of less than 100 nm, e.g. 10 nm. For a description of scanning near-field optical microscopy, see Moyer et al, Laser Focus World (1993) 29(10). An additional technique that may be used is surface-specific total internal reflection fluorescence microscopy (TIRFM); see, for example, Vale et al, Nature (1996) 380:451-453. Using this technique, it is possible to achieve wide-field imaging (up to 100 μm x 100 μm) with single molecule sensitivity. This may allow arrays of greater than 10 resolvable molecules per cm to be used. Additionally, the techniques of scanning tunnelling microscopy (Binnig et al, Helvetica Physica Acta (1982) 55:726-735) and atomic force microscopy (Hansma et al, Ann. Rev. Biophys. Biomol. Struct. (1994) 23:115-139) are suitable for imaging the arrays of the present invention. Other devices which do not rely on microscopy may also be used, provided that they are capable of imaging within discrete areas on a solid support. Multi-polynucleotide or clustered arrays of nucleic acid molecules for use in conjunction with the sequencing method of the invention may be produced using techniques generally known in the art.
By way of example, WO 98/44151 and WO 00/18957 both describe methods of nucleic acid amplification which allow amplification products to be immobilised on a solid support in order to form arrays comprised of clusters or "colonies" of immobilised nucleic acid molecules. The contents of WO 98/44151 and WO 00/18957 relating to the preparation of clustered arrays are incorporated herein by reference. The nucleic acid molecules present on the clustered arrays prepared according to these methods are suitable templates for sequencing using the method of the invention. However, the invention is not intended to be limited to sequencing on clustered arrays prepared according to these specific methods. The sequence information obtained from the method of the invention can be used to identify the full sequence of the target nucleic acid molecule. The reference sequence is any suitable sequence that represents the normal/general genome. Suitable reference genomes have been identified as part of the various genome sequencing efforts, for example the Human Genome Project. This invention may be further understood with reference to the following examples, which serve to illustrate the invention and not to limit its scope.

Examples

Example 1: Sequencing using cycles of ligation and phosphorylation

Preparation of the Ligation Cassette

Four separate 5-mer oligonucleotide mixtures were prepared on a DNA synthesiser using a 'universal' support (Glen Research). The universal support is used to ensure the first position can be fully randomised. The first two positions were completely randomised using all four nucleotides. The third position in each synthesis was a single nucleoside carrying an amino modification for labelling. The final two bases were completely randomised. The pentamer oligonucleotides were cleaved from the column whilst still carrying the 5'-DMT protecting group. The DMT oligonucleotides were purified using a reverse phase cartridge (Biosearch Technologies) to bind all the DMT sequences. Treatment of the bound material with 3% trichloroacetic acid to cleave the DMT group allowed elution of the deprotected pentamers with their free amino groups.

The amino phosphoramidite monomers used in the synthesis are shown in Figure 5.
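The composition of the four pentamer mixtures can be sketched by enumeration. The snippet below is an illustration only: it assumes each mixture contains every sequence of the form NNXNN, where X is the single defined (labelled) base at the third position and N is any of the four bases. The function and variable names are hypothetical.

```python
from itertools import product
from typing import Dict, List

BASES = "ACGT"


def cassette_pool(defined_base: str) -> List[str]:
    """All pentamers NNXNN whose third position is fixed to defined_base."""
    if defined_base not in BASES:
        raise ValueError(f"unknown base: {defined_base!r}")
    return [a + b + defined_base + c + d
            for a, b, c, d in product(BASES, repeat=4)]


# One pool per defined base, mirroring the four separate syntheses.
pools: Dict[str, List[str]] = {base: cassette_pool(base) for base in BASES}

if __name__ == "__main__":
    for base, pool in pools.items():
        print(base, len(pool), pool[:3])
```

Under this assumption each mixture holds 4⁴ = 256 sequences, and the four mixtures together cover all 4⁵ = 1024 possible pentamers exactly once.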

Each of the purified pentamer mixtures was reacted with an appropriate fluorescent dye such that each defined base could be separately detected. The fluorophores used were Alexa 488, Cy3, Alexa 594 and Alexa 647. The Alexa dyes were supplied by Molecular Probes and the Cy3 by Amersham. The fluorescent dyes were attached through a disulfide linker. Synthesis of a representative Cy3 dye-linker construct is shown below.

Cy3 disulfide linker. The starting disulfide (4.0 mg, 13.1 μmol) was dissolved in DMF (300 μL) and diisopropylethylamine (4 μL) was slowly added. The mixture was stirred at room temperature and a solution of Cy3 dye NHS ester (5 mg, 6.53 μmol) in DMF (300 μL) was added over 10 min. After 3.5 h, once the reaction was complete, the volatiles were evaporated under reduced pressure and the crude residue was HPLC purified. A mixture of Cy3 disulfide linker (2.5 μmol), disuccinimidyl carbonate (0.96 mg, 3.75 μmol) and DMAP (0.46 mg, 3.75 μmol) was dissolved in dry DMF (0.5 mL) and stirred at room temperature for 10 min. The reaction was monitored by TLC (MeOH:CH₂Cl₂ 3:7) until all the dye linker was consumed. The activated dye-linker solution was used to react with the mixture of pentamers (20 μM final conc. in 500 μL NaHCO₃), mixed 1:1 to give a 10 μM DNA conc. in 50% aq DMF. The reaction was left for 4 h at rt. The oligonucleotides were purified by reverse phase HPLC with the fluorescent pentamers separable from both the unlabelled DNA and the unreacted dye-linker. The disulfide dye-linker constructs were used directly in cycles of sequencing.
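The quantities quoted in this procedure can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given above; the helper names are hypothetical, and the 1:1 mixing step is assumed to combine equal volumes (500 μL of pentamer solution with 0.5 mL of dye-linker solution).

```python
def implied_molar_mass(mass_mg: float, amount_umol: float) -> float:
    """Molar mass in g/mol implied by a quoted mass and amount (mg/umol * 1000)."""
    return mass_mg / amount_umol * 1000.0


def equivalents(reagent_umol: float, limiting_umol: float) -> float:
    """Molar equivalents of a reagent relative to the limiting component."""
    return reagent_umol / limiting_umol


def diluted_concentration(conc_uM: float, volume_uL: float,
                          added_volume_uL: float) -> float:
    """Concentration after adding a second solution (volumes assumed additive)."""
    return conc_uM * volume_uL / (volume_uL + added_volume_uL)


if __name__ == "__main__":
    # Disulfide relative to Cy3 NHS ester: roughly a two-fold excess.
    print("disulfide equivalents:", round(equivalents(13.1, 6.53), 2))
    # Disuccinimidyl carbonate and DMAP relative to the dye-linker.
    print("DSC equivalents:", equivalents(3.75, 2.5))
    print("implied M(DSC):", round(implied_molar_mass(0.96, 3.75), 1), "g/mol")
    print("implied M(DMAP):", round(implied_molar_mass(0.46, 3.75), 1), "g/mol")
    # 20 uM pentamers in 500 uL mixed 1:1 with the dye-linker solution.
    print("final DNA conc:", diluted_concentration(20.0, 500.0, 500.0), "uM")
```

The figures are internally consistent: the activating reagents are each used at 1.5 equivalents, and equal-volume mixing halves the 20 μM pentamer concentration to the stated 10 μM.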

Slide functionalization and DNA Immobilization: Bromoacetylated slides were used as support for DNA immobilization. Glass slides were transferred into racks and washed with agitation and without drying between stages as follows: overnight in detergent (Decon 90), rinse (water), overnight in 1 M NaOH, rinse (water), 15 minutes in 0.1 M HCl, rinse (water), and then stored in ethanol. A solution of 0.2% total silane, as a mixture of tetraethoxysilane and triethoxysilylpropyl (bromoacetamide) at 10000:1 in 95% aqueous ethanol (adjusted to approximately pH 4.5 with 5% H₂SO₄) was prepared. Hydrolysis of the silanes and silanol formation took place during a 5 minute preincubation step with sonication. The cleaned slides were immersed in the silane solution for 6 minutes before they were removed and washed with isopropanol. The slides were then dried under an argon stream and cured in an oven at 120 °C for 90 minutes. Oligonucleotides with thiophosphate modifications were covalently attached from solution (0.1 M potassium phosphate buffer pH 7.0) for 15 minutes at ambient temperature. The thiophosphate modification was attached during oligonucleotide synthesis through an abasic nucleoside phosphoramidite and used as supplied (Oswel). Post-immobilization, the slides were rigorously washed by vortexing (20 seconds each step) in MilliQ grade water, 10 mM Tris pH 8.0, 10 mM EDTA solution at 95 °C, MilliQ grade water before drying under argon. An alternative method for slide functionalisation is through biotinylated BSA: A fused silica slide is washed with detergent, rinsed and dried and placed in a sealed flow cell. A solution of biotinylated BSA (0.2 mg/mL) in buffer A (Tris.HCl (pH 8; 10 mM), NaCl (50 mM)) (0.5 mL) is washed into the cell and incubated for 30 mins. The slide is washed with more buffer A (10 mL) and treated with a solution of streptavidin (0.2 mg/mL, 0.5 mL) in buffer A. The slides are washed and treated with 20 pM of the biotinylated hairpin DNA.
The samples are left at room temperature for 30 mins to form the array then washed with buffer A.

The DNA used to verify the protocol was obtained as a hairpin with an inbuilt 3'-overhang of known sequence and a 5'-phosphate moiety attached. Treatment of the single molecule array of hairpin DNA with a mixture of the four labelled pentamers (10 μM total DNA) plus T4 DNA ligase and ATP for 12 h at 4 °C gave attachment of a single correct fluorophore to each hairpin. Less than 1% of the incorrect fluorophores were visible on the array. In the absence of the ligase very little surface contamination could be detected. The fluorophores were cleaved by treatment with DTT (100 mM) for 10 min, capped with iodoacetamide (431 mM) in 0.1 mM phosphate pH 6.5 for 5 minutes and washed to give a blank surface. The 5'-hydroxyl of the new hairpin was activated using T4 polynucleotide kinase at 37 °C for 30 min and the 12 h ligation was repeated.
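The cycle described above (ligate, image, cleave the label, cap, phosphorylate, repeat) can be summarised as a toy simulation. This is a minimal sketch under stated assumptions rather than a model of the actual chemistry: ligation is treated as error-free, the overhang is written in the order in which successive cassettes cover it, and the observed label is represented simply by the defined base of the cassette that ligates. All names are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
CASSETTE_LENGTH = 5   # pentamer cassettes
DEFINED_INDEX = 2     # third position carries the labelled, defined base


def sequence_by_ligation(template_overhang: str,
                         cycles: int) -> List[Tuple[int, str, str]]:
    """Simulate idealised ligation cycles along a template overhang.

    Returns one tuple per completed cycle:
    (template position read, defined base of the ligated cassette,
     inferred template base).
    """
    calls = []
    for cycle in range(cycles):
        start = cycle * CASSETTE_LENGTH
        window = template_overhang[start:start + CASSETTE_LENGTH]
        if len(window) < CASSETTE_LENGTH:
            break  # not enough template left for a full cassette
        template_base = window[DEFINED_INDEX]
        # Only the pool whose defined base pairs with the template ligates.
        cassette_base = COMPLEMENT[template_base]
        calls.append((start + DEFINED_INDEX, cassette_base, template_base))
        # Label cleavage, capping and 5'-phosphorylation are implicit here:
        # the next cycle simply starts five bases further along.
    return calls


if __name__ == "__main__":
    for call in sequence_by_ligation("ACGTACGTACGTACG", cycles=3):
        print(call)
```

Each cycle advances the growing strand by one cassette length, so the defined base is read at a fixed spacing of five positions along the template.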

Example 2: Sequencing using cycles of ligation and deblocking

Preparation of the Ligation Cassette

Four separate 5-mer oligonucleotide mixtures were prepared using a DNA synthesiser. The synthesis was performed using inverted phosphoramidites with a 3'-DMT group and a 5'-phosphoramidite functionality. The first residue was attached to the support using a '3-phosphate CPG' (Glen Research) to give a terminal phosphate at the 5'-position due to the inverted amidites. The first two positions were completely randomised using all four nucleotides. The third position in each synthesis was a single nucleoside without any modification. The fourth base was a mixture of all four nucleotides. The fifth nucleotide reaction was a mixture of four phosphoramidites in which each phosphoramidite contains a 3'-azidomethyl protecting group and an N-MMT protected side chain. The pentamer oligonucleotides were cleaved from the column whilst still carrying the N-MMT protecting group. The MMT oligonucleotides were purified using a reverse phase cartridge (Biosearch Technologies) to bind all the MMT sequences. Treatment of the bound material with 3% trichloroacetic acid to cleave the MMT group allowed elution of the deprotected pentamers with their free amino groups.

The reagents used are shown in Figure 6.

The four pentamer oligonucleotide mixtures were fluorescently labelled and HPLC purified as described in Example 1.

The single molecule arrays of unlabelled hairpins are prepared as described in Example 1, with the hairpin DNA having a 5'-overhang and a free 3'-hydroxyl group. Cycles of ligation chemistry are performed as described in Example 1 using T4 ligase and ATP at 4 °C for 12 h. The attached fluorophores are visualised and less than 1% of the incorrect fluorophore is detected. The azidomethyl group prevents more than a single cassette being added to each surface bound hairpin. The azidomethyl group and the label were removed by treatment with phosphine (Tris-(2-carboxyethyl)-phosphine) at r.t. for 15 mins. The disulfide was capped with iodoacetamide (431 mM) for 5 mins, the surface was seen to be free of dye molecules and the ligation cycle was repeated.

In both Examples 1 and 2, repetition of the cycles allowed correct sequence determination of every fifth base on the DNA hairpin. This can be used to sequence a sample derived from genomic DNA and obtain the identity of every fifth base in the sequence. Repetition of this over 20-30 cycles allows unique sequence determination of the fragment on the surface. Alignment of this sequence information with a database of sequence information compiled from every fifth base in the human genome identifies any differences between the sequenced fragment and the consensus. Sequencing of hundreds of millions of fragments in parallel allows determination of all the differences between the sample under investigation and the consensus genome in a single experiment.
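The alignment step can be illustrated with a small sketch. It is not the method used in the examples, which do not specify an alignment algorithm: it is a brute-force scan, assumed here for clarity, that compares a read made of every-fifth-base calls against every offset of a reference and reports the best-matching offset together with the reference coordinates at which the read differs. The reference is a randomly generated toy sequence, and all names are hypothetical. For scale, 20 calls can take 4²⁰ (about 1.1 × 10¹²) distinct values.

```python
import random
from typing import List, Tuple

STRIDE = 5  # one base call per five-base cassette


def spaced_signature(reference: str, offset: int, n_calls: int) -> str:
    """Every fifth base of the reference, starting at offset."""
    return reference[offset:offset + STRIDE * n_calls:STRIDE]


def align_spaced_read(read_calls: str, reference: str) -> Tuple[int, List[int]]:
    """Return (best offset, reference coordinates where the read differs)."""
    n = len(read_calls)
    span = STRIDE * (n - 1) + 1  # reference bases covered by the read
    best_offset, best_mismatches = -1, None
    for offset in range(len(reference) - span + 1):
        candidate = spaced_signature(reference, offset, n)
        mismatches = [offset + STRIDE * i
                      for i, (r, c) in enumerate(zip(read_calls, candidate))
                      if r != c]
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_offset, best_mismatches = offset, mismatches
    if best_mismatches is None:
        raise ValueError("reference is shorter than the span of the read")
    return best_offset, best_mismatches


# Toy data: a random reference and a 20-call read carrying one difference.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
true_offset = 137
calls = list(spaced_signature(reference, true_offset, 20))
calls[4] = next(b for b in "ACGT" if b != calls[4])  # simulated variant
offset, differences = align_spaced_read("".join(calls), reference)

if __name__ == "__main__":
    print("best offset:", offset)
    print("differences at reference coordinates:", differences)
```

A practical implementation would index the reference rather than scan it, but the sketch shows how spaced calls both place a fragment and expose its differences from the consensus.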

Claims


1. A method of determining the sequence of a target nucleic acid molecule comprising,

(i) immobilising fragments of said target nucleic acid molecule onto the surface of a solid support to form an array of nucleic acid molecules which are capable of interrogation, each of said molecules being immobilised other than at that part of the molecule that can be interrogated; (ii) contacting said molecules with a library of ligation cassettes each comprising an oligonucleotide having one or more defined bases and having a characteristic label thereon, under conditions that permit ligation of one of said cassettes to a primer sequence hybridised or otherwise maintained in a spatial relationship with said target nucleic acid molecules, each of said cassettes being suitably blocked to permit only a single ligation event; (iii) identifying the characteristic label(s) attached to any ligated cassette and removing the blocking group associated therewith and optionally removing said characteristic label; (iv) repeating steps (i) to (iii) for a sufficient number of times to generate a complementary oligonucleotide sequence to each of said target nucleic acid molecules, each of said complementary oligonucleotide sequences having known nucleotides spaced intermittently along their length that can be placed in the context of a reference sequence and comparing the overlapping sequences of said oligonucleotide sequences in the context of the reference sequence to determine the sequence of the target nucleic acid molecule.

2. A method according to claim 1, wherein each of said ligation cassettes comprises from 3 to 20 nucleotides.

3. A method according to any preceding claim wherein the target nucleic acid molecules are capable of being individually resolved by optical microscopy.

4. A method according to any preceding claim wherein the array of part (i) is a clustered array.

5. A method according to any preceding claim wherein said ligation step is effected by either chemical or enzymatic ligation of said ligation cassette to said primer sequence.

6. A method according to claim 5 wherein said enzymatic ligation is effected by a ligase enzyme.

7. A method according to claim 6 wherein said ligase is T4 DNA ligase.

8. A method according to any preceding claim wherein said ligation cassette is blocked either by a blocking group at the 3' end or the absence of a phosphate moiety at either the 3' or 5' ends of said ligation cassette.

9. A method according to any preceding claim wherein said target nucleic acid molecule is immobilised at either of its 5' or 3' ends.

10. A method according to any preceding claim wherein said label is attached at any of the 5' or 3' ends of the cassette or to the one or more defined bases.

11. A method according to any preceding claim wherein the target nucleic acid molecules comprise a hairpin molecule.

12. A method according to any preceding claim wherein said steps (i) to (iii) are repeated from 15 to 30 times.