WO1998049341A2 - Polynucleotide sequencing using semi-degenerate primers - Google Patents

Publication number
WO1998049341A2
WO1998049341A2 PCT/GB1998/001233 GB9801233W WO9849341A2 WO 1998049341 A2 WO1998049341 A2 WO 1998049341A2 GB 9801233 W GB9801233 W GB 9801233W WO 9849341 A2 WO9849341 A2 WO 9849341A2
Authority
WO
WIPO (PCT)
Application number
PCT/GB1998/001233
Other languages
French (fr)
Other versions
WO1998049341A3 (en)
Inventor
Andrew Webster
Original Assignee
Andrew Webster
Application filed by Andrew Webster
Priority to EP98919324A (EP0979307A2)
Priority to AU72203/98A (AU7220398A)
Publication of WO1998049341A2
Publication of WO1998049341A3

Classifications

    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q - MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 - Methods for sequencing

Definitions

  • This invention relates to a method for determining the sequence of a polynucleotide using semi-degenerate oligonucleotide primers.
  • DNA deoxyribonucleic acid
  • RNA ribonucleic acid
  • the standard sequencing method involves the elongation of a DNA primer sequence along a polynucleotide template using a DNA polymerase enzyme, deoxynucleotide triphosphates and dideoxynucleotide triphosphates, as proposed by F. Sanger et al. (PNAS 1977; 74:5463-).
  • the latter species terminate the elongation reaction and, if labelled specifically (e.g. with a specific fluorophore), or separated in four different reaction tubes, allow the determination of sequence.
  • This method, whilst accurate, has a number of disadvantages.
  • the range of accurate sequence that can be determined in one reaction and electrophoresis run is limited to 500-800 bases due mainly to the scarcity of long products which have escaped the earlier dideoxynucleotide termination.
  • experiments to determine unknown sequence cannot easily be performed in parallel as the result of one sequencing reaction needs to be identified before a further clone or polymerase chain reaction (PCR) product is retrieved for the determination of further adjacent sequence.
  • the concentration of template needs to be high such that pre-amplification of the template DNA is necessary before the sequencing step.
  • only one strand of the template DNA can be determined during each reaction, a separate reaction being needed to sequence the complementary strand to check the validity of the determined sequence.
  • a second method to sequence a polynucleotide has been proposed using hybridisation of unknown DNA to a large panel of known oligonucleotides (Drmanac et al., Genomics 1989; 4:114-; Bains et al., ibid. 1991; 11:294-; Southern et al., ibid. 1992; 13:1008-).
  • Such a technique is potentially powerful, especially since the development of microscopic oligonucleotide arrays using photolithographic technology (Fodor et al., Science 1991; 251:767-; Pease et al., PNAS 1994; 91:5022-).
  • hybridisation 'events' occur in parallel on the same array.
  • the number of outcomes from a hybridisation experiment greatly exceeds those from an experiment involving electrophoresis alone, and so hybridisation may potentially be of value in the determination of a megabase sequence.
  • hybridisation of one sequence is not entirely specific to its complementary sequence alone and cross-hybridisations involving one or more mismatched nucleotides are difficult to avoid. This inaccuracy prevents the use of an array to determine unknown sequence.
  • because of the limited length of each oligomer probe on the array, there is a limit to the size of non-contiguous repeated units within the template sequence that can be unambiguously ordered.
  • adjacent repeats of identical sequence such as di- or tri- nucleotides, which occur very commonly in mammalian DNA, can never be accurately determined by using hybridisation alone.
  • the polymerase chain reaction has proved an extremely powerful technique in nucleotide analysis. In its most straightforward form, this involves the amplification of sequence between two smaller known sequences that can be encoded by two oligonucleotides known as 'primers'.
  • a DNA polymerase enzyme extends a 5' to 3' sequence of nucleotides from each primer, complementary to the template sequence. Cycles of denaturation, primer annealing and polymerase elongation are performed to manufacture many identical copies of the desired double-stranded DNA between these two primer sequences.
  • the primer nucleotide has to anneal to its complementary sequence in the template DNA and the affinity and specificity of this annealing process can be controlled to some extent by the annealing temperature and the salt concentration.
  • with DNA polymerase enzymes that do not possess a 3' to 5' exonuclease function (e.g. Taq polymerase), elongation of the primer sequence will not take place if the 3'-most nucleotides of the primer do not match exactly with the corresponding nucleotides on the template strand.
  • a number of mismatches 5' to these sites can be tolerated, particularly if the annealing temperature is lowered or salt concentration increased, mismatches being tolerated less well the closer they are to the 3' end of the primer.
  • the annealing temperature and salt concentration for each pair of primers has to be determined empirically, although calculation of the (G+C) to (A+T) ratio, primer length and salt concentration do allow a melting temperature (Tm) for each primer to be estimated.
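As an illustration of the kind of estimate mentioned above, the following minimal sketch uses one widely quoted salt-adjusted approximation, Tm = 81.5 + 16.6·log10[Na+] + 0.41·(%GC) − 675/N. The choice of this formula, the function name `estimate_tm` and the default salt concentration are assumptions made for the example and are not taken from the patent; the result is only a starting point for empirical optimisation.

```python
import math

def estimate_tm(primer, na_molar=0.05):
    """Rough melting temperature (deg C) from GC content, length and salt.

    Uses a common textbook approximation (an assumption, not the patent's
    own method): 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/N.
    """
    primer = primer.upper()
    n = len(primer)
    # Percentage of G and C bases in the primer
    gc_percent = 100.0 * (primer.count("G") + primer.count("C")) / n
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / n

# Hypothetical 20-mer with 50% GC content
tm = estimate_tm("ATGCATGCATGCATGCATGC", na_molar=0.05)
```

A GC-rich primer of the same length gives a higher estimate than an AT-rich one, which matches the qualitative dependence described in the text.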
  • Amplification using semi-degenerate primers shows reproducible bands with template DNA, suggesting that the non-random 3' nucleotides still confer specificity of binding.
  • the same conclusion can be drawn from the reproducible amplicons derived from short primers in the process known as random amplification of polymorphic DNA (RAPD).
  • RAPD random amplification of polymorphic DNA
  • a small single primer of arbitrary sequence is used at low annealing temperature in a PCR reaction to distinguish different strains of organism, on the basis of reproducible amplicon sizes, following amplification. If mismatches could occur all along the primer the bands would not be reproducible; the specificity of the reaction is therefore likely to be caused by the specific matching of the primer 3' ends.
  • the present invention involves a method for the analysis of polynucleotide molecules using semi-degenerate primers, and the polymerase reaction.
  • the method comprises the steps of: a) reacting a target polynucleotide with oligonucleotide primers, a polymerase enzyme, and the other reagents necessary for the polymerase reaction, wherein the oligonucleotide primers are chosen such that polymerase products of varying lengths are produced; and b) analysing the products of the said reaction or reactions; wherein the oligonucleotide primers include an array of semi-degenerate primers whose 3' ends comprise variations of one or more nucleotides A, T, G and C such that the array is complementary to all the polynucleotide sequence.
  • the semi-degenerate primers are used in a set of separate reactions to generate amplicons from a contiguous segment of polynucleotide, the sequence of which is to be determined.
  • Prior knowledge of the primers' 3' ends in each reaction, and subsequent sizing of fragments, allows the unambiguous determination of template sequence.
  • tagging of the 5' ends of the primers with a specifically designed sequence allows manipulations such as separation of amplicons on the basis of complementary sequence hybridisation, the addition of primer sites for further amplification, and incorporation of sites for in vitro transcription and hybridisation to oligonucleotide arrays.
  • the frequency of binding of primers on the template, the number of distinct amplicons of specific length generated and the proportion of template covered by amplicons from specific primers can be predicted using statistical analysis.
  • the technique can therefore be designed and adapted appropriately for each range of length of template that requires sequencing.
  • amplicon generation using semi-degenerate primers allows the 'translation' of unknown nucleotide sequence into a series of designed oligomer sequence 'tags' for hybridisation to oligonucleotide arrays, so that cross-hybridisation is minimised.
  • the length of product generated by the semi-degenerate sequencing reactions is not limited to a maximum of ~1000 bases, instead being limited only by the elongation time during reaction cycles and the specific type of polymerase enzyme used (e.g. 3.5 kb). This increases the number of nucleotides that can be determined in each reaction.
  • Embodiments of the invention allow parallel analysis of a large template polynucleotide in which the sequence is unknown. Experiments that can only be undertaken sequentially by the Sanger method can therefore be set up simultaneously by this method.
  • Embodiments of the invention interrogate both sense and anti-sense strands simultaneously during the sequencing reactions, acting as a check for accurate sequence determination.
  • Embodiments of the invention generate products suitable for hybridisation analysis, providing enormous power to determine the sequence of megabase polynucleotides in single experiments. This power is limited only by the size of the oligonucleotide array and allows the possibility of whole genome sequencing.

Detailed Description of the Invention

  • one non-degenerate specific primer is chosen.
  • This primer could be a specific sequence at the beginning of the DNA sequence to be determined, or complementary to part of a cloning vector if the template DNA has been cloned. Ideally, it should have a low annealing temperature.
  • the specific primer is labelled so that amplicons with incorporated primer can subsequently be identified. This can be done, for example, using end-labelling with 32P-dATP, biotinylation or attachment of a fluorophore.
  • the latter labelling scheme, by using more than one distinct fluorophore with different emission frequencies, may allow later electrophoresis of more than one reaction product in one lane of an electrophoretic gel.
  • a set of semi-degenerate primers is designed so that together their non-degenerate 3' ends cover all possible sequences in the template DNA. Random nucleotides, a universal base (e.g. inosine), or a combination of both are then used to make up the middle 5-10 nucleotides of the primer. Inosine has the disadvantage of allowing intercomplementarity of the semi-degenerate and specific primers, whilst random nucleotides have the disadvantage of decreasing by fourfold the effective concentration of each primer for each nucleotide position.
  • a specific tag sequence can optionally be added to the 5' end.
  • Such a tag sequence may encourage the binding of the semi-degenerate primer at the end of an amplicon following the first or subsequent reaction cycles, rather than further internally, thus increasing the number of large amplicons at the expense of small ones. Furthermore, if a similar tag is added to the 5' end of the specific primer, further rounds of amplification using primers to the tag sequences can be used to augment the concentration of final product.
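The primer layout described above (5' tag, random middle region, specified 3' end) and the fourfold dilution per random position can be sketched as follows. The function names, the placeholder tag sequence and the use of the letter `R` to mark a random position (following the notation used later in this description, not the IUPAC purine code) are assumptions made for illustration only.

```python
from itertools import product

BASES = "ACGT"

def semi_degenerate_primers(tag, n_random, k):
    """Return one primer pattern per possible k-mer 3' end.

    Each pattern is 5'-tag-(R)n-(N)k-3', where 'R' marks a position
    synthesised with a random mix of nucleotides (hypothetical notation).
    """
    middle = "R" * n_random
    return [tag + middle + "".join(end) for end in product(BASES, repeat=k)]

def effective_fraction(n_random):
    """Fraction of primer molecules carrying any one exact sequence.

    Each random position divides the effective concentration by four.
    """
    return 0.25 ** n_random

# Hypothetical 10-nt tag; six random positions; every possible 3mer 3' end
primers = semi_degenerate_primers("GACTGACTGA", n_random=6, k=3)
fraction = effective_fraction(6)  # 1/4096 of the molecules match one exact site
```

With inosine in place of random nucleotides the dilution term would not apply, at the cost of the intercomplementarity noted in the text.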
  • the label (e.g. 32P-dATP etc.)
  • the type of DNA polymerase used in the sequencing reaction is important.
  • the enzyme must not have any 3' to 5' exonuclease activity, as this would allow a semi-degenerate primer to successfully amplify a product at a site that does not exactly correspond to its 3' end.
  • a number of enzymes available commercially for use in PCR lack such activity (e.g. Thermus aquaticus (Taq) polymerase, Thermus thermophilus (Tth) polymerase); although this may reduce the fidelity of polymerisation, it should not affect the accuracy of the technique.
  • a 5' to 3' exonuclease activity is useful, however, in reducing the tendency for semi-degenerate primer binding internally and thus reducing the frequency of short amplicons.
  • Reactions are set up using template DNA, deoxynucleotide triphosphates, a suitable buffer system, a suitable DNA polymerase and a single semi-degenerate primer or a set of such primers. Because the semi-degenerate primers can manufacture amplicons without the incorporation of the specific primer, and this may reduce the efficiency of specific primer generated reactions, a number of cycles at high stringency (with only the specific primer annealing) can be performed first. Also, the specific primer can be used in excess concentration. This may generate a number of single-stranded products with the specific primer at the ends (i.e. as in 'asymmetric PCR').
  • although the reactions can be performed on whole genomic DNA, the efficiency of specific primer-generated reactions is again increased if the template is separated and/or purified before the reactions.
  • the semi-degenerate primers can then be added and the annealing temperature lowered. The exact annealing temperature for any set of semi-degenerate and specific primers may need to be determined empirically. Thereafter, cycles of PCR are repeated as usual with the elongation temperature appropriate to the DNA polymerase (e.g. 72°C for Taq polymerase) and denaturation steps at 96°C.
  • the technique is also of use in determining the sequence of a specific mRNA, from extracted cell total RNA, after reverse transcriptase amplification with a single specific primer. Subsequent use of a further specific primer and a set of semi-degenerate primers may allow determination of the sequence without amplification of other mRNAs.
  • the choice of semi-degenerate primer sets influences the number of reactions and electrophoresis runs as well as the decoding for the final sequence solution.
  • the simplest primer set, which uses four semi-degenerate primers in four separate PCR reactions each with a labelled specific primer, is shown below:
  • R represents equal proportions of each nucleotide A, T, G, C or inosine
  • tag represents a common specified DNA sequence of 10-20 nucleotides that does not share complementarity with itself or the chosen specific primer.
  • the four reactions are run out on a non-denaturing gel (e.g. agarose or polyacrylamide) in separate lanes or in the same lane if four different fluors have been used to label the specific primer in each reaction. The sequence can be deduced easily by the relative lengths of the bands occurring in each lane.
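The decoding of the four-lane gel can be illustrated with an idealised simulation. The sketch below assumes that every template position primes successfully, that bands are resolved to a single nucleotide, and that each amplicon is longer than the position of the primer's 3' base by a constant `overhang` (the rest of the semi-degenerate primer); these simplifications and all function names are assumptions for the example, not part of the patent.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def simulate_lanes(template, overhang=0):
    """Map each primer 3' base to the band lengths seen in its lane.

    `template` is the strand extended from the labelled specific primer,
    read 5' to 3'. A semi-degenerate primer whose 3' base is complementary
    to template position `pos` yields a labelled amplicon of length
    pos + overhang in this idealised model.
    """
    lanes = defaultdict(list)
    for pos, base in enumerate(template, start=1):
        lanes[COMPLEMENT[base]].append(pos + overhang)
    return dict(lanes)

def decode(lanes):
    """Read the sequence by ordering all bands by length across lanes."""
    bands = sorted(
        (length, COMPLEMENT[primer_base])
        for primer_base, lengths in lanes.items()
        for length in lengths
    )
    return "".join(base for _, base in bands)

lanes = simulate_lanes("GATTACA", overhang=25)
recovered = decode(lanes)
```

Because the overhang is the same in every lane, only the relative order of band lengths matters, which is why the sequence can be read directly from the gel.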
  • Amplicons with semi-degenerate primers at each end do not appear on the gel as they are not labelled. Subsequent autoradiography or fluorescence detection is performed to reveal the specific primer-containing amplicons. A more sophisticated set of four semi-degenerate primers that specify the first two 3' positions is as follows:
  • Primers in which the two 3' positions can each be one of two nucleotides, make up a set of four sequences that match their complementary sequences during the annealing step of the PCR reaction.
  • There are 36 (6²) such primers and 58,905 ways of choosing a set of four different ones (³⁶C₄).
  • This particular combination allows unambiguous sequence determination, as the pair of nucleotides at any one position in each primer is not reproduced exactly in the other position in any of the four primers.
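The two counts quoted above can be checked by direct enumeration. In this sketch each of the two 3' positions is represented by an unordered pair of allowed nucleotides (six choices per position); the variable names are illustrative only.

```python
from itertools import combinations, product
from math import comb

# Six ways to choose which two of the four nucleotides are allowed at a position
pairs = list(combinations("ACGT", 2))

# A primer type is one pair for each of the two 3' positions: 6 * 6 = 36
primer_types = list(product(pairs, repeat=2))

# Number of ways to pick four different primer types out of 36
n_sets = comb(len(primer_types), 4)
```

Only some of these sets satisfy the condition stated in the text, namely that the pair at one position of a primer is not reproduced in the other position within the set.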
  • each position of the template requires the annealing of two complementary primers and the successful amplification of product.
  • a larger set of semi-degenerate primers may allow each nucleotide position to be 'checked' as each would be complementary to two or three primers in the sequencing reactions:
  • This method only sequences one strand of the template DNA (unlike the subsequent embodiments) and so an anti-sense 'check' does not inherently occur. Also, the length of sequence determined in each experiment is limited by both the length of double-stranded amplicon that can be distinguished to the accuracy of a single nucleotide pair during electrophoresis (e.g. approx. 3000 bases) and the length that can be generated by the polymerase enzyme.
  • one or more semi-degenerate primer(s) with specified 3' ends are used in a set of individual sequencing reaction experiments such that each reaction contains different 3' sequences. A specific primer is not used. A number of such experiments are performed on purified template DNA.
  • the size of the amplicons from each experiment is then determined.
  • the final sequence can be solved knowing the length of nucleotide between the flanking primer(s) that were used in each experiment.
  • the number of experiments that are necessary, together with the specificity of binding of the semi-degenerate primers that are needed, can be determined using statistical considerations.
  • the technique can be modified by the use of endonuclease enzymes to fragment the template in a predictable fashion prior to carrying out semi-degenerate primed PCR.
  • the template polynucleotide needs to have been separated from other contaminating polynucleotides because a specific labelled primer is not used. Some contaminating sequence can however be tolerated during the reconstruction of sequence. However, significant contaminating polynucleotide would prevent the clean separation of legitimate amplicons during size analysis.
  • One method to obtain such purification of a specific segment of DNA for example, without cloning, is to use a biotin/streptavidin or similar binding system.
  • a biotinylated oligonucleotide with a sequence complementary to a part of the sequence to be determined is allowed to hybridise to total genomic DNA following a fractionation or endonuclease step.
  • the bound fragment is then isolated using avidin or streptavidin capture.
  • Other methods include the flow-sorting of nucleotide fragments and electrophoretic separation.
  • the desired template can be pre-amplified by prior PCR and subsequently purified by standard methods.
  • For any single nucleotide within the target polynucleotide to be included in the final reconstructed sequence, two events need to occur: firstly, the nucleotide must be bound to at least one primer within its 3' specified sequence; secondly, this specific nucleotide duplex must go on to be incorporated in an amplicon.
  • the number of different specified 3' primer sequences that are necessary to sequence unambiguously does not need to include all the possible 3' nucleotide combinations. For example, considering the use of 5'-tag-(Random)x-(N)3-3' primers, where (N)3 represents a specified 3mer sequence, not all 64 possible primers need be included. If they were, each nucleotide would be bound at 3 sites by 1 to 3 different primers.
  • Various algorithms can be designed to determine a smaller set of 3mers that bind all possible nucleotides (with regard to their surrounding sequence) at least once.
  • AAA AAG, AAC, AGA, AGC, ACA, ACG, ACC, ATA, ATG, ATC, ATT, GGA, GGG, GGC, GGT, GCA, GTA, GTG, GTC, GTT, CGA, CGC, CCA, CCG, CCC, CTA, CTG, CTC, CTT, TGA, TGC, TCA, TTA, TTC, TTT
  • AAAA AAAA, AAGA, AACA, AATA, AATC, AGGA, AGGG, AGGC, AGGT, AGCA, AGCG,
  • ACCG ACCG, ACCC, ACCT, ACTA, ACTG, ACTC, ACTT, ATGA, ATGG, ATGC, ATGT,
  • ATCA ATTA, ATTG, ATTC, ATTT, GAAA, GAGA, GACA, GATA, GATC, GGGA, GGGG, GGGC, GGGT, GGCA, GCGA, GCGG, GCGC, GCGT, GCCA, GCCG, GCCC,
  • GCCT GCCT
  • GCTA GCTG
  • GCTC GCTT
  • GTAA GTGA
  • GTGG GTGG
  • GTGT GTCA
  • GTTA GTTA
  • CCGT CCCA
  • CCCG CCCC
  • CCCT CCTG
  • CTAA CTCA
  • CTTA CTTG
  • TAAA TAAA
  • TAGA TACA, TATA, TATC, TGAA, TGGA, TGGT, TGCA, TGTA, TGTG, TCGA, TCGG, TCGC, TCGT, TCCA, TCCG, TCCC, TCCT, TCTA, TCTG, TCTC, TCTT,
  • TTCA TTCA
  • TTTA TTTG
  • TTTC TTTT
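One possible algorithm of the kind referred to above is a greedy set cover. The sketch below rests on an interpretation that is an assumption, not a statement from the patent: a nucleotide "with regard to its surrounding sequence" is taken to be the centre of a 5mer, and it counts as bound if at least one of the three 3mers overlapping it is in the chosen set. The greedy choice is not guaranteed to give the smallest possible set, and it is not claimed to reproduce the lists given above.

```python
from itertools import product

BASES = "ACGT"

def greedy_3mer_cover():
    """Greedily pick 3mers until every 5mer contains at least one of them.

    Any 3mer occurring inside a 5mer overlaps the 5mer's central
    nucleotide, so covering all 5mers covers every nucleotide context.
    """
    uncovered = {"".join(p) for p in product(BASES, repeat=5)}
    candidates = ["".join(p) for p in product(BASES, repeat=3)]
    chosen = []
    while uncovered:
        # Pick the 3mer that occurs in the most still-uncovered 5mers
        best = max(candidates, key=lambda t: sum(t in f for f in uncovered))
        chosen.append(best)
        uncovered = {f for f in uncovered if best not in f}
    return chosen

cover = greedy_3mer_cover()
```

Homopolymer contexts such as AAAAA can only be covered by their own 3mer, so AAA, CCC, GGG and TTT appear in any valid set under this interpretation.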
  • 64 simultaneous PCR reactions labelled conventionally (e.g. with α-32P-dCTP during the reactions), each containing a single 3mer semi-degenerate primer, after complete digestion with endonuclease enzymes that have a total probability of cutting of 2/4^5 (e.g. two 5-cutters, or two of the form -g, A/T, g, C, A/T, C-), run out on a non-denaturing gel, would give useful information from 640 base pairs (1% probability of no amplicons) to 3760 base pairs (99% probability of no amplicons).
  • the reconstruction of the final sequence using the second method is more complicated than the decoding necessary for the first method.
  • the length x of nucleotide is determined. This can be taken as either exact or to lie within specified limits, depending on the accuracy of nucleotide sizing.
  • a nucleotide sequence or set of sequences that defines the ends of each length x is determined. If n different semi-degenerate 3' end sequences are used in the reaction being considered, then this set will comprise all of these n sequences.
  • Different algorithms can be designed to use these data to piece together the final sequence solution. One algorithm involves searching for lengths of amplicon in a determined fashion until a match, as follows, is made.
  • (A) represents a 3' primer sequence, or set of sequences, used in one semi-degenerate reaction (reaction 1).
  • the complementary sequence or set of sequences A' has been used in another separate reaction (reaction 2) .
  • a, b, c and d represent lengths of amplicon occurring in these two reactions - a, c and d in reaction 1 and b in reaction 2.
  • This embodiment of the invention is an excellent way of generating polynucleotide fragments from a template nucleotide for the use of hybridisation onto an oligonucleotide array.
  • the product from each semi-degenerate reaction can be fragmented using techniques already described (e.g. Chee et al., Science 1996; 274:610-) and hybridised to an array.
  • any non-contiguous repeat sequences could be unambiguously ordered as each repeat would turn up in differently labelled amplicons.
  • concomitant sizing of the reaction products e.g. using electrophoresis may allow the determination of the length of contiguous repeat sequences as well.
  • this third embodiment of the invention proceeds through repeated polymerase reactions with specific pairs of primers, one of which is a semi-degenerate primer with a defined 3' end.
  • the other primer is designed so that it has a nucleotide sequence that is specific to the individual semi-degenerate primer and the individual step in the overall series of reactions.
  • This primer also contains sequences that are recognised by corresponding primers in subsequent reactions so that, after several reactions, the polymerase product contains one end labelled with multiple primers, which define the order of binding and identity of the semi-degenerate primers to the template polynucleotide. Therefore, as the series of reactions proceeds, one end of the template polynucleotide will be shortened as subsequent semi-degenerate primers hybridise 3' to the last one, and one end will be lengthened, as the second primer is incorporated into the polymerase product. Ultimately, no template will remain, and the polymerase product will consist of a defined series of nucleotides. The products can then be hybridised exclusively to specific addresses on an oligonucleotide array.
  • the basic strategy involves an initial single semi-degenerate sequencing reaction to generate amplicons from the template sequence using a set of semi-degenerate primers.
  • the reactions are carried out so that each semi-degenerate primer of the set is contained in a separate compartment.
  • the resulting amplicons are used in the subsequent cycle of reactions.
  • a number of sequential semi-degenerate primed reactions are interspersed with complete mixing of the reaction products and further separation for another round of reactions.
  • Initial amplicons from the first reaction on the template sequence are labelled differently at each end, and only asymmetrically labelled amplicons 'survive' the subsequent processing. This enables one end of the amplicon to be used for 'tagging' and the other for sequencing.
  • Subsequent division of the reactants into separate compartments, followed by semi-degenerate sequencing reactions with primers specific to each compartment, allows the progressive translation of random sequence at one end into designed tags at the other as the cycles of mixing, separation and polymerisation reactions continue. Whilst the number of determined sequences increases exponentially, the number of primers needed for the sequencing strategy increases only arithmetically.
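The contrast between exponential and arithmetic growth can be made concrete with a simplified count. The sketch assumes the same number of compartments at every step and one compartment-specific primer per compartment per step; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def scaling(compartments, steps):
    """Compare primer count with the number of distinguishable tag paths.

    A tag path records which compartment an amplicon passed through at
    each step, so paths grow as compartments ** steps while the number of
    compartment-specific primers grows as compartments * steps.
    """
    return {
        "primers": compartments * steps,
        "distinct_tag_paths": compartments ** steps,
    }

# For example, 16 compartments over 5 steps
result = scaling(16, 5)
```

Under these assumptions 80 primers would distinguish over a million tag paths, which is the sense in which the strategy scales to large arrays.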
  • a final PCR reaction on all the amplicons amplifies only the complete tagged sequence which is then used to hybridise to a specially designed array.
  • Each tag contains sequences that represent the primer ends of the initial amplicon as well as a unique contiguous sequence within that amplicon.
  • Subsequent decoding allows the determination of a proportion of the original template.
  • a number of similar experiments are performed with distinct initial primers to ensure a complete coverage of the template. Decoding can then identify all unique sequences and accurately order sequences between non-contiguous repeats. The only ambiguity remaining will be the determination of the number of contiguous repeat sequences (e.g. di-, tri- and tetra-nucleotide repeats). Sizing (e.g. electrophoresis) of a set of reaction products will then allow the determination of the lengths of contiguous repeat sequences.
  • the technique depends upon the generation of amplicons in which each end is bound to a different primer, one end being the 'sequencing end', the other being the 'labelling end'.
  • the 'sequencing end' is bound internally by subsequent primers to its 5' end and this ensures that symmetric amplicons with two sequencing ends do not obfuscate the final hybridisation analysis (as the final reaction can only take place if the original 5' tag is available for primer binding).
  • the 'labelling end' is bound at the 5' end. Symmetric amplicons with two labelling ends, if allowed to occur, would undergo subsequent rounds of amplification and be available for the final hybridisation step.
  • Such amplicons can be eliminated by incorporating a separation step in which only amplicons with one or both 'sequencing ends' are retained. This step is undertaken after step A in the following example. It could also be undertaken after step B.
  • However, this latter method would have the disadvantage that one cycle of reaction mixing will already have occurred, and so the reactants would need to be transferred from apparatus designed for cycles of mixing and separation. If this step is undertaken after step B, it would involve the incorporation of a biotin or similar label in the sequencing primer (primer B1). After step B, the reactants are pooled and captured using an avidin/streptavidin (or similar) system, so eliminating amplicons with two 'labelling ends'. Many variations exist within the basic strategy described above.
  • 'Tag' sequences are domains of primer 5' ends that are specific to both the step of the process and the compartment.
  • Tag B is specific to step B
  • Tag 1 is specific to compartment 1 and so Tag B1 was used during step B in compartment 1.
  • 'C' sequences are domains of primer 5' ends that are specific to each step but common to all compartments.
  • C C has been used in all compartments during step C.
  • 'N' sequences represent the specified 3' sequence of the semi-degenerate primer.
  • Subscripts x,y etc represent these different 3' specified sequences.
  • (R) n represents sequences within primers that have randomly applied nucleotides at each position, a universal base e.g. inosine, or a combination of the two.
  • the steps of the process are designated A, B, C etc and the compartments 1,2,3, etc.
  • the first reaction mixture includes the template to be sequenced which can be any purified source of contiguous polynucleotide (e.g. flow-sorted whole mammalian chromosomes) .
  • the choice of semi-degenerate primers is determined by the template length and size of array available; the primer binding probability may be adjusted to include a desired proportion of nucleotides in the subsequent amplicons.
  • the annealing temperature is determined empirically for each reaction. In this first single reaction, if a number of such primers (n) is used, this gives n(n+1)/2 types of amplicon with respect to amplicon ends.
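The n(n+1)/2 figure counts unordered pairs of end primers, including amplicons with the same primer at both ends. A short enumeration makes this explicit; the primer labels are placeholders.

```python
from itertools import combinations_with_replacement

def amplicon_end_types(primers):
    """List every unordered pair of primers that can flank an amplicon."""
    return list(combinations_with_replacement(primers, 2))

# Four hypothetical semi-degenerate primers
end_types = amplicon_end_types(["N1", "N2", "N3", "N4"])
n = 4
expected = n * (n + 1) // 2
```

Of these, n types carry the same primer at both ends and n(n−1)/2 carry two different primers.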
  • the semi-degenerate primers for step A are designed to a blueprint similar to the following:
  • 'N' sequences represent the specific 3' ends of the semi-degenerate primers.
  • A seq and A lab in a 1:1 proportion.
  • the 'N' sequences are designed to have similar annealing temperatures (the N sequences do not necessarily need to bind all possible nucleotides in the template but instead need to generate a proportion of all template within the resulting amplicons).
  • Primer A seq will become incorporated as the sequencing end of amplicons and is biotinylated to allow its retrieval (above).
  • Primer A lab will become incorporated as the labelling strand and is not biotinylated.
  • C regions represent 10-20mer tagging sequences common to all the primers in the reaction, C Aa, C Ab, etc. being different from each other, but which are specific to step A and occur in each primer during step A.
  • Tag Axseq and Tag Axlab represent 10-20mer tagging sequences which are different to each other and are specific to the primer with the 3' end of (N x) and to step A.
  • (R) n represents a number of random nucleotides, inosine or a combination thereof.
  • an avidin/streptavidin capture step (or similar technique) is incorporated to eliminate A lab-A lab amplicons. Alternatively, this can be done after step B (see above).
  • reagents i.e. primers, polymerase, dNTPs etc
  • reaction product is divided into separate compartments suitable for polymerase reactions (e.g. Eppendorf tubes, PCR silicon chips, a specifically designed chamber with a facility to erect small water-tight separating walls to create temporarily a number of non-communicating individual compartments).
  • the second reaction step (step B) is designed to separately label each amplicon end such that the 'labelling end' is labelled with a tag sequence specific to the 'sequencing end' of the same amplicon. Subsequent hybridisation at the end of the sequencing experiment will then reveal that the random sequence in question occurred within an amplicon derived in step A from the two specific primer sequences encoded by the first two tag sequences. This is achieved by using each separate compartment to amplify only those amplicons with a specific (or a set of specific) 'sequencing ends'. This principle of the selective amplification of amplicons in each compartment is used in subsequent steps.
  • a typical pair of primers used in this step would be as follows:
  • Step B Primers Bl) 5'- C ⁇ -Tag AlK - C Ac -(R) n -N,-3' B2) S'-C.-Tag B .-C ⁇ '
  • Primer B1 is designed to bind only to the 'sequencing end' of amplicons that contain the two sequences N x and Tag AlaB ⁇ I . Amplicons with other 'sequencing ends' are not amplified in compartment 1.
  • Primer B2 binds to the C ft , sequence of the labelling end of amplicons applying a tag sequence Tag B1 - that identifies this amplicon as having a N 2 sequence at the other end. It also applies a new step-specific C B sequence which will allow subsequent reactions to take place. After this step, two types of amplicon will be formed in each compartment with regard to amplicon ends:
  • B2 - B2 amplicons will not exist because amplicons without at least one 'sequencing end' will have been eliminated (as above) .
  • B1-B2 amplicons contain the sequence C ⁇ which will be required for the final step that generates tags for hybridisation to an array.
  • B1-B1 amplicons do not possess this sequence and so will not be included in the final sequence analysis.
  • B1-B1 amplicons will only occur when an amplicon has the same compartment-specific sequence at each end, a rarer event than B1-B2 amplicons, which require only that there is a compartment-specific sequence at one end.
  • B1-B2 amplicons will have at their ends the following sequences:
  • B2 ends 5 ' -C B -Tag Bx -C Ab -Tag Aylab - C ⁇ - (R) n -N y -3 '
  • the B2 end codes for the ends of the relevant amplicon from step A with the tags Tag Bx and Tag Aylab which in this case code for the step A primers with the specific end sequences of N x and N y respectively .
  • the reactants from each compartment are evenly mixed and again divided into a number of compartments for reactions with a pair of compartment- specific primers.
  • Step C Subsequent steps are concerned with identifying sequences within the B1-B2 amplicon.
  • the primers used are designed so that a hybridisation event occurring within this region generates a specific tag sequence at the labelling end.
  • An example, termed primer C2, is shown below:
  • domain C B binds to all 'labelling ends' or (B2 ends) of B1-B2 amplicons.
  • successful amplification only occurs if the other end has in it a sequence specific to the particular compartment.
  • primer C2 carries a tag sequence ag ⁇ - that codes for this compartment-specific sequence N x and hence adds it to the two other tag sequences at this 'labelling end' that identify the original step A primers for that amplicon.
  • Domain C c is common to all compartments in step C and will allow subsequent primer binding at the labelling end in the next step, step D.
  • a primer can be designed to interrogate the nucleotides immediately adjacent and upstream of the original step A primer sequence.
  • the primer could interrogate the three nucleotides -XXX- in the following Bl end:
  • domain C ⁇ , because it is unique to all B1 ends, only allows binding of primer C1 to the 'sequencing ends' of amplicons.
  • ⁇ ⁇ is a sequence of, for example, 3 nucleotides that are introduced as mismatches and allow for efficient primer hybridisation in the subsequent step.
  • (R) n spans all (R) n - N x - sequences of B1 ends so that all B1 ends are interrogated.
  • XXX is the compartment-specific sequence that interrogates the upstream three nucleotides and only allows subsequent amplification if this specific sequence is present (in this case only 1 in 64 such amplicons, on average, would be amplified) .
  • If the sequences XXX were completely non-degenerate then, for complete sequencing, 64 compartments would be required. If XX non-degenerate sequences were used, 16 compartments would be required. Also, if an A/T-XX sequence was used, in which the first 5' nucleotide can be one of two bases and the other two are specific, 32 compartments would be required, and so on.
  • the advantage of this particular kind of primer C1, because of the presence of the domain C ⁇ , is that it allows the annealing temperature of primer C1 to easily match that of primer C2.
  • the main disadvantage is that only nucleotides directly adjacent to step A 3' primer sequences are interrogated and sequenced.
  • a C1 primer can be designed so that it binds anywhere within the amplicon sequence.
  • An example of such a primer is as follows:
  • This primer will bind wherever the specific sequence XXX occurs within an amplicon. However, the annealing temperature for efficient amplification with such a primer is likely to be low and not correspond with the annealing temperature of primer C2.
  • One way round this is to incorporate one or more ' anchor ' non-random nucleotides immediately downstream from the sequence XXX, here shown as sequence (N) n , which are common to all compartments. This limits the sequences that will be interrogated but if n ⁇ 3 such a limitation is easily overcome by the fact that both strands of template are being sequenced and that further experiments with differing anchor sequences can also be performed on the same template.
  • a further way to overcome the likely different annealing temperatures of primers C1 and C2 is to use a number of reaction cycles with only the C2 primers at the appropriate annealing temperature, such that a number of single strands are generated before the addition of C1 primers and lowering of annealing temperature.
  • a primer of the C1(b) type will be used containing two anchor nucleotides -AG- as follows:
  • there will be two types of amplicons with respect to amplicon ends, amplicons C1-C2 and C1-C1.
  • the amplicon ends will be as follows:
  • step D the amplicons are further interrogated with respect to the nucleotides 3' to the sequencing ends.
  • step D will be described here.
  • the number of subsequent steps will be determined by the size of the oligonucleotide array available for hybridisation, the detrimental effects of the accumulation of primers from previous steps (see below) , and the mutation rate of the DNA polymerase. A larger number of steps increases the length of contiguous nucleotide that is encoded in each sequence and also increases the outcome set for each experiment.
  • Step D occurs after the even mixing of all step C reactants and the subsequent division into compartments.
  • a pair of primers is used in each compartment, one each to bind to the sequencing and labelling ends.
  • An example of such a primer pair, used in compartment 1, is shown below:
  • D2 labels the 'labelling end' of each amplicon by virtue of the binding of domain -C c - and labels the amplicon with a tag specific to step D and compartment 1 encoding for the unique sequence XXX.
  • Primer D1 interrogates the 3 nucleotides 3' to those previously interrogated in step C.
  • the presence and binding of sequences Tag ⁇ e ,- and AG allows the annealing temperature to be increased to match primer D2. It also prevents internal binding within the amplicon and so only contiguous sequence is interrogated.
  • Sequence (YYY) Dn ⁇ w is a sequence of 3 nucleotides common to all compartments that is introduced as mismatches and allow for efficient primer hybridisation in the subsequent step.
  • step E might comprise the following two primers:
  • XXX being the compartment specific sequence used to interrogate amplicon sequence and corresponding to the tag sequence Tag E1 - on the labelling end.
  • a final PCR step is performed to amplify the tag sequences on the labelling ends of the amplicons. Only amplicons that have completed each step will be so amplified and used in the hybridisation step. If steps through from A to F have occurred the 'labelling end' of surviving amplicons will look as follows:
  • This sequence contains 6 different tag sequences that together encode for the original amplicon ends as well as a contiguous sequence of 12 nucleotides contained somewhere within that amplicon on the strand shared by sequence B2.
  • the outcome set for this reaction will be 64^6, that is approximately 6.9 x 10^10.
  • C AJ ' is complementary to the sequence C ⁇ .
  • This final step can be used to generate RNA using a T7 RNA polymerase promoter or a fluor-labelled single stranded DNA in asymmetric PCR for hybridisation to an array as described by Hacia et al and Shoemaker et al respectively (Nature Genet 1996; 14:441-449) .
  • the form of the final sequences generated for hybridisation will be as follows:
  • Each address on the array would have a nucleotide sequence of this form.
  • the set of (64 x 7) 448 Tag sequences will have been specifically designed so that they do not cross-hybridise to a complementary sequence of any other sequence within the set. This could easily be achieved using a subset of the 16384 possible 7mers for example.
  • the C domains of the array can be designed to have a number of mismatches with respect to the C domains in the hybridisation sequences.
  • The problem of contaminating primers. One problem with this technique is the accumulation of primers in the reaction mixture from previous steps following step B (above).
  • One way to eliminate such primers is to purify the amplicons, using standard methods for purifying PCR products, after each step. This would, however, add a further manipulation to each step, and make automation of the whole experiment more complex.
  • One way to reduce the chances of this happening, other than purifying the products, is to incorporate a dilution step after step B, once the reactants have been mixed. The only component of the reactants to have increased in concentration will be the successfully amplified products from the previous step. Hence, by diluting the reactants, such amplicons would be the only species to survive in appreciable concentration. Further polymerase enzyme, dNTPs, buffer solution etc. will need to be added before the subsequent steps. Also, the number of cycles of reaction for each step should be high enough to ensure adequate concentration of amplicons after the dilution step. This process could be easily automated within a system specially designed for the repetitive mixing and dividing of reactants.
  • a further method to eliminate unwanted primers from previous reaction steps is to add oligonucleotides whose 3' ends are complementary to such primers (and not to the primers of the present reaction step) so that dimers can be formed and the 3' ends of the primers be 'neutralised' after polymerase elongation.
  • Non-contiguous repeated sequences in the template of, in the above example, 11 or more nucleotides can be unambiguously ordered, as each will occur in one or more different step A amplicons. If the elongation step of the step A reaction is restricted to the generation of amplicons of 3000 bases or less, then the probability of having 3 or more sequences of 11 or more nucleotides repeating in random DNA in the same amplicon is very small. Even if this did occur, the inclusion of one such repeat in another amplicon would allow unambiguous ordering.
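The counting arguments used in the steps above (compartments per interrogation step, the size of the outcome set, and the size of the tag set) can be checked with a short sketch. The function name `compartments` and the encoding of an interrogating sequence as a list of "bases accepted per position" are assumptions made for illustration only; they do not appear in the specification.

```python
from math import prod

def compartments(allowed_bases_per_position):
    """Number of compartments needed so that every possible sequence at the
    interrogated positions is matched in exactly one compartment.

    Each entry is the number of bases one primer accepts at that position:
    1 for a fully specified base (X), 2 for a two-base position such as A/T.
    """
    return prod(4 // n for n in allowed_bases_per_position)

# Interrogating sequences discussed in the text (hypothetical encoding)
print(compartments([1, 1, 1]))  # XXX
print(compartments([1, 1]))     # XX
print(compartments([2, 1, 1]))  # A/T-XX

# Six tags, each taking one of 64 values, give the outcome set
outcome_set = 64 ** 6

# Tag set size quoted in the text, and the pool of 7mers it could be drawn from
tags_needed = 64 * 7
possible_7mers = 4 ** 7
print(outcome_set, tags_needed, possible_7mers)
```

Under these assumptions the tag set uses only a small fraction of the available 7mers, which is what leaves room to choose tags that do not cross-hybridise.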


Abstract

A method for the analysis of polynucleotide molecules comprises the steps of: a) reacting a target polynucleotide with oligonucleotide primers, a polymerase enzyme, and the other reagents necessary for the polymerase reaction, wherein the oligonucleotide primers are chosen such that polymerase products of varying lengths are produced; and b) analysing the products of the said reaction or reactions; wherein the oligonucleotide primers include an array of semi-degenerate primers whose (3') ends comprise variations of one or more nucleotides A, T, G, and C such that the array is complementary to all the polynucleotide sequence.

Description

POLYNUCLEOTIDE SEQUENCING USING SEMI-DEGENERATE PRIMERS
Field of the Invention
This invention relates to a method for determining the sequence of a polynucleotide using semi-degenerate oligonucleotide primers.
Background to the Invention
The endpoint of most experiments in molecular biology involves the accurate determination of a deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence. Indeed, the determination of the DNA sequence of whole genomes has been the focus of much research, effort and funds. It is anticipated that knowledge of such sequences will be of unparalleled benefit to the understanding of biology and disease.
The standard sequencing method involves the elongation of a DNA primer sequence along a polynucleotide template using a DNA polymerase enzyme, deoxynucleotide triphosphates and dideoxynucleotide triphosphates, as proposed by F Sanger et al. (PNAS 1977; 74:5463-). The latter species terminate the elongation reaction and, if labelled specifically (e.g. with a specific fluorophore), or separated in four different reaction tubes, allow the determination of sequence. This method, whilst accurate, has a number of disadvantages. Firstly, the range of accurate sequence that can be determined in one reaction and electrophoresis run is limited to 500-800 bases, due mainly to the scarcity of long products which have escaped the earlier dideoxynucleotide termination. Secondly, experiments to determine unknown sequence cannot easily be performed in parallel, as the result of one sequencing reaction needs to be identified before a further clone or polymerase chain reaction (PCR) product is retrieved for the determination of further adjacent sequence. Thirdly, the concentration of template needs to be high, such that pre-amplification of the template DNA is necessary before the sequencing step. Fourthly, only one strand of the template DNA can be determined during each reaction, a separate reaction being needed to sequence the complementary strand to check the validity of the determined sequence. These factors impose serious limits on the speed and efficiency of sequence determination.
A second method to sequence a polynucleotide has been proposed using hybridisation of unknown DNA to a large panel of known oligonucleotides (Drmanac et al., Genomics 1989; 4:114-, Bains et al., ibid. 1991; 11:294-, Southern et al., ibid. 1992; 13:1008-). Such a technique is potentially powerful, especially since the development of microscopic oligonucleotide arrays using photolithographic technology (Fodor et al., Science 1991; 251:767-, Pease et al., PNAS 1994; 91:5022-). Unlike Sanger sequencing, many hybridisation 'events' occur in parallel on the same array. The number of outcomes from a hybridisation experiment greatly exceeds those from an experiment involving electrophoresis alone, and so hybridisation may potentially be of value in the determination of a megabase sequence. There are, however, at least three important drawbacks to this method. Firstly, the hybridisation of one sequence is not entirely specific to its complementary sequence alone, and cross-hybridisations involving one or more mismatched nucleotides are difficult to avoid. This inaccuracy prevents the use of an array to determine unknown sequence. Secondly, for any given size of oligomer probe on the array there is a limit to the size of non-contiguous repeated units within the template sequence that can be unambiguously ordered. Thirdly, adjacent repeats of identical sequence such as di- or tri-nucleotides, which occur very commonly in mammalian DNA, can never be accurately determined by using hybridisation alone.
The polymerase chain reaction has proved an extremely powerful technique in nucleotide analysis. In its most straightforward form, this involves the amplification of sequence between two smaller known sequences that can be encoded by two oligonucleotides known as 'primers'. A DNA polymerase enzyme extends a 5' to 3' sequence of nucleotides from each primer complementary to the template sequence. Cycles of denaturation, primer annealing and polymerase elongation are performed to manufacture many identical copies of desired double-stranded DNA between these two primer sequences. For the reaction to work, the primer nucleotide has to anneal to its complementary sequence in the template DNA, and the affinity and specificity of this annealing process can be controlled to some extent by the annealing temperature and the salt concentration. When DNA polymerase enzymes that do not possess a 3' to 5' exonuclease function (e.g. Taq polymerase) are used, elongation of the primer sequence will not take place if the 3'-most nucleotides of the primer do not match exactly with the corresponding nucleotides on the template strand. A number of mismatches 5' to these sites can be tolerated, particularly if the annealing temperature is lowered or salt concentration increased, mismatches being tolerated less well the closer they are to the 3' end of the primer. The annealing temperature and salt concentration for each pair of primers has to be determined empirically, although calculation of the (G+C) to (A+T) ratio, primer length and salt concentration do allow a melting temperature (Tm) for each primer to be estimated.
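As an illustration of how a melting temperature might be estimated from base composition, the sketch below uses the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C). This rule is not named in the text and is an assumption here; it ignores salt concentration and is only a rough guide for short oligonucleotides. The function name `wallace_tm` is hypothetical.

```python
def wallace_tm(primer):
    """Rough melting temperature in deg C by the Wallace rule of thumb:
    2 deg C for each A or T, 4 deg C for each G or C.
    Salt concentration is not taken into account."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# An arbitrary 8-mer with two of each base
print(wallace_tm("AGCTTGCA"))  # 24
```

In practice, as the text notes, the working annealing temperature still has to be found empirically.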
As well as using primers of specified sequence for PCR, semi-degenerate primers, in which only the 3'-most nucleotides are determined and the others random, or degenerate primers, in which all positions are random, have been used successfully in experiments. PCR reactions using these primers have been used to amplify specific chromosomes for the generation of chromosome-specific oligonucleotide probes (Telenius et al., Genomics 1992; 13:718-) and to amplify single copy DNA (e.g. single cells) for further analysis (Zhang et al., PNAS 1992; 89:5847-). Amplification using semi-degenerate primers shows reproducible bands with template DNA, suggesting that the non-random 3' nucleotides still confer specificity of binding. The same conclusion can be drawn from the reproducible amplicons derived from short primers in the process known as random amplification of polymorphic DNA (RAPD). In this technique, a small single primer of arbitrary sequence is used at low annealing temperature in a PCR reaction to distinguish different strains of organism, on the basis of reproducible amplicon sizes, following amplification. If mismatches could occur all along the primer the bands would not be reproducible; the specificity of the reaction is therefore likely to be caused by the specific matching of the primer 3' ends.
Summary of the Invention
The present invention involves a method for the analysis of polynucleotide molecules using semi-degenerate primers, and the polymerase reaction.
The method comprises the steps of: a) reacting a target polynucleotide with oligonucleotide primers, a polymerase enzyme, and the other reagents necessary for the polymerase reaction, wherein the oligonucleotide primers are chosen such that polymerase products of varying lengths are produced; and b) analysing the products of the said reaction or reactions; wherein the oligonucleotide primers include an array of semi-degenerate primers whose 3' ends comprise variations of one or more nucleotides A, T, G and C such that the array is complementary to all the polynucleotide sequence.
The semi-degenerate primers are used in a set of separate reactions to generate amplicons from a contiguous segment of polynucleotide, the sequence of which is to be determined. Prior knowledge of the primers' 3' ends in each reaction, and subsequent sizing of fragments, allows the unambiguous determination of template sequence. In an embodiment of the invention, tagging of the 5' ends of the primers with a specifically designed sequence allows manipulations such as separation of amplicons on the basis of complementary sequence hybridisation, the addition of primer sites for further amplification and incorporation of sites for in vitro transcription and hybridisation to oligonucleotide arrays. Furthermore, the frequency of binding of primers on the template, the number of distinct amplicons of specific length generated and the proportion of template covered by amplicons from specific primers can be predicted using statistical analysis. The technique can therefore be designed and adapted appropriately for each range of length of template that requires sequencing.
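A minimal sketch of the simplest such statistical prediction is given below: the expected number of sites on one strand at which the specified 3' end of a semi-degenerate primer matches. It assumes a random template with equiprobable, independent bases, which is an idealisation not stated in the text; the function and parameter names are hypothetical.

```python
def expected_sites(template_length, specified_length, allowed_per_position=1):
    """Expected number of positions on one strand where the specified 3' end
    of a primer matches, for a random template with equiprobable bases.

    specified_length: number of specified 3' positions.
    allowed_per_position: bases accepted at each specified position
    (1 for a single base, 2 for a two-base position such as A/T).
    """
    p_match = (allowed_per_position / 4) ** specified_length
    positions = template_length - specified_length + 1
    return positions * p_match

# A fully specified 3mer on a 3000-base template
print(expected_sites(3000, 3))
# Two 3' positions, each accepting two bases
print(expected_sites(3000, 2, allowed_per_position=2))
```

Real templates are not random, so figures of this kind are only a starting point for choosing how many 3' positions to specify for a given template length.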
In a further embodiment of the invention, amplicon generation using semi-degenerate primers allows the 'translation' of unknown nucleotide sequence into a series of designed oligomer sequence 'tags' for hybridisation to oligonucleotide arrays, so that cross-hybridisation is minimised.
The invention has the following advantages over conventional methods:
(i) As amplification of nucleotide occurs during sequence determination, direct sequence analysis can be performed on low concentrations of template polynucleotide.
(ii) Because the sequencing reaction products are amplified during the reactions, their final high concentration allows easier detection than the single-stranded dideoxynucleotide terminated products of Sanger sequencing.
(iii) Unlike Sanger sequencing, the length of product generated by the semi-degenerate sequencing reactions is not limited to a maximum of <1000 bases, instead being limited only by the elongation time during reaction cycles and the specific type of polymerase enzyme used (e.g. 3.5 kb). This increases the number of nucleotides that can be determined in each reaction.
(iv) The concurrent use of a non-degenerate primer, if its annealing temperature is appropriate, may allow sequence determination directly from whole genomic DNA or total RNA.
(v) The reduced amount of manipulation of template polynucleotide and reaction products can improve the efficiency of large scale sequencing projects.
(vi) As the sequencing reaction products are double-stranded, denaturing agents are not necessary during product size analysis (e.g. electrophoresis).
(vii) Embodiments of the invention allow parallel analysis of a large template polynucleotide in which the sequence is unknown. Experiments that can only be undertaken sequentially by the Sanger method can therefore be set up simultaneously by this method.
(viii) Embodiments of the invention interrogate both sense and anti-sense strands simultaneously during the sequencing reactions, acting as a check for accurate sequence determination.
(ix) Embodiments of the invention generate products suitable for hybridisation analysis, providing enormous power to determine the sequence of megabase polynucleotides in single experiments. This power is limited only by the size of the oligonucleotide array and allows the possibility of whole genome sequencing.
Detailed Description of the Invention
I. In the simplest form of the invention, one non-degenerate specific primer is chosen. This primer could be a specific sequence at the beginning of the DNA sequence to be determined, or complementary to part of a cloning vector if the template DNA has been cloned. Ideally, it should have a low annealing temperature. The specific primer is labelled so that amplicons with incorporated primer can subsequently be identified. This can be done, for example, using end-labelling with γ-32P-dATP, biotinylation or attachment of a fluorophore. The latter labelling scheme, by using more than one distinct fluorophore with different emission frequencies, may allow later electrophoresis of more than one reaction product on one lane of an electrophoretic gel. A set of semi-degenerate primers is designed so that together their non-degenerate 3' ends cover all possible sequences in the template DNA. Random nucleotides, a universal base, e.g. inosine, or a combination of both entities are then used to make up the middle 5-10 nucleotides of the primer. Inosine has the disadvantage of allowing intercomplementarity of the semi-degenerate and specific primers, whilst random nucleotides have the disadvantage of decreasing by fourfold the effective concentration of each primer for each nucleotide position. A specific tag sequence can optionally be added to the 5' end. Such a tag sequence may encourage the binding of the semi-degenerate primer at the end of an amplicon following the first or subsequent reaction cycles, rather than further internally, thus increasing the number of large amplicons at the expense of small ones. Furthermore, if a similar tag is added to the 5' end of the specific primer, further rounds of amplification using primers to the tag sequences can be used to augment the concentration of final product. The label (e.g. γ-32P-dATP etc.) may then be incorporated in the tag primer that anneals to the specific primer rather than in the specific primer itself.
The type of DNA polymerase used in the sequencing reaction is important. The enzyme must not have any 3' to 5' exonuclease activity, as this would allow a semi-degenerate primer to successfully amplify a product at a site that does not exactly correspond to its 3' end. The absence of such activity occurs in a number of enzymes available commercially for use in PCR (e.g. Thermus aquaticus (Taq) polymerase, Thermus thermophilus (Tth) polymerase); although this may reduce the fidelity of polymerisation, it should not affect the accuracy of the technique. A 5' to 3' exonuclease activity is useful, however, in reducing the tendency for semi-degenerate primer binding internally and thus reducing the frequency of short amplicons.
Reactions are set up using template DNA, deoxynucleotide triphosphates, a suitable buffer system, a suitable DNA polymerase and a single or set of semi-degenerate primer(s). Because the semi-degenerate primers can manufacture amplicons without the incorporation of the specific primer, and this may reduce the efficiency of specific primer generated reactions, a number of cycles at high stringency (with only the specific primer annealing) can be performed first. Also, the specific primer can be used in excess concentration. This may generate a number of single-stranded products with the specific primer at the ends (i.e. as in 'asymmetric PCR'). Although the reactions can be performed on whole genomic DNA, again the efficiency of specific primer-generated reactions is increased if the template is separated and/or purified before the reactions. After cycles of high stringency, the semi-degenerate primers can be added and the annealing temperature lowered. The exact annealing temperature for any set of semi-degenerate and specific primers may need to be determined empirically. Thereafter, cycles of PCR are repeated as usual with the elongation temperature appropriate to the DNA polymerase (e.g. 72°C for Taq polymerase) and denaturation steps at 96°C. The technique is also of use in determining the sequence of a specific mRNA, from extracted cell total RNA, after reverse transcriptase amplification with a single specific primer. Subsequent use of a further specific primer and a set of semi-degenerate primers may allow determination of the sequence without amplification of other mRNAs.
The choice of semi-degenerate primer sets influences the number of reactions and electrophoresis runs as well as the decoding for the final sequence solution. The simplest primer set, which uses four semi-degenerate primers in four separate PCR reactions each with a labelled specific primer, is shown below:
5'-tag-(R)x-A-3'    5'-tag-(R)x-T-3'
5'-tag-(R)x-G-3'    5'-tag-(R)x-C-3'
where R represents equal proportions of each nucleotide A, T, G, C or inosine, and tag represents a common specified DNA sequence of 10-20 nucleotides that does not share complementarity with itself or the chosen specific primer. The four reactions are run out on a non-denaturing gel (e.g. agarose or polyacrylamide) in separate lanes, or in the same lane if four different fluors have been used to label the specific primer in each reaction. The sequence can be deduced easily by the relative lengths of the bands occurring in each lane. Amplicons with semi-degenerate primers at each end do not appear on the gel as they are not labelled. Subsequent autoradiography or fluorescence detection is performed to reveal the specific primer-containing amplicons. A more sophisticated set of four semi-degenerate primers that specify the first two 3' positions is as follows:
5'-tag-(R)x-A/C-A/T-3'    5'-tag-(R)x-T/G-T/C-3'
5'-tag-(R)x-G/T-G/A-3'    5'-tag-(R)x-C/A-C/G-3'
Primers in which the two 3' positions can each be one of two nucleotides make up a set of four sequences that match their complementary sequences during the annealing step of the PCR reaction. There are 36 (= 6^2) such primers and 58,905 ways of choosing a set of four different ones (36C4). One of these combinations is represented above. This particular combination allows unambiguous sequence determination, as the pair of nucleotides at any one position in each primer is not reproduced exactly in the other position in any of the four primers. However, for the above two primer sets to work, each position of the template requires the annealing of two complementary primers and the successful amplification of product.
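The counts quoted for the two-position primers can be reproduced with a short sketch, which also checks the stated property of the example set. For simplicity the sketch matches each primer's 3' dinucleotide directly rather than its complement, and the variable names and the tuple encoding of the example set are assumptions made for illustration.

```python
from itertools import combinations, product
from math import comb

BASES = "ATGC"

# Six ways to choose two of the four bases for one position
pairs = [frozenset(p) for p in combinations(BASES, 2)]
# One two-base choice for each of the two 3' positions
primers = list(product(pairs, repeat=2))
# Ways of choosing four different primers
n_sets = comb(len(primers), 4)
print(len(pairs), len(primers), n_sets)

# The example set from the text as (3' position 1, 3' position 2)
example = [("AC", "AT"), ("TG", "TC"), ("GT", "GA"), ("CA", "CG")]

# No two-base choice used at one position is reused at the other position
first_pairs = {frozenset(a) for a, _ in example}
second_pairs = {frozenset(b) for _, b in example}
no_shared_pair = first_pairs.isdisjoint(second_pairs)

# Dinucleotides matched by each primer, and whether the 16 possible
# dinucleotides are each matched by exactly one primer of the set
matched = [a + b for first, second in example for a in first for b in second]
all_dinucleotides = [a + b for a in BASES for b in BASES]
each_matched_once = sorted(matched) == sorted(all_dinucleotides)
print(no_shared_pair, each_matched_once)
```

In this simplified reading, the four primers of the example set divide the sixteen possible dinucleotides between them without overlap.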
A larger set of semi-degenerate primers, as shown below, may allow each nucleotide position to be 'checked' as each would be complementary to two or three primers in the sequencing reactions:
5'-tag-(R)x-(all dimers)-3'    5'-tag-(R)x-(all trimers)-3'
This increase in accuracy is at the expense of a larger number of sequencing reactions for each DNA sequencing experiment (16 or 64 in this case). Further sets of semi-degenerate primers with specified (or semi-specified) nucleotides at each of the three 3' positions can be designed using similar principles.
This method only sequences one strand of the template DNA (unlike the subsequent embodiments) and so an anti-sense 'check' does not inherently occur. Also, the length of sequence determined in each experiment is limited by both the length of double-stranded amplicon that can be distinguished to the accuracy of a single nucleotide pair during electrophoresis (e.g. approx. 3000 bases) and the length that can be generated by the polymerase enzyme.
II. In a second embodiment of the invention, one or more semi-degenerate primer(s) with specified 3' ends are used in a set of individual sequencing reaction experiments such that each reaction contains different 3' sequences. A specific primer is not used. A number of such experiments are performed on purified template DNA. The size of the amplicons from each experiment is then determined. The final sequence can be solved knowing the length of nucleotide between the flanking primer(s) that were used in each experiment. The number of experiments that are necessary, together with the specificity of binding of the semi-degenerate primers that are needed, can be determined using statistical considerations.
The technique can be modified by the use of endonuclease enzymes to fragment the template in a predictable fashion prior to carrying out semi-degenerate primed PCR. The template polynucleotide needs to have been separated from other contaminating polynucleotides because a specific labelled primer is not used. Some contaminating sequence can however be tolerated during the reconstruction of sequence. However, significant contaminating polynucleotide would prevent the clean separation of legitimate amplicons during size analysis. One method to obtain such purification of a specific segment of DNA for example, without cloning, is to use a biotin/streptavidin or similar binding system. A biotinylated oligonucleotide with a sequence complementary to a part of the sequence to be determined is allowed to hybridise to total genomic DNA following a fractionation or endonuclease step. The bound fragment is then isolated using avidin or streptavidin capture. Other methods include the flow-sorting of nucleotide fragments and electrophoretic separation.
Furthermore, the desired template can be pre-amplified by prior PCR and subsequently purified by standard methods.
For any single nucleotide within the target polynucleotide to be included in the final reconstructed sequence, two events need to occur: firstly, the nucleotide must be bound to at least one primer within its 3' specified sequence; secondly, this specific nucleotide duplex must go on to be incorporated in an amplicon. The number of different specified 3' primer sequences that are necessary to sequence unambiguously does not need to include all the possible 3' nucleotide combinations. For example, considering the use of 5'-tag-(Random)x(N)3-3' primers, where (N)3 represents a specified 3mer sequence, not all 64 possible primers need be included. If they were, each nucleotide would be bound at 3 sites by 1 to 3 different primers. Various algorithms can be designed to determine a smaller set of 3mers that bind all possible nucleotides (with regard to their surrounding sequence) at least once.
For instance, an algorithm can be written that generates all possible combinations of say 30 4mers and tests each combination against all possible 7mer sequences. Any set that matches at least one of its 4mer sequences with every 7mer will therefore bind every possible nucleotide with respect to its surrounding sequence. Because of the large number of possible sets (64C30 = 1.6 x 10^18 in this case), such an algorithm may stretch microprocessor technology (10^9 instructions/s). Hence, a less accurate, more empirical algorithm is needed. By using such an algorithm the set of all possible 64 3mers can be reduced to sets of at least 36 3mers. One such set is shown below:
AAA, AAG, AAC, AGA, AGC, ACA, ACG, ACC, ATA, ATG, ATC, ATT, GGA, GGG, GGC, GGT, GCA, GTA, GTG, GTC, GTT, CGA, CGC, CCA, CCG, CCC, CTA, CTG, CTC, CTT, TGA, TGC, TCA, TTA, TTC, TTT
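The covering property of the 3mer set above can be checked by brute force. The following is a minimal sketch, not the algorithm used to derive the set: it assumes that the test described for 4mers against 7mers generalises to k-mers against (2k-1)-mers (so 3mers against 5mers), and it considers one strand only. The function name `covers_all_contexts` is illustrative.

```python
from itertools import product
from math import comb

BASES = "ACGT"

def covers_all_contexts(kmers, k):
    """Return True if every (2k-1)-mer contains at least one k-mer from the set.

    The central nucleotide of a (2k-1)-mer window lies inside each of the k
    overlapping k-mers of that window, so a True result means every nucleotide
    is matched at least once whatever its surrounding sequence (assumption:
    exact matching on a single strand).
    """
    chosen = set(kmers)
    window = 2 * k - 1
    for letters in product(BASES, repeat=window):
        context = "".join(letters)
        if not any(context[i:i + k] in chosen for i in range(k)):
            return False
    return True

# The 36-member 3mer set listed in the text.
SET_36 = (
    "AAA AAG AAC AGA AGC ACA ACG ACC ATA ATG ATC ATT "
    "GGA GGG GGC GGT GCA GTA GTG GTC GTT "
    "CGA CGC CCA CCG CCC CTA CTG CTC CTT "
    "TGA TGC TCA TTA TTC TTT"
).split()

print(len(SET_36), covers_all_contexts(SET_36, 3))
# Number of ways of choosing 30 sets out of 64, the figure quoted in the text.
print(comb(64, 30))
```

Checking one candidate set is cheap (1024 contexts for 3mers); the cost quoted in the text comes from the number of candidate sets, which is why an exhaustive search over all combinations is impractical.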
Similarly, the set of 4mers that bind to all nucleotides at least once can be reduced from 256 to at least 115 - one such set is shown below:
AAAA, AAGA, AACA, AATA, AATC, AGGA, AGGG, AGGC, AGGT, AGCA, AGCG,
AGCC, AGCT, AGTA, AGTG, AGTC, AGTT, ACGA, ACGG, ACGC, ACGT, ACCA,
ACCG, ACCC, ACCT, ACTA, ACTG, ACTC, ACTT, ATGA, ATGG, ATGC, ATGT,
ATCA, ATTA, ATTG, ATTC, ATTT, GAAA, GAGA, GACA, GATA, GATC, GGGA, GGGG, GGGC, GGGT, GGCA, GCGA, GCGG, GCGC, GCGT, GCCA, GCCG, GCCC,
GCCT, GCTA, GCTG, GCTC, GCTT, GTAA, GTGA, GTGG, GTGT, GTCA, GTTA,
GTTG, GTTC, CAAA, CAGA, CACA, CATA, CATC, CGAA, CGGA, CGGT, CGCA,
CCGT, CCCA, CCCG, CCCC, CCCT, CCTG, CTAA, CTCA, CTTA, CTTG, TAAA,
TAGA, TACA, TATA, TATC, TGAA, TGGA, TGGT, TGCA, TGTA, TGTG, TCGA, TCGG, TCGC, TCGT, TCCA, TCCG, TCCC, TCCT, TCTA, TCTG, TCTC, TCTT,
TTCA, TTTA, TTTG, TTTC, TTTT.
By reducing the number of semi-degenerate primers in this way, fewer sequencing reactions need to be performed. However, such a strategy does reduce the mean number of 'hits' for each nucleotide, so reducing the frequency of overlapping sequences during the final decoding and so increasing the consequent computer time. Another way to reduce the number of reactions is to group together 2 or more primers in each reaction. This influences the primer binding probability and so, also, the probability density function of amplicons. The number of amplicons produced using any one or more semi-degenerate primer(s) on a template can be modified by prior digestion with one or more endonuclease enzymes. This has the effect of reducing the number of long amplicons produced in each reaction. The use of endonuclease enzymes provides another variable together with the specificity and number of semi-degenerate primers that can influence the density of amplicons for a given template length.
As an example of the potential of this aspect of this invention, 64 simultaneous PCR reactions, labelled conventionally (e.g. with α32P-dCTP during the reactions), each containing a single 3mer semi-degenerate primer, after complete digestion with endonuclease enzymes that have a total probability of cutting of 2/4^5 (e.g. two 5-cutters or two of the form -g, A/T, g, C, A/T, C-), run out on a non-denaturing gel, would give useful information from 640 base pairs (1% probability of no amplicons) to 3760 base pairs (99% probability of no amplicons). This would have the power to determine the majority of sequence of a 90kb template (checking the antisense strand simultaneously) and could easily be accomplished in a day. (NB the same sets of enzymes could not be used in each reaction as the cutting sites would not then be sequenced.) This compares with the determination of one strand only, using 180 sequential Sanger sequencing runs and state-of-the-art fluorescent dideoxynucleotide chemistry over many days.
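As a rough illustration of the cutting probability quoted above, the sketch below treats each base of random DNA as an independent potential cut site with probability 2/4^5. This independence is an assumption, and the sketch does not attempt to reproduce the 640 and 3760 base-pair figures, which also depend on the primer binding probability. The function names are illustrative only.

```python
def cut_probability(n_enzymes=2, site_length=5):
    """Per-base cutting probability for enzymes with fully specified sites
    of the given length in random sequence (assumption: equal base usage)."""
    return n_enzymes / 4 ** site_length

def prob_uncut(length, p):
    """Probability that a stretch of `length` bases contains no cut site,
    assuming each position is cut independently with probability p."""
    return (1 - p) ** length

p = cut_probability()
print(p, 1 / p)  # per-base probability and mean spacing between cut sites
print(prob_uncut(640, p), prob_uncut(3760, p))
```

Under these assumptions the mean spacing between cut sites is 512 bases, so digestion mainly removes the longest amplicons, as the text describes.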
The reconstruction of the final sequence using the second method (no specific primer) is more complicated than the decoding necessary for the first method. Firstly, the length x of nucleotide is determined. This can be taken as either exact or to lie within specified limits, depending on the accuracy of nucleotide sizing. Secondly, a nucleotide sequence or set of sequences that defines the ends of each length x is determined. If n different semi-degenerate 3' end sequences are used in the reaction being considered, then this set will comprise all of these n sequences. Different algorithms can be designed to use these data to piece together the final sequence solution. One algorithm involves searching for lengths of amplicon in a determined fashion until a match, as follows, is made.
In the following example (A) represents a 3' primer sequence, or set of sequences, used in one semi-degenerate reaction (reaction 1). The complementary sequence or set of sequences A' has been used in another separate reaction (reaction 2). a, b, c and d represent lengths of amplicon occurring in these two reactions: a, c and d in reaction 1 and b in reaction 2.
In this systematic approach, a + b + c = d. After determining the most appropriate sequence, further segment lengths are sought in the amplicon sizes for these two reactions until the correct sequence relationship is established. This process is repeated for each of the reactions until the complete sequence is determined.
5'-A a A'-3'
5'-A' b A-3'
5'-A c A'-3'
5'-A d A'-3'
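The search for segment lengths satisfying a + b + c = d can be sketched as follows. This is a minimal illustration under stated assumptions: amplicon sizes are given as plain integers, primer lengths are ignored, and the `tolerance` parameter (not in the text) stands in for the "specified limits" of sizing accuracy. The function name is hypothetical.

```python
def find_length_matches(reaction1, reaction2, tolerance=0):
    """Return tuples (a, b, c, d) with a, c, d taken from reaction 1 and
    b from reaction 2 such that a + b + c equals d within `tolerance`.

    Only a <= c is listed, because lengths alone cannot tell which of the
    two flanking segments lies on which side of b.
    """
    lengths1 = sorted(set(reaction1))
    lengths2 = sorted(set(reaction2))
    matches = []
    for i, a in enumerate(lengths1):
        for c in lengths1[i:]:
            for b in lengths2:
                for d in lengths1:
                    if abs(a + b + c - d) <= tolerance:
                        matches.append((a, b, c, d))
    return matches

# Hypothetical amplicon sizes (in bases) from the two reactions.
print(find_length_matches([120, 200, 470], [150]))
```

Each match is only a candidate arrangement; as the text notes, further segment lengths are then sought until the correct sequence relationship is established.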
This embodiment of the invention is an excellent way of generating polynucleotide fragments from a template nucleotide for the use of hybridisation onto an oligonucleotide array. The product from each semi-degenerate reaction can be fragmented using techniques already described (e.g. Chee et al., Science 1996; 274:610-) and hybridised to an array. When analysing the sequence generated from different reactions (using different semi-degenerate primers), any non-contiguous repeat sequences could be unambiguously ordered as each repeat would turn up in differently labelled amplicons. Further, concomitant sizing of the reaction products (e.g. using electrophoresis) may allow the determination of the length of contiguous repeat sequences as well. Overcoming these two limitations of hybridisation sequencing, remembering the enormous power of the technique to sequence megabase nucleotide, is an attractive proposition. However, this particular embodiment of the invention does not address the important limitation of cross-hybridisation to an array. To address this, a third and potentially much more powerful embodiment has been devised.
III. In summary, this third embodiment of the invention proceeds through repeated polymerase reactions with specific pairs of primers, one of which is a semi-degenerate primer with a defined 3' end. The other primer is designed so that it has a nucleotide sequence that is specific to the individual semi-degenerate primer and the individual step in the overall series of reactions. This primer also contains sequences that are recognised by corresponding primers in subsequent reactions so that, after several reactions, the polymerase product contains one end labelled with multiple primers, which define the order of binding and identity of the semi-degenerate primers to the template polynucleotide.
Therefore, as the series of reactions proceeds, one end of the template polynucleotide will be shortened as subsequent semi-degenerate primers hybridise 3' to the last one, and one end will be lengthened, as the second primer is incorporated into the polymerase product. Ultimately, no template will remain, and the polymerase product will consist of a defined series of nucleotides. The products can then be hybridised exclusively to specific addresses on an oligonucleotide array. The need for accurate hybridisation of random unknown target sequence is confined to a small number of bases at the 3' ends of semi-degenerate primers. Cross-hybridisation at these bases is unlikely, due to the failure of DNA polymerase action when mismatches occur. This is in contrast to the need for accurate hybridisation of longer random sequences to an array. Furthermore, a series of reactions can have differing annealing temperatures to ensure the fidelity of 3' primer binding for different sequences. Because of the very parallel nature of this method, many thousands of such random sequences can be translated and hybridised in one experiment and the method would be powerful enough to sequence, in parallel experiments, individual mammalian chromosomes and possibly complete genomes.
The basic strategy involves an initial single semi-degenerate sequencing reaction to generate amplicons from the template sequence using a set of semi-degenerate primers. The reactions are carried out so that each semi-degenerate primer of the set is contained in a separate compartment. The resulting amplicons are used in the subsequent cycle of reactions.
A number of sequential semi-degenerate primed reactions are interspersed with complete mixing of the reaction products and further separation for another round of reactions. Initial amplicons from the first reaction on the template sequence are labelled differently at each end, and only asymmetrically labelled amplicons 'survive' the subsequent processing. This enables one end of the amplicon to be used for 'tagging' and the other for sequencing. Subsequent division of the reactants into separate compartments followed by semi-degenerate sequencing reactions with primers specific to each compartment allows the progressive translation of random sequence at one end into designed tags at the other as the cycles of mixing, separation and polymerisation reactions continue. Whilst the number of determined sequences increases exponentially, the number of primers needed for the sequencing strategy increases only arithmetically.
A final PCR reaction on all the amplicons amplifies only the complete tagged sequence which is then used to hybridise to a specially designed array. Each tag contains sequences that represent the primer ends of the initial amplicon as well as a unique contiguous sequence within that amplicon. Subsequent decoding allows the determination of a proportion of the original template. A number of similar experiments are performed with distinct initial primers to ensure a complete coverage of the template. Decoding can then identify all unique sequences and accurately order sequences between non-contiguous repeats. The only ambiguity remaining will be the determination of the number of contiguous repeat sequences (e.g. di-, tri- and tetra-nucleotide repeats). Sizing (e.g. electrophoresis) of a set of reaction products will then allow the determination of the lengths of concomitant repeat sequences.
The technique depends upon the generation of amplicons in which each end is bound to a different primer, one end being the 'sequencing end', the other being the 'labelling end'. Hence, before the occurrence of reactions that translate internal amplicon sequence into tag sequence, asymmetric amplicons with regard to their 5' tag sequences need to be available for further amplification, whilst it is important that symmetric amplicons do not 'survive' to the final analysis step.
The 'sequencing end' is bound internally by subsequent primers to its 5' end and this ensures that symmetric amplicons with two sequencing ends do not obfuscate the final hybridisation analysis (as the final reaction can only take place if the original 5' tag is available for primer binding). However, the 'labelling end' is bound at the 5' end. Symmetric amplicons with two labelling ends, if allowed to occur, would undergo subsequent rounds of amplification and be available for the final hybridisation step. Such amplicons can be eliminated by incorporating a separation step in which only amplicons with one or both 'sequencing ends' are retained. This step is undertaken after step A in the following example. It could also be undertaken after step B. This latter method would have the disadvantage that one cycle of reaction mixing will have already occurred and so the reactants would need to be transferred from apparatus designed for cycles of mixing and separation. If this step is undertaken after step B, then this would involve the incorporation of a biotin or similar label in the sequencing primer (primer B1). After step B, the reactants are pooled and captured using an avidin/streptavidin (or similar) system so eliminating amplicons with two 'labelling ends'. Many variations exist within the basic strategy described above.
In the following description, 'Tag' sequences are domains of primer 5' ends that are specific to both the step of the process and the compartment. For example TagB is specific to step B, Tag1 is specific to compartment 1 and so TagB1 was used during step B in compartment 1. 'C' sequences are domains of primer 5' ends that are specific to each step but common to all compartments. Cc has been used in all compartments during step C. 'N' sequences represent the specified 3' sequence of the semi-degenerate primer. Subscripts x, y etc represent these different 3' specified sequences. (R)n represents sequences within primers that have randomly applied nucleotides at each position, a universal base e.g. inosine, or a combination of the two. The steps of the process are designated A, B, C etc and the compartments 1, 2, 3, etc.
Step A
The first reaction mixture includes the template to be sequenced which can be any purified source of contiguous polynucleotide (e.g. flow-sorted whole mammalian chromosomes). The choice of semi-degenerate primers is determined by the template length and size of array available; the primer binding probability may be adjusted to include a desired proportion of nucleotides in the subsequent amplicons. The annealing temperature is determined empirically for each reaction. In this first single reaction, if a number of such primers (n) is used, this gives n(n+1)/2 types of amplicon with respect to amplicon ends. The semi-degenerate primers for step A are designed to a blueprint similar to the following:
Step A primers
5'-biotin-CAd-TagAxseq-CAc-
Primer Ab 5'-CAb-TagAxlab-CAa-(R)n-(Nx)n-3'
(Figures imgf000021_0001 and imgf000021_0002: primer diagrams, not recovered in the text.)
Here, 'N' sequences represent the specific 3' ends of the semi-degenerate primers. For each 'N' sequence two primers are used, Aseq and Alab, in a 1:1 proportion. The 'N' sequences are designed to have similar annealing temperatures (the N sequences do not necessarily need to bind all possible nucleotides in the template but instead need to generate a proportion of all template within the resulting amplicons). Primer Aseq will become incorporated as the sequencing end of amplicons and is biotinylated to allow its retrieval (above). Primer Alab will become incorporated as the labelling strand and is not biotinylated. C regions represent 10-20mer tagging sequences common to all the primers in the reaction, CAa, CAb, etc being different from each other, but which are specific to step A and occur in each primer during step A. TagAxseq and TagAxlab represent 10-20mer tagging sequences which are different to each other and are specific to the primer with the 3' end of (Nx) and to step A. (R)n represents a number of random nucleotides, inosine or a combination thereof.
Following step A, an avidin/streptavidin capture step (or similar technique) is incorporated to eliminate Alab-Alab amplicons. Alternatively, this can be done after step B (see above) .
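The n(n+1)/2 count of amplicon types and the effect of the capture step can be illustrated by enumerating unordered pairs of end primers. This is a sketch only: the primer labels ("seq", "lab", "N1" and so on) are placeholders, and capture is modelled simply as retaining any amplicon type with at least one biotinylated 'seq' end.

```python
from itertools import combinations_with_replacement

def amplicon_end_types(primers):
    """All unordered pairs of end primers, including both ends the same."""
    return list(combinations_with_replacement(primers, 2))

def retained_after_capture(end_types):
    """Keep amplicon types with at least one biotinylated 'seq' end,
    i.e. discard the lab-lab types."""
    return [pair for pair in end_types
            if any(kind == "seq" for kind, _ in pair)]

n_sequences = ["N1", "N2", "N3"]                 # hypothetical 3' end sequences
by_3prime_end = amplicon_end_types(n_sequences)  # n(n+1)/2 types

# Two primers (seq and lab) per N sequence, as in the step A blueprint.
primers = [(kind, n) for n in n_sequences for kind in ("seq", "lab")]
all_types = amplicon_end_types(primers)
kept = retained_after_capture(all_types)

print(len(by_3prime_end), len(all_types), len(kept))  # 6 21 15
```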
For the following reactions, the following set of cycles occurs:
• equal division into compartments and addition of reagents (i.e. primers, polymerase, dNTPs etc)
• compartment-specific semi-degenerate reactions
• even mixing of all reagents
• a dilution step (optional)
• equal division into compartments and addition of reagents etc
Following step A (and streptavidin capture), the reaction product is divided into separate compartments suitable for polymerase reactions (e.g. Eppendorf tubes, PCR silicon chips, a specifically designed chamber with a facility to erect small water-tight separating walls to create temporarily a number of non-communicating individual compartments). For a large number of compartments (e.g. 64) manual manipulation (pipetting) is laborious and the use of such specially designed apparatus is of benefit.
Step B
The second reaction step (step B) is designed to separately label each amplicon end such that the 'labelling end' is labelled with a tag sequence specific to the 'sequencing end' of the same amplicon. Subsequent hybridisation at the end of the sequencing experiment will then reveal that the random sequence in question occurred within an amplicon derived in step A from the two specific primer sequences encoded by the first two tag sequences. This is achieved by using each separate compartment to amplify only those amplicons with a specific (or a set of specific) 'sequencing ends'. This principle of the selective amplification of amplicons in each compartment is used in subsequent steps. A typical pair of primers used in this step would be as follows:
Step B Primers
B1) 5'-CAd-TagA1seq-CAc-(R)n-N1-3'
B2) 5'-CB-TagB1-CAb-3'
These two primers relate to compartment 1 in this example. Primer B1 is designed to bind only to the 'sequencing end' of amplicons that contain the two sequences N1 and TagA1seq. Amplicons with other 'sequencing ends' are not amplified in compartment 1. Primer B2 binds to the CAb sequence of the labelling end of amplicons, applying a tag sequence TagB1 that identifies this amplicon as having an N1 sequence at the other end. It also applies a new step-specific CB sequence which will allow subsequent reactions to take place. After this step, two types of amplicon will be formed in each compartment with regard to amplicon ends:
B1-B2 and B1-B1
B2-B2 amplicons will not exist because amplicons without at least one 'sequencing end' will have been eliminated (as above). B1-B2 amplicons contain the sequence CAa which will be required for the final step that generates tags for hybridisation to an array. B1-B1 amplicons do not possess this sequence and so will not be included in the final sequence analysis. Furthermore, B1-B1 amplicons will only occur when an amplicon has the same compartment-specific sequence at each end, a rarer event than B1-B2 amplicons, which require only that there is a compartment-specific sequence at one end.
At the end of step B, B1-B2 amplicons will have at their ends the following sequences:
B1 ends 5'-CAd-TagAxseq-CAc-(R)n-Nx-3'
B2 ends 5'-CB-TagBx-CAb-TagAylab-CAa-(R)n-Ny-3'
The B2 end codes for the ends of the relevant amplicon from step A with the tags TagBx and TagAylab, which in this case code for the step A primers with the specific end sequences of Nx and Ny respectively. Following step B, the reactants from each compartment are evenly mixed and again divided into a number of compartments for reactions with a pair of compartment-specific primers.
Step C
Subsequent steps are concerned with identifying sequences within the B1-B2 amplicon. The primers used are designed so that a hybridisation event occurring within this region generates a specific tag sequence at the labelling end. Each compartment, by design of its specific primers, will identify a different nucleotide sequence. The design of one of this pair of primers for binding to the B2 or labelling end of the amplicons is straightforward. An example, termed primer C2, is shown below:
Primer C2 5'-Cc-TagC1-CB-3'
Here domain CB binds to all 'labelling ends' (or B2 ends) of B1-B2 amplicons. However, successful amplification only occurs if the other end has in it a sequence specific to the particular compartment. Hence, primer C2 carries a tag sequence TagC1 that codes for this compartment-specific sequence Nx and hence adds it to the two other tag sequences at this 'labelling end' that identify the original step A primers for that amplicon. Domain Cc is common to all compartments in step C and will allow subsequent primer binding at the labelling end in the next step, step D.
The design of the sequencing end primers
The design of the sequencing end primers can be accomplished in a number of ways. Firstly, a primer can be designed to interrogate the nucleotides immediately adjacent and upstream of the original step A primer sequence. For example, the primer could interrogate the three nucleotides -XXX- in the following B1 end:
5'-CAd-TagAxseq-CAc-(R)n-Nx-XXX-3'
and a suitable primer (C1) to accomplish this might be:
Primer C1(a) 5'-[mismatch domain]-CAc-(R)n-XXX-3'
Here, domain CAc, because it is unique to all B1 ends, only allows binding of primer C1 to the 'sequencing ends' of amplicons. The 5'-terminal mismatch domain is a sequence of, for example, 3 nucleotides that are introduced as mismatches and allow for efficient primer hybridisation in the subsequent step. (R)n spans all (R)n-Nx sequences of B1 ends so that all B1 ends are interrogated. XXX is the compartment-specific sequence that interrogates the upstream three nucleotides and only allows subsequent amplification if this specific sequence is present (in this case only 1 in 64 such amplicons, on average, would be amplified). If the sequences XXX were completely non-degenerate then, for complete sequencing, 64 compartments would be required. If XX non-degenerate sequences were used, 16 compartments would be required. Also, if an A/T-XX sequence was used, in which the first 5' nucleotide can be one of two bases and the other two are specific, 32 compartments would be required, and so on. The advantage of this particular kind of primer C1, because of the presence of the domain CAc, is that it allows the annealing temperature of primer C1 to easily match that of primer C2. However, the main disadvantage is that only nucleotides directly adjacent to step A 3' primer sequences are interrogated and sequenced. This limits the number of outcomes and would require a larger number of step A primers to be used in further experiments to ensure that the majority of template is sequenced. To overcome this limitation, a C1 primer can be designed so that it binds anywhere within the amplicon sequence. An example of such a primer is as follows:
Primer C1(b) 5'-TagCseq-(R)n-(N)n-XXX-3'
This primer will bind wherever the specific sequence XXX occurs within an amplicon. However, the annealing temperature for efficient amplification with such a primer is likely to be low and not correspond with the annealing temperature of primer C2. One way round this is to incorporate one or more 'anchor' non-random nucleotides immediately downstream from the sequence XXX, here shown as sequence (N)n, which are common to all compartments. This limits the sequences that will be interrogated but if n < 3 such a limitation is easily overcome by the fact that both strands of template are being sequenced and that further experiments with differing anchor sequences can also be performed on the same template. A further way to overcome the likely different annealing temperatures of primers C1 and C2 is to use a number of reaction cycles with only the C2 primers at the appropriate annealing temperature, such that a number of single strands are generated before the addition of C1 primers and lowering of annealing temperature.
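The compartment counts quoted for the interrogating sequence (64 for XXX, 16 for XX, 32 for A/T-XX) follow from the number of bases each primer accepts at each interrogated position. A minimal sketch, with a hypothetical function name and the assumption that every position accepts 1, 2 or 4 bases:

```python
def compartments_needed(accepted_per_position):
    """Number of compartments needed so that every possible sequence at the
    interrogated positions is covered exactly once.

    `accepted_per_position` lists, for each interrogated position, how many
    of the four bases a single compartment's primer accepts there
    (1 = fully specified, 2 = two-fold degenerate such as A/T, 4 = any base).
    """
    total = 1
    for accepted in accepted_per_position:
        if accepted not in (1, 2, 4):
            raise ValueError("each position must accept 1, 2 or 4 bases")
        total *= 4 // accepted
    return total

print(compartments_needed([1, 1, 1]))  # XXX    -> 64
print(compartments_needed([1, 1]))     # XX     -> 16
print(compartments_needed([2, 1, 1]))  # A/T-XX -> 32
```

The same number gives the average dilution of the pool: with fully specified XXX, about 1 in 64 interrogated ends is amplified in any one compartment.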
For the sake of this specific example of the process, a primer of the C1(b) type will be used containing two anchor nucleotides -AG- as follows:
Primer C1 5'-TagCseq-(R)6-AG-XXX-3'
Hence, following step C, there will be two types of amplicons with respect to amplicon ends, amplicons C1-C2 and C1-C1. The amplicon ends will be as follows:
C1 ends 5'-TagCseq-(R)6-AG-XXX-3'
C2 ends 5'-Cc-TagCz-CB-TagBx-CAb-TagAylab-CAa-(R)n-Ny-3'
TagCz corresponds to the compartment-specific XXX sequence at the other end of the amplicon. Again C1-C1 amplicons will not be amplified in the final reaction due to the absence of a CAa sequence. They will also become less numerous during subsequent steps as only C1-C1 amplicons with the same XXX sequences at each end will be further amplified.
After step C, the amplicons are further interrogated with respect to the nucleotides 3' to the sequencing ends. These subsequent steps are similar in principle. One such step, step D, will be described here. The number of subsequent steps will be determined by the size of the oligonucleotide array available for hybridisation, the detrimental effects of the accumulation of primers from previous steps (see below), and the mutation rate of the DNA polymerase. A larger number of steps increases the length of contiguous nucleotide that is encoded in each sequence and also increases the outcome set for each experiment.
Step D
Step D occurs after the even mixing of all step C reactants and the subsequent division into compartments. A pair of primers is used in each compartment, one each to bind to the sequencing and labelling ends. An example of such a primer pair, used in compartment 1, is shown below:
Primer D1 (sequencing end) 5'-TagCseq-(YYY)Dnew-(R)3-AG-(R)3-XXX-3'
Primer D2 (labelling) 5'-CD-TagD1-Cc-3'
Here D2 labels the 'labelling end' of each amplicon by virtue of the binding of domain -Cc- and labels the amplicon with a tag specific to step D and compartment 1 encoding for the unique sequence XXX. Primer D1 interrogates the 3 nucleotides 3' to those previously interrogated in step C. The presence and binding of sequences TagCseq and AG allows the annealing temperature to be increased to match primer D2. It also prevents internal binding within the amplicon and so only contiguous sequence is interrogated. Sequence (YYY)Dnew is a sequence of 3 nucleotides common to all compartments that is introduced as mismatches and allows for efficient primer hybridisation in the subsequent step.
Further steps are performed after mixing reactants and subsequent division into compartments, to interrogate further sequences and appropriately label labelling ends with tags. For instance step E might comprise the following two primers:
Primer E1 (sequencing end) 5'-TagCseq-(YYY)Dnew-(ZZZ)Enew-AG-(R)6-XXX-3'
Primer E2 (labelling) 5'-CE-TagE1-CD-3'
XXX being the compartment-specific sequence used to interrogate amplicon sequence and corresponding to the tag sequence TagE1 on the labelling end.
The final PCR reaction and hybridisation to an array
After the sequencing steps the reactants are again mixed together. A final PCR step is performed to amplify the tag sequences on the labelling ends of the amplicons. Only amplicons that have completed each step will be so amplified and used in the hybridisation step. If steps through from A to F have occurred the 'labelling end' of surviving amplicons will look as follows:
F2 ends 5'-CF-TagF6-CE-TagE5-CD-TagD4-Cc-TagC3-CB-TagB2-CAb-TagA1-CAa-(R)n-(N1)n-3'
This sequence contains 6 different tag sequences that together encode for the original amplicon ends as well as a contiguous sequence of 12 nucleotides contained somewhere within that amplicon on the strand shared by sequence B2.
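How a chain of six tags maps to two step A primer identities plus 12 contiguous nucleotides can be sketched as an encode/decode pair. Everything here beyond the six-tag layout is an assumption made for illustration: compartment i in steps C to F is taken to interrogate the i-th 3mer in alphabetical order, tags are represented as (step, value) tuples rather than oligonucleotide sequences, and the 3mers are concatenated in the order in which the steps were performed.

```python
from itertools import product

# Assumed mapping: compartment index i <-> i-th 3mer in alphabetical order.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]
SEQUENCING_STEPS = "CDEF"

def encode_label_end(lab_primer, seq_primer, contiguous):
    """Build the tag chain in the order it appears on the labelling end,
    i.e. last step first (F, E, D, C, B, A)."""
    if len(contiguous) != 3 * len(SEQUENCING_STEPS):
        raise ValueError("expected 3 nucleotides per sequencing step")
    tags = [("A", lab_primer), ("B", seq_primer)]
    for n, step in enumerate(SEQUENCING_STEPS):
        trimer = contiguous[3 * n:3 * n + 3]
        tags.append((step, TRIMERS.index(trimer)))
    return tags[::-1]

def decode_label_end(tags):
    """Recover the step A primer identities and the contiguous sequence."""
    by_step = dict(tags)
    contiguous = "".join(TRIMERS[by_step[step]] for step in SEQUENCING_STEPS)
    return by_step["A"], by_step["B"], contiguous

chain = encode_label_end("Ny", "Nx", "ACGTTGCAAGGC")
print(chain)
print(decode_label_end(chain))
```

In the method itself this decoding corresponds to reading which array address a product hybridises to; the tuple representation stands in for the designed tag sequences.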
If 64 primers are used in step A, and 64 pairs of primers are used in 64 compartments in the subsequent steps, then the outcome set for this reaction will be 64^6, that is 6.8 x 10^10. To extract the full amount of information from this sequence a corresponding array with 6.8 x 10^10 discrete addresses will be required. The outcome set for such a hybridisation experiment will be 2 to the power 6.8 x 10^10, which is comparable to the diversity of the diploid human genome. This is why the number of steps performed (and the number of compartments in each step) depends upon the size of the oligonucleotide array that is available for hybridisation. The total number of different primers required, however, to generate such massive diversity is only (64 x 2 x 6) = 768. This same primer set can be used for any template of similar size. In this final PCR step, the pair of primers used would be as follows:
Final primers 5'-CF-3' and 5'-(CAa)'-3'
(CAa)' is complementary to the sequence CAa. This final step can be used to generate RNA using a T7 RNA polymerase promoter or a fluor-labelled single stranded DNA in asymmetric PCR for hybridisation to an array as described by Hacia et al and Shoemaker et al respectively (Nature Genet 1996; 14:441-449). The form of the final sequences generated for hybridisation will be as follows:
Final hybridisation sequence: 5'-CF-TagF6-CE-TagE5-CD-TagD4-Cc-TagC3-CB-TagB2-CAb-TagA1-CAa-3'
Each address on the array would have a nucleotide sequence of this form. The set of (64 x 7) = 448 Tag sequences will have been specifically designed so that they do not cross-hybridise to a complementary sequence of any other sequence within the set. This could easily be achieved using a subset of the 16384 possible 7mers for example. To increase the specificity of binding, the C domains of the array can be designed to have a number of mismatches with respect to the C domains in the hybridisation sequences.
The problem of contaminating primers
One problem with this technique is the accumulation of primers in the reaction mixture from previous steps following step B (above). One way to eliminate such primers is to purify the amplicons, using standard methods for purifying PCR products, after each step. This would, however, add a further manipulation to each step, and make automation of the whole experiment more complex. One way to reduce the chances of this happening, other than purifying products, is to incorporate a dilution step after step B when the reactants have been mixed. The only component of the reactants to have increased in concentration will be the successfully amplified products from the previous step. Hence by diluting the reactants, such amplicons would be the only species to survive in appreciable concentration. Further polymerase enzyme, dNTPs and buffer solution etc will need to be added before the subsequent steps. Also, the number of cycles of reaction for each step should be high enough to ensure adequate concentration of amplicons after the dilution step. This process could be easily automated within a system specially designed for the repetitive mixing and dividing of reactants.
A further method to eliminate unwanted primers from previous reaction steps is to add oligonucleotides whose 3' ends are complementary to such primers (and not to the primers of the present reaction step) so that dimers can be formed and the 3' ends of the primers be 'neutralised' after polymerase elongation. Such a nucleotide that neutralises B1 primers (but not C1 ones) is shown below.
5'-CAd-TagA1seq-CAc-(R)n-N1-3' Unwanted B1 primer
3'-CAd'-TagA1seq'-CAc'-(R)n+3-AAAAAAAAAAA-5' neutralising primer
Hence the B1 primer 3' end is neutralised by the addition of a number of T bases. Similar neutralising oligonucleotides can be added at subsequent reaction steps.
Final sequence reconstruction
The addresses on the oligonucleotide array will be easily decoded into a contiguous nucleotide sequence on a specific strand of a specific amplicon from step A (with respect to primer ends). Such sequences can be easily grouped together regarding the strand and amplicon. Overlapping sequences are sought and long contiguous sequences are built up. Non-contiguous sequences that are repeated in the template of, regarding the above example, 11 or more nucleotides can be unambiguously ordered as each will occur in one or more different step A amplicons. If the elongation step of the step A reaction is restricted to the generation of amplicons 3000 bases or less then the probability of having 3 or more sequences of 11 or more nucleotides repeating in random DNA in the same amplicon is very small. Even if this did occur, the inclusion of one such repeat in another amplicon would allow unambiguous ordering.
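The step in which overlapping sequences are sought and built up into long contiguous sequences can be sketched with a simple greedy merge. This is one possible approach, not a procedure specified in the text: it assumes error-free fragments that have already been grouped by strand and step A amplicon, and the `min_overlap` threshold is an illustrative parameter.

```python
def overlap(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of
    `right`, or 0 if it is shorter than `min_overlap`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def assemble(fragments, min_overlap=4):
    """Greedily merge the pair of contigs with the largest overlap until
    no pair overlaps by at least `min_overlap` bases."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Hypothetical 12-nucleotide decoded sequences from one strand of one amplicon.
template = "ACGTTGCAAGGCTTACGATC"
reads = [template[0:12], template[4:16], template[8:20]]
print(assemble(reads))
```

A greedy merge of this kind can be misled by repeats, which is why the grouping by step A amplicon described above matters: a repeat is resolved when it also appears in a different amplicon.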
The only ambiguity that remains to be solved is the length of contiguous repeats such as mini- and micro-satellite sequences. Separation and sizing of amplicons following a semi-degenerate reaction of a set or subset of step A primers, or following the further amplification of amplicons from specific compartments, would allow the unambiguous sizing of many such contiguous repeat sequences simultaneously. This strategy, by virtue of its highly parallel processing of short nucleotide sequences, therefore allows the unambiguous determination of many millions of bases in a polynucleotide sequence. It has the power, ultimately, to determine the majority of the sequence of whole chromosomes and whole genomes in a manageable number of experiments.

Claims

1. A method for analysing the sequence of a target polynucleotide, comprising the steps of: a) reacting the target polynucleotide with oligonucleotide primers, a polymerase enzyme, and the other reagents necessary for the polymerase reaction, wherein the oligonucleotide primers are chosen such that polymerase products of varying lengths are produced; and b) analysing the products of step (a); wherein the oligonucleotide primers include an array of semi-degenerate primers whose 3' ends comprise variations of one or more nucleotides A, T, C and G such that the array is complementary to all the polynucleotide sequence.
2. A method according to claim 1, wherein the semi-degenerate primers are of the formula:
5'-tagseq-(R)n-(N)m-3'
wherein (R)n represents a nucleotide sequence in which each base position is a universal base and/or has an equal probability of containing each of the nucleotides A, T, G and C; (N)m represents the 3' ends as defined in claim 1; and 'tagseq' represents an optional nucleotide sequence capable of characterising individual primers of the array.
3. A method according to claim 1 or claim 2, wherein the oligonucleotide primers additionally include one or more non-degenerate primers.
4. A method according to any of claims 1 to 3, wherein step (a) additionally comprises one or more reaction cycles, each cycle comprising the steps of: i) separating the polymerase products into a set of non-communicating reaction compartments; ii) reacting the polymerase products with an array of oligonucleotide primer pairs, a polymerase enzyme, and the other reagents necessary for the polymerase chain reaction; and iii) mixing together the reaction products of step (ii); wherein each pair of oligonucleotide primers of the array includes at least one semi-degenerate primer as defined in claim 1 and each reaction compartment comprises a different oligonucleotide primer pair of the array.
5. A method according to claim 4, wherein the array of oligonucleotide primer pairs comprises an array of oligonucleotide primers defined, at the 5' end, by the formula:
5'-CB-tagx-CA-3';
wherein 'tagx', CA and CB represent different nucleotide sequences that are mutually non-complementary, CA and CB are common to all or a sub-set of oligonucleotide primers used in each reaction cycle, and 'tagx' is capable of characterising individual primers of the array.
6. A method according to claim 5, wherein CA and CB are identical to the oligonucleotide primers in the preceding or succeeding reaction cycles respectively.
7. A method according to any of claims 4 to 6, wherein all or a subset of the pairs of oligonucleotide primers comprise one semi-degenerate primer as defined in claim 2 and one oligonucleotide primer as defined in claim 5.
8. A method according to any of claims 4 to 7, wherein each reaction product comprises at one end an oligonucleotide primer sequence that overlaps with the corresponding sequence on the next reaction product and represents a specific sequence on the target polynucleotide.
9. A method according to claim 8, wherein the oligonucleotide primer sequences are amplified separately using the polymerase chain reaction.
10. A method according to any of claims 4 to 9, wherein the reaction products are analysed by their hybridisation to a panel of compartmentalised, immobilised oligonucleotides and by observing the location of positive hybridisation.
11. A method according to claim 10, wherein the primers are as defined in claim 5 and are all mutually non-complementary and hybridise only to one oligonucleotide in the panel.
12. A method according to any of claims 4 to 11, wherein the reaction products are separated before further reaction cycles, and/or before the final amplification step of is/are allowed to occur.
13. A method according to claim 12, wherein the separation step comprises the specific capture of a subset of reaction products containing oligonucleotide primer molecules labelled with a capture moiety.
14. A method according to any of claims 4 to 13, wherein the oligonucleotide molecules that are added during step ii) bind specifically to, and allow 5' to 3' polymerase elongation of, a sub-set of the oligonucleotide primers used in previous reaction cycles thus neutralising the oligonucleotide primer 3' ends.
15. A method according to any of claims 1 to 3, wherein any or all of the oligonucleotide primers are detectably labelled.
16. A method according to any of claims 1 to 3, wherein the analysis comprises gel electrophoresis.
17. A method according to any of claims 1 to 3, wherein the analysis comprises hybridisation of said products to a panel comprising an array of immobilised oligonucleotide molecules.
18. A method according to any preceding claim, wherein said 3' ends of said semi-degenerate primers are 1 to 10 nucleotides in length.
19. A method according to claim 18, wherein said 3' ends of said semi-degenerate primers are 3 or 4 nucleotides in length.
20. A method according to any preceding claim, wherein any or all of the oligonucleotide primers incorporate an RNA polymerase-binding site.
21. A method according to any preceding claim, wherein the polymerase enzyme lacks a 3' to 5' exonuclease activity.
22. A method according to any preceding claim, wherein the target polynucleotide is genomic DNA or cDNA.
23. A method according to any of claims 1 to 21, wherein the target polynucleotide is messenger RNA (mRNA), and comprises a preceding reaction using a reverse transcriptase enzyme.
24. A multi-container unit, each container containing one of an array of semi-degenerate primers as defined in any preceding claim.
PCT/GB1998/001233 1997-04-28 1998-04-28 Polynucleotide sequencing using semi-degenerate primers WO1998049341A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP98919324A EP0979307A2 (en) 1997-04-28 1998-04-28 Polynucleotide sequencing using semi-degenerate primers
AU72203/98A AU7220398A (en) 1997-04-28 1998-04-28 Polynucleotide sequencing using semi-degenerate primers

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB9708606.0A GB9708606D0 (en) 1997-04-28 1997-04-28 Sequencing
GB9708606.0 1997-04-28

Publications (2)

Publication Number Publication Date
WO1998049341A2 (en) 1998-11-05
WO1998049341A3 (en) 1999-01-28

Family

ID=10811491

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1998/001233 WO1998049341A2 (en) 1997-04-28 1998-04-28 Polynucleotide sequencing using semi-degenerate primers

Country Status (4)

Country Link
EP (1) EP0979307A2 (en)
AU (1) AU7220398A (en)
GB (1) GB9708606D0 (en)
WO (1) WO1998049341A2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996012014A1 (en) * 1994-10-13 1996-04-25 Lynx Therapeutics, Inc. Molecular tagging system
WO1996041893A1 (en) * 1995-06-09 1996-12-27 The University Of Tennessee Research Corporation Methods for the generation of sequence signatures from nucleic acids

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDERSSON S ET AL: "THE MOLECULAR BIOLOGY OF ANDROGENIC 17BETA-HYDROXYSTEROID DEHYDROGENASES" JOURNAL OF STEROID BIOCHEMISTRY AND MOLECULAR BIOLOGY, vol. 53, no. 1/06, June 1995, pages 37-39, XP000196679 *
TELENIUS H ET AL: "DEGENERATE OLIGONUCLEOTIDE-PRIMED PCR: GENERAL AMPLIFICATION OF TARGET DNA BY A SINGLE DEGENERATE PRIMER" GENOMICS, vol. 13, 1992, pages 718-725, XP000199116 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6946249B2 (en) 1997-11-21 2005-09-20 Beckman Coulter, Inc. De novo or “universal” sequencing array
EP1088106A1 (en) * 1998-06-16 2001-04-04 Orchid BioSciences, Inc. Polymerase signaling assay
EP1088106A4 (en) * 1998-06-16 2002-02-06 Orchid Biosciences Inc Polymerase signaling assay
JP2002518024A (en) * 1998-06-16 2002-06-25 オーキッド・バイオサイエンシーズ・インコーポレイテッド Polymerase signal formation assay
US6872521B1 (en) 1998-06-16 2005-03-29 Beckman Coulter, Inc. Polymerase signaling assay
EP1181389A1 (en) * 1999-04-30 2002-02-27 Takara Shuzo Co, Ltd. Method of amplification of nucleic acids
EP1181389A4 (en) * 1999-04-30 2004-11-03 Takara Shuzo Co Method of amplification of nucleic acids
WO2003074698A1 (en) * 2002-03-06 2003-09-12 Takara Bio Inc. Method of determining base sequence of nucleic acid

Also Published As

Publication number Publication date
AU7220398A (en) 1998-11-24
GB9708606D0 (en) 1997-06-18
EP0979307A2 (en) 2000-02-16
WO1998049341A3 (en) 1999-01-28

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 1998919324

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 09403879

Country of ref document: US

WWP Wipo information: published in national office

Ref document number: 1998919324

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase in:

Ref country code: JP

Ref document number: 1998546741

Format of ref document f/p: F

NENP Non-entry into the national phase in:

Ref country code: CA

WWW Wipo information: withdrawn in national office

Ref document number: 1998919324

Country of ref document: EP