EP1294869A2 - Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating - Google Patents

Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating

Info

Publication number
EP1294869A2
Authority
EP
European Patent Office
Prior art keywords
dna
sequence
sequencing
organism
organisms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP01944583A
Other languages
English (en)
French (fr)
Inventor
Jay M. Short
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BASF Enzymes LLC
Original Assignee
Diversa Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/594,459 (US6605449B1)
Priority claimed from US09/677,584 (US7033781B1)
Application filed by Diversa Corp
Publication of EP1294869A2
Legal status: Withdrawn

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1058: Directional evolution of libraries, e.g. evolution of libraries is achieved by mutagenesis and screening or selection of mixed population of organisms
    • C12N15/102: Mutagenizing nucleic acids
    • C12N15/1027: Mutagenizing nucleic acids by DNA shuffling, e.g. RSR, STEP, RPR
    • C12N15/63: Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
    • C12N15/79: Vectors or expression systems specially adapted for eukaryotic hosts
    • C12N15/82: Vectors or expression systems specially adapted for eukaryotic hosts for plant cells, e.g. plant artificial chromosomes (PACs)
    • C12N15/8241: Phenotypically and genetically modified plants via recombinant DNA technology
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00: Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48: Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50: Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53: Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/531: Production of immunochemical test materials
    • G01N33/532: Production of labelled immunochemicals
    • G01N33/534: Production of labelled immunochemicals with radioactive label
    • G01N33/68: Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803: General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6818: Sequencing of polypeptides

Definitions

  • This invention relates to the field of cellular and whole organism engineering. Specifically, this invention relates to a cellular transformation, directed evolution, and screening method for creating novel transgenic organisms having desirable properties. Thus in one aspect, this invention relates to a method of generating a transgenic organism, such as a microbe or a plant, having a plurality of traits that are differentially activatable.
  • This invention also relates to the field of protein engineering. Specifically, this invention relates to a directed evolution method for preparing a polynucleotide encoding a polypeptide. More specifically, this invention relates to a method of using mutagenesis to generate a novel polynucleotide encoding a novel polypeptide, which novel polypeptide is itself an improved biological molecule &/or contributes to the generation of another improved biological molecule. More specifically still, this invention relates to a method of performing both non-stochastic polynucleotide chimerization and non-stochastic site-directed point mutagenesis.
  • this invention relates to a method of generating a progeny set of chimeric polynucleotide(s) by means that are synthetic and non-stochastic, and where the design of the progeny polynucleotide(s) is derived by analysis of a parental set of polynucleotides &/or of the polypeptides correspondingly encoded by the parental polynucleotides.
  • this invention relates to a method of performing site-directed mutagenesis using means that are exhaustive, systematic, and non-stochastic.
  • this invention relates to a step of selecting from among a generated set of progeny molecules a subset comprised of particularly desirable species, including by a process termed end-selection, which subset may then be screened further.
  • This invention also relates to the step of screening a set of polynucleotides for the production of a polypeptide &/or of another expressed biological molecule having a useful property.
  • Novel biological molecules whose manufacture is taught by this invention include genes, gene pathways, and any molecules whose expression is affected thereby, including directly encoded polypeptides &/or any molecules affected by such polypeptides.
  • Said novel biological molecules include those that contain a carbohydrate, a lipid, a nucleic acid, &/or a protein component, and specific but non-limiting examples of these include antibiotics, antibodies, enzymes, and steroidal and non-steroidal hormones.
  • the present invention relates to enzymes, particularly to thermostable enzymes, and to their generation by directed evolution. More particularly, the present invention relates to thermostable enzymes which are stable at high temperatures and which have improved activity at lower temperatures.
  • the generation of an organism having but a single genetically introduced trait can also lead to the incurrence of undesirable costs, although for other reasons. It is thus appreciated that the separate production, marketing, & storage of genetically altered organisms each having a single transgenic trait can incur costs, including inventory costs, that are undesirable. For example, the storage of such organisms may require a separate bin to be used for each trait. Furthermore, the value of an organism having a single particular trait is often intimately tied to the marketability of that particular trait, and when that marketability diminishes, inventories of such organisms cannot be sold in other markets.
  • the instant invention solves these and other problems by providing a method of producing genetically altered organisms having a large number of stacked traits that are differentially activatable. Upon purchasing such a genetically altered organism (having a large number of differentially activatable stacked traits), the purchasing customer has the option of selecting and paying for particular traits among the total that can then be activated differentially.
  • One economic advantage provided by this invention is that the storage of such genetically altered organisms is simplified since, for example, one bin could be used to store a large number of traits.
  • a single organism of this type can satisfy the demands for a variety of traits; consequently, such an organism can be sold in a variety of markets.
  • this invention provides - in one specific aspect - a process comprising the step of monitoring a cell or organism at a holistic level. This serves as a way of collecting holistic - rather than isolated - information about a working cell or organism that is being subjected to a substantial amount of genetic manipulation. This invention further provides that this type of holistic monitoring can include the detection of all morphological, behavioral, and physical parameters.
  • the holistic monitoring can include the identification &/or quantification of all the genetic material contained in a working cell or organism (e.g. all nucleic acids including the entire genome, messenger RNA's, tRNA's, rRNA's, and mitochondrial nucleic acids, plasmids, phages, phagemids, viruses, as well as all episomal nucleic acids and endosymbiont nucleic acids).
  • this type of holistic monitoring can include all gene products produced by the working cell or organisms.
  • the holistic monitoring provided by this invention can include the identification &/or quantification of all molecules that are chemically at least in part protein in a working cell or organism.
  • the holistic monitoring provided by this invention can also include the identification &/or quantification of all molecules that are chemically at least in part carbohydrate in a working cell or organism.
  • the holistic monitoring provided by this invention can also include the identification &/or quantification of all molecules that are chemically at least in part proteoglycan in a working cell or organism.
  • the holistic monitoring provided by this invention can also include the identification &/or quantification of all molecules that are chemically at least in part glycoprotein in a working cell or organism.
  • the holistic monitoring provided by this invention can also include the identification &/or quantification of all molecules that are chemically at least in part nucleic acids in a working cell or organism.
  • the holistic monitoring provided by this invention can also include the identification &/or quantification of all molecules that are chemically at least in part lipids in a working cell or organism.
  • this invention provides that the ability to differentially activate a trait from among many, such as an enzyme from among many enzymes, depends on the enzyme(s) to be activated having a unique activity profile (or activity fingerprint).
  • An enzyme's activity profile includes the reaction(s) it catalyzes and its specificity.
  • an enzyme's activity profile includes its:
  • enzymes are differentially affected by exposure to varying degrees of processing (e.g. upon extraction &/or purification) and exposure (e.g. to suboptimal storage conditions). Accordingly, enzyme differences may surface after exposure to:
  • harvesting the full potential of nature's diversity can include both the step of discovery and the step of optimizing what is discovered.
  • the step of discovery allows one to mine biological molecules that have commercial utility. It is instantly appreciated that the ability to harvest the full richness of biodiversity, i.e. to mine biological molecules from a wide range of environmental conditions, is critical to the ability to discover novel molecules adapted to function under a wide variety of conditions, including extremes of conditions, such as may be found in a commercial application.
  • directed evolution, or experimentally modifying a biological molecule towards a desirable property, can be achieved by mutagenizing one or more parental molecular templates and by identifying any desirable molecules among the progeny molecules.
  • Currently available technologies in directed evolution include methods for achieving stochastic (i.e. random) mutagenesis and methods for achieving non-stochastic (non-random) mutagenesis.
  • stochastic or random mutagenesis is exemplified by a situation in which a progenitor molecular template is mutated (modified or changed) to yield a set of progeny molecules having mutation(s) that are not predetermined.
  • in a stochastic mutagenesis reaction, for example, there is not a particular predetermined product whose production is intended; rather there is an uncertainty - hence randomness - regarding the exact nature of the mutations achieved, and thus also regarding the products generated.
  • non-stochastic or non-random mutagenesis is exemplified by a situation in which a progenitor molecular template is mutated (modified or changed) to yield a progeny molecule having one or more predetermined mutations. It is appreciated that the presence of background products in some quantity is a reality in many reactions where molecular processing occurs, and the presence of these background products does not detract from the non-stochastic nature of a mutagenesis process having a predetermined product.
  • stochastic mutagenesis is manifested in processes such as error-prone PCR and stochastic shuffling, where the mutation(s) achieved are random or not predetermined.
  • non-stochastic mutagenesis is manifested in instantly disclosed processes such as gene site-saturation mutagenesis and synthetic ligation reassembly, where the exact chemical structure(s) of the intended product(s) are predetermined.
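The distinction drawn above between stochastic and non-stochastic mutagenesis can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the template sequence, the per-base mutation rate, and the function names are hypothetical, and the random routine is only a toy stand-in for a process such as error-prone PCR.

```python
import random

BASES = "ACGT"

def stochastic_point_mutations(template, rate, rng):
    """Mutate each base independently with probability `rate`; the outcome is not predetermined."""
    out = []
    for base in template:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def non_stochastic_mutation(template, position, new_base):
    """Apply one predetermined substitution at a 0-based position."""
    if template[position] == new_base:
        raise ValueError("substitution must change the base")
    return template[:position] + new_base + template[position + 1:]

template = "ATGGCTAGCAAAGGAGAA"  # hypothetical progenitor sequence
run_a = stochastic_point_mutations(template, 0.2, random.Random(1))
run_b = stochastic_point_mutations(template, 0.2, random.Random(2))
designed = non_stochastic_mutation(template, 3, "A")
print(run_a, run_b, designed)
```

Repeating the stochastic call with a different random state may give a different progeny molecule each time, whereas the designed substitution yields the same product on every repetition.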
  • Natural evolution has been a springboard for directed or experimental evolution, serving both as a reservoir of methods to be mimicked and of molecular templates to be mutagenized. It is appreciated that, despite its intrinsic process-related limitations (in the types of favored &/or allowed mutagenesis processes) and in its speed, natural evolution has had the advantage of having been in process for millions of years and throughout a wide diversity of environments. Accordingly, natural evolution (molecular mutagenesis and selection in nature) has resulted in the generation of a wealth of biological compounds that have shown usefulness in certain commercial applications.
  • nucleic acids do not reach close enough proximity to each other in an operative environment to undergo chimerization or incorporation or other types of transfers from one species to another.
  • the chimerization of nucleic acids from these 2 species is likewise unlikely, with parasites common to the two species serving as an example of a very slow passageway for inter-molecular encounters and exchanges of DNA.
  • the generation of a molecule causing self-toxicity or self-lethality or sexual sterility is avoided in nature.
  • the propagation of a molecule having no particular immediate benefit to an organism is prone to vanish in subsequent generations of the organism. Furthermore, e.g., there is no selection pressure for improving the performance of a molecule under conditions other than those to which it is exposed in its endogenous environment; e.g. a cytoplasmic molecule is not likely to acquire functional features extending beyond what is required of it in the cytoplasm. Furthermore still, the propagation of a biological molecule is susceptible to any global detrimental effects - whether caused by itself or not - on its ecosystem. These and other characteristics greatly limit the types of mutations that can be propagated in nature.
  • directed (or experimental) evolution - particularly as provided herein - can be performed much more rapidly and can be directed in a more streamlined manner at evolving a predetermined molecular property that is commercially desirable where nature does not provide one &/or is not likely to provide.
  • the directed evolution invention provided herein can provide more wide-ranging possibilities in the types of steps that can be used in mutagenesis and selection processes. Accordingly, using templates harvested from nature, the instant directed evolution invention provides more wide-ranging possibilities in the types of progeny molecules that can be generated and in the speed at which they can be generated than often nature itself might be expected to in the same length of time.
  • the instantly disclosed directed evolution methods can be applied iteratively to produce a lineage of progeny molecules (e.g. comprising successive sets of progeny molecules) that would not likely be propagated (i.e., generated &/or selected for) in nature, but that could lead to the generation of a desirable downstream mutagenesis product that is not achievable by natural evolution.
  • Mutagenesis has been attempted in the past on many occasions, but by methods that are inadequate for the purpose of this invention.
  • previously described non-stochastic methods have been serviceable in the generation of only very small sets of progeny molecules (comprised often of merely a solitary progeny molecule).
  • a chimeric gene has been made by joining 2 polynucleotide fragments using compatible sticky ends generated by restriction enzyme(s), where each fragment is derived from a separate progenitor (or parental) molecule.
  • Another example might be the mutagenesis of a single codon position (i.e. to achieve a codon substitution, addition, or deletion) in a parental polynucleotide to generate a single progeny polynucleotide encoding for a single site-mutagenized polypeptide.
  • stochastic methods have been used to achieve larger numbers of point mutations and/or chimerizations than non-stochastic methods; for this reason, stochastic methods have comprised the predominant approach for generating a set of progeny molecules that can be subjected to screening, and amongst which a desirable molecular species might hopefully be found.
  • a major drawback of these approaches is that - because of their stochastic nature - there is a randomness to the exact components in each set of progeny molecules that is produced. Accordingly, the experimentalist typically has little or no idea what exact progeny molecular species are represented in a particular reaction vessel prior to their generation. Thus, when a stochastic procedure is repeated (e.g.
  • the instant invention addresses these problems by providing non-stochastic means for comprehensively and exhaustively generating all possible point mutations in a parental template.
  • the instant invention further provides means for exhaustively generating all possible chimerizations within a group of chimerizations.
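Exhaustive chimerization can be sketched as enumerating every ordered combination of building blocks drawn from a set of parental sequences. The sketch below is an illustration under stated assumptions, not the disclosed synthetic ligation reassembly procedure: the parental sequences, the shared demarcation points that split each parent into three segments, and the function name `all_chimeras` are all hypothetical.

```python
from itertools import product

def all_chimeras(parents_segments):
    """Return every ordered combination that takes segment k from any parent."""
    n_segments = len(parents_segments[0])
    if any(len(p) != n_segments for p in parents_segments):
        raise ValueError("every parent must be split into the same number of segments")
    # Distinct building blocks available at each segment position.
    choices = [sorted({p[k] for p in parents_segments}) for k in range(n_segments)]
    return ["".join(combo) for combo in product(*choices)]

# Three hypothetical parents, each pre-split at the same two demarcation points.
parents = [
    ["ATG", "GCT", "AAA"],
    ["ATG", "GCC", "AAG"],
    ["ATG", "GGT", "AAA"],
]
chimeras = all_chimeras(parents)
print(len(chimeras))  # 1 * 3 * 2 = 6 distinct products, parents included
```

Because the full product is enumerated, the composition of the progeny set is known before any molecule is made, which is the property that separates this approach from stochastic shuffling.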
  • Site-directed mutagenesis technologies, such as sloppy or low-fidelity PCR, are ineffective for systematically achieving at each position (site) along a polypeptide sequence the full (saturated) range of possible mutations (i.e. all possible amino acid substitutions).
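The "full (saturated) range of possible mutations" at each position can be made concrete by enumerating every single amino acid substitution of a parental polypeptide. This is a minimal sketch at the polypeptide level only; the parental sequence and the function name `saturation_variants` are hypothetical, and the codon-level design of the corresponding polynucleotides is not modelled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes

def saturation_variants(parent):
    """Yield (position, original, replacement, variant) for every single substitution."""
    for i, original in enumerate(parent):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                variant = parent[:i] + replacement + parent[i + 1:]
                yield i + 1, original, replacement, variant  # positions are 1-based

parent = "MKTAY"  # hypothetical 5-residue parental polypeptide
variants = list(saturation_variants(parent))
print(len(variants))  # 19 substitutions x 5 positions = 95
```

The size of the set grows linearly with sequence length (19 variants per position), so the complete set of single-site variants stays enumerable even for long polypeptides.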
  • IC: information content
  • Information density is the IC per unit length of a sequence. Active sites of enzymes tend to have a high information density. By contrast, flexible linkers of information in enzymes have a low information density.
  • Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. In a mixture of fragments of unknown sequence, error-prone PCR can be used to mutagenize the mixture.
  • the published error-prone PCR protocols suffer from a low processivity of the polymerase. Therefore, the protocol is unable to result in the random mutagenesis of an average-sized gene. This inability limits the practical application of error-prone PCR.
  • Some computer simulations have suggested that point mutagenesis alone may often be too gradual to allow the large-scale block changes that are required for continued and dramatic sequence evolution.
  • in oligonucleotide-directed mutagenesis, a short sequence is replaced with a synthetically mutagenized oligonucleotide. This approach does not generate combinations of distant mutations and is thus not combinatorial.
  • the limited library size relative to the vast sequence length means that many rounds of selection are unavoidable for protein optimization.
  • Mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round followed by grouping them into families, arbitrarily choosing a single family, and reducing it to a consensus motif. Such a motif is re-synthesized and reinserted into a single gene followed by additional selection. This step process constitutes a statistical bottleneck, is labor intensive, and is not practical for many rounds of mutagenesis.
  • Error-prone PCR and oligonucleotide-directed mutagenesis are thus useful for single cycles of sequence fine-tuning, but rapidly become too limiting when they are applied for multiple cycles.
  • in cassette mutagenesis, a sequence block of a single template is typically replaced by a (partially) randomized sequence. Therefore, the maximum information content that can be obtained is statistically limited by the number of random sequences (i.e., library size). This eliminates other sequence families which are not currently best, but which may have greater long term potential.
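The library-size limit on cassette mutagenesis can be seen from simple arithmetic: a block of n fully randomized amino acid positions spans 20^n possible sequences, while a library can contain at most as many distinct members as its size. The sketch below works this out; the library size of 10^7 clones is an assumed figure chosen only for illustration, and the function names are hypothetical.

```python
def sequence_space(n_positions, alphabet_size=20):
    """Number of distinct sequences for a fully randomized block."""
    return alphabet_size ** n_positions

def max_fraction_sampled(n_positions, library_size, alphabet_size=20):
    """Upper bound on the fraction of sequence space a library can contain."""
    return min(1.0, library_size / sequence_space(n_positions, alphabet_size))

library_size = 10**7  # hypothetical number of clones in one library
for n in (3, 6, 10):
    print(n, sequence_space(n), max_fraction_sampled(n, library_size))
```

Under these assumptions a 3-residue block can in principle be covered completely, a 6-residue block at most to about 16%, and a 10-residue block to roughly one part in a million, so most of the possible sequence families are never sampled.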
  • mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round. Thus, such an approach is tedious and impractical for many rounds of mutagenesis.
  • error-prone PCR and cassette mutagenesis are best suited, and have been widely used, for fine-tuning areas of comparatively low information content.
  • One apparent exception is the selection of an RNA ligase ribozyme from a random library using many rounds of amplification by error-prone PCR and selection.
  • This invention relates generally to the field of cellular and whole organism engineering. Specifically, this invention relates to a cellular transformation, directed evolution, and screening method for creating novel transgenic organisms having desirable properties. Thus in one aspect, this invention relates to a method of generating a transgenic organism, such as a microbe or a plant, having a plurality of traits that are differentially activatable.
  • this invention is directed to a method of producing an improved organism having a desirable trait by: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, and c) detecting the presence of said improved organism.
  • This invention provides that any of steps a), b), and c) can be further repeated in any particular order and any number of times; accordingly, this invention specifically provides methods comprised of any iterative combination of steps a), b), and c), with a number of iterations.
  • this invention is directed to a method of producing an improved organism having a desirable trait by: a) obtaining an initial population of organisms, which can be a clonal population or otherwise, b) generating a set of mutagenized organisms each having at least one genetic mutation, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, c) detecting the manifestation of at least two genetic mutations, and d) introducing at least two detected genetic mutations into one organism.
  • this invention provides that any of steps a), b), c), and d) can be further repeated in any particular order and any number of times; accordingly, this invention specifically provides methods comprised of any iterative combination of steps a), b), c), and d), where the total number of iterations can be from one up to one million, including specifically every integer value in between.
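Steps a) through d) and their repetition can be sketched as a simple loop. This is a toy model under loudly stated assumptions, not the claimed method: organisms are represented as sets of mutation labels, the `effect` table stands in for whatever screening assay detects a mutation's manifestation, and all function names are hypothetical.

```python
import random

def mutagenize(population, candidate_mutations, rng):
    """Step b): return mutants, each carrying one additional mutation."""
    return [org | {rng.choice(candidate_mutations)} for org in population]

def detect(mutants, effect, threshold=0.0):
    """Step c): list mutations whose hypothetical measured effect beats the threshold."""
    return sorted({m for org in mutants for m in org if effect[m] > threshold})

def combine(detected):
    """Step d): put at least two detected mutations into one organism, else None."""
    return frozenset(detected) if len(detected) >= 2 else None

def evolve(initial_population, candidate_mutations, effect, rounds, seed=0):
    """Repeat steps b) to d) for a fixed number of rounds."""
    rng = random.Random(seed)
    population = [frozenset(org) for org in initial_population]  # step a)
    best = None
    for _ in range(rounds):
        mutants = mutagenize(population, candidate_mutations, rng)
        combined = combine(detect(mutants, effect))
        if combined is not None:
            best = combined
            # The next round starts from the organism carrying the combined mutations.
            population = [combined] * len(population)
    return best

effect = {"m1": 0.5, "m2": -0.2, "m3": 0.1, "m4": 0.3}  # hypothetical screening results
start = [set() for _ in range(8)]  # eight unmutated organisms
print(evolve(start, sorted(effect), effect, rounds=3))
```

The fixed order b), c), d) is a simplification; the text allows the steps to be repeated in any order and any number of times.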
  • the step of b) generating a second set of mutagenized organisms is comprised of generating a plurality of organisms, each of which organisms has a particular transgenic mutation.
  • generating a set of mutagenized organisms having genetic mutations can be achieved by any means known in the art to mutagenize, including any radiation known to mutagenize, such as ionizing and ultraviolet radiation.
  • Further examples of serviceable mutagenizing methods include site-saturation mutagenesis, transposon-based methods, and homologous recombination.
  • "Combining" means incorporating a plurality of different genetic mutations in the genetic makeup (e.g. the genome) of the same organism; and methods to achieve this "combining" step include sexual recombination, homologous recombination, and transposon-based methods.
  • an "initial population of organisms" means a "working population of organisms", which refers simply to a population of organisms with which one is working, and which is comprised of at least one organism.
  • An "initial population of organisms" can be a clonal population or otherwise.
  • an "initial population of organisms" may be a population of multicellular organisms or of unicellular organisms or of both.
  • An "initial population of organisms" may be comprised of unicellular organisms or multicellular organisms or both.
  • An "initial population of organisms" may be comprised of prokaryotic organisms or eukaryotic organisms or both.
  • This invention provides that an "initial population of organisms" is comprised of at least one organism, and preferred embodiments include at least that .
  • organism: any biological form or thing that is capable of self-replication or replication in a host.
  • organisms include the following kinds of organisms (which kinds are not necessarily mutually exclusive): animals, plants, insects, cyanobacteria, microorganisms, fungi, bacteria, eukaryotes, prokaryotes, mycoplasma, viral organisms (including DNA viruses, RNA viruses), and prions.
  • Non-limiting particularly preferred examples of kinds of "organisms" also include Archaea (archaebacteria) and Bacteria (eubacteria).
  • Non-limiting examples of Archaea (archaebacteria) include Crenarchaeota, Euryarchaeota, and Korarchaeota.
  • Bacteria include Aquificales, CFB/Green sulfur bacteria group, Chlamydiales/Verrucomicrobia group, Chrysiogenes group, Coprothermobacter group, Cyanobacteria & chloroplasts, Cytophaga/Flexibacter/Bacteroides group, Dictyoglomus group, Fibrobacter/Acidobacteria group, Firmicutes, Flexistipes group, Fusobacteria, Green non-sulfur bacteria, Nitrospira group, Planctomycetales, Proteobacteria, Spirochaetales, Synergistes group, Thermodesulfobacterium group, Thermotogales, Thermus/Deinococcus group.
  • particularly preferred kinds of organisms include Aquifex, Aspergillus, Bacillus, Clostridium, E. coli, Lactobacillus, Mycobacterium, Pseudomonas, Streptomyces, and Thermotoga.
  • particularly preferred organisms include cultivated organisms such as CHO, VERO, BHK, HeLa, COS, MDCK, Jurkat, HEK-293, and WI38.
  • Particularly preferred non-limiting examples of organisms further include host organisms that are serviceable for the expression of recombinant molecules.
  • Organisms further include primary cultures (e.g. cells from harvested mammalian tissues), immortalized cells, all cultivated and culturable cells and multicellular organisms, and all uncultivated and unculturable cells and multicellular organisms.
  • genomic information is useful for performing the claimed methods; thus, this invention provides the following as preferred but non-limiting examples of organisms that are particularly serviceable for this invention, because there is a significant amount of - if not complete - genomic sequence information (in terms of primary sequence &/or annotation) for these organisms: Human, Insect (e.g. Drosophila melanogaster), Higher plants (e.g. Arabidopsis thaliana), Protozoan (e.g. Plasmodium falciparum), Nematode (e.g. Caenorhabditis elegans), Fungi (e.g. Saccharomyces cerevisiae), Proteobacteria gamma subdivision (e.g. Escherichia coli K-12, Haemophilus influenzae Rd, Xylella fastidiosa 9a5c, Vibrio cholerae El Tor N16961, Pseudomonas aeruginosa PAO1, Buchnera sp. APS), Proteobacteria beta subdivision (e.g. Neisseria meningitidis MC58 (serogroup B), Neisseria meningitidis Z2491 (serogroup A)), Proteobacteria other subdivisions (e.g.
  • Chlamydia trachomatis serovar D, Chlamydia muridarum (Chlamydia trachomatis MoPn), Chlamydia pneumoniae CWL029, Chlamydia pneumoniae AR39, Chlamydia pneumoniae J138), Spirochete (e.g. Borrelia burgdorferi B31, Treponema pallidum), Cyanobacteria (e.g. Synechocystis sp. PCC6803), Radioresistant bacteria (e.g. Deinococcus radiodurans R1), Hyperthermophilic bacteria (e.g. Aquifex aeolicus VF5, Thermotoga maritima MSB8), and Archaea (e.g. Methanococcus jannaschii, Methanobacterium thermoautotrophicum deltaH, Archaeoglobus fulgidus, Pyrococcus horikoshii OT3, Pyrococcus abyssi, Aeropyrum pernix K1).
  • Non-limiting particularly prefened examples of kinds of plant "organisms” include those listed in Table 1.
  • Non-limiting examples of plant organisms and sources of transgenic molecules e.g. nucleic acids & nucleic acid products
  • the meaning of "generating a set of mutagenized organisms having genetic mutations” includes the steps of substituting, deleting, as well as introducing a nucleotide sequence into an organism; and this invention provides that a nucleotide sequence that is serviceable for this purpose may be single-stranded or double-stranded, and that its length may be from one nucleotide up to 10,000,000,000 nucleotides, including specifically every integer value in between.
  • a mutation in an organism includes any alteration in the structure of one or more molecules that encode the organism.
  • These molecules include nucleic acid, DNA, RNA, prionic molecules, and may be exemplified by a variety of molecules in an organism such as a DNA that is genomic, episomal, or nucleic, or by a nucleic acid that is vectoral (e.g. viral, cosmid, phage, phagemid).
  • a "set of substantial genetic mutations” is preferably a disruption (e.g. a functional knock-out) of at least about 15 to about 150,000 genomic locations or nucleotide sequences (e.g. genes, promoters, regulatory sequences, codons etc.), including specifically every integer value in between.
  • a "set of substantial genetic mutations” is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 15 to about 150,000 genes, including specifically every integer value in between.
  • a "set of substantial genetic mutations” is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 15 to about 150,000 gene products &/or phenotypes &/or traits, including specifically every integer value in between.
  • a "set of substantial genetic mutations" with respect to an organism (or type of organism) is preferably a disruption (e.g. a functional knock-out) of at least about 1% to about 100% of genomic locations or nucleotide sequences (e.g. genes, promoters, regulatory sequences, codons etc.) in the organism (or type of organism), including specifically percentages of every integer value in between.
  • a "set of substantial genetic mutations" is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 1% to about 100% of the gene products &/or phenotypes &/or traits of an organism (or type of organism), including specifically every integer value in between.
  • a "set of substantial genetic mutations” is preferably an introduction or deletion of at least about 15 to 150,000 genes, promoters, or other nucleotide sequences (where each sequence is from 1 base to 10,000,000 bases), including specifically every integer value in between.
  • gene pathways e.g. that ultimately lead to the production of small molecules
  • knocking-out, altering expression level, and altering expression pattern can be achieved, by non-limiting exemplification, by mutagenizing a nucleotide sequence corresponding to a gene as well as a corresponding promoter that affects the expression of the gene.
  • a "mutagenized organism” includes any organism that has been altered by a genetic mutation.
  • a “genetic mutation” can be, by way of non-limiting and non-mutually exclusive exemplification, any change in the nucleotide sequence (DNA or RNA) with respect to genomic, extra-genomic, episomal, mitochondrial, and any nucleotide sequence associated with (e.g. contained within or considered part of) an organism.
  • detecting the manifestation of a "genetic mutation” means "detecting the manifestation of a detectable parameter", including but not limited to a change in the genomic sequence. Accordingly, this invention provides that a step of sequencing (&/or annotating) an organism's genomic DNA is necessary for some methods of this invention, and exemplary but non-limiting aspects of this sequencing (&/or annotating) step are provided herein.
  • a detectable “trait”, as used herein, is any detectable parameter associated with the organism. Accordingly, such a detectable “parameter” includes, by way of non-limiting exemplification, any detectable “nucleotide knock-in", any detectable “nucleotide knock-out", any detectable “phenotype”, and any detectable “genotype”.
  • a “trait” includes any substance produced or not produced by the organism. Accordingly, a “trait” includes viability or non-viability, behavior, growth rate, size, and morphology.
  • Trait includes increased (or alternatively decreased) expression of a gene product or gene pathway product.
  • Trait also includes small molecule production (including vitamins, antibiotics), herbicide resistance, drought resistance, pest resistance, and production of any recombinant biomolecule (e.g. vaccines, enzymes, protein therapeutics, chiral enzymes). Additional examples of serviceable traits for this invention are shown in Table 2.
  • Non-limiting examples of serviceable genes, gene products, phenotypes, or traits according to the methods of this invention e.g. knockouts, knockins, increased or decreased expression level, increased or decreased expression pattern
  • Acetohydroxyacid synthase variant 62 Cinnamate 4-hydroxylase
  • Acetolactate synthase 63 Cinnamate 4-hydroxylase knockout
  • ACP acyl-ACP thioesterase 65 Coat protein knockout
  • Amylase 80 Delta-12 saturase
  • Antifungal protein 83 Delta-15 desaturase knockout
  • Antiviral protein 86 Deoxyhypusine synthase (DHS)
  • Attacin E 88 Diacylglycerol acetyl transferase
  • producing an organism having a desirable trait includes producing an organism that is altered with respect to an organ or a part of an organ but not necessarily altered anywhere else.
  • by "detectable parameter" is meant any detectable parameter associated with an organism under a set of conditions.
  • detectable parameters include the ability to produce a substance, the ability to not produce a substance, an altered pattern of (such as an increased or a decreased) ability to produce a substance, viability, non-viability, behaviour, growth rate, size, morphology or morphological characteristic,
  • this invention is directed to a method of producing an organism having a desirable trait or a desirable improvement in a trait by: a) obtaining an initial population of organisms comprised of at least one starting organism, b) mutagenizing the population such that mutations occur throughout a substantial part of the genome of at least one initial organism, c) selecting at least one mutagenized organism having a desirable trait or a desirable improvement in a trait, and d) optionally repeating the method by subjecting one or more mutagenized organisms to a repetition of the method.
  • a mutagenized organism having a desirable trait or a desirable improvement in a trait can be referred to as an "up-mutant", and the associated mutation(s) contained in an up-mutant organism can be referred to as up-mutation(s).
  • step c) is comprised of selecting at least two different mutagenized organisms, each having a different mutagenized genome, and the method of producing an organism having a desirable trait or a desirable improvement in a trait is comprised of a) obtaining a starting population of organisms comprised of at least one starting organism, b) mutagenizing the population such that mutations occur throughout a substantial part of the genome of at least one starting organism, c) selecting at least two mutagenized organisms having a desirable trait or a desirable improvement in a trait, d) creating combinations of the mutations of the two or more mutagenized organisms, e) selecting at least one mutagenized organism having a desirable trait or a desirable improvement in a trait, and f) optionally repeating the method by subjecting one or more mutagenized organisms to a repetition of the method.
  • the method is repeated.
  • an up-mutant organism can serve as a starting organism for the above method.
  • an up mutant organism having a combination of two or more up-mutations in its genome can serve as a starting organism for the above method.
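The mutagenize, select, combine, and repeat cycle described above can be sketched as a toy simulation. This is only an illustrative model under stated assumptions: the genome is a tuple of 0/1 loci, and the functions `mutagenize`, `combine`, `score`, and `evolve` are hypothetical names that do not come from the source. Real trait selection would use a laboratory screen, not a sum of integers.

```python
import random

def mutagenize(genome, rng, rate=0.3):
    """Return a copy of the genome with each locus flipped with probability `rate`."""
    return tuple(1 - g if rng.random() < rate else g for g in genome)

def combine(a, b, rng):
    """Combine the mutations of two genomes by taking each locus from either parent."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def score(genome):
    """Hypothetical trait measurement: number of loci in the favourable state (1)."""
    return sum(genome)

def evolve(start, rounds=5, library_size=50, seed=0):
    """Repeat steps b) to e): mutagenize, select two, combine, select one."""
    rng = random.Random(seed)
    best = start
    history = [score(best)]
    for _ in range(rounds):
        # b) mutagenize the population across the whole (toy) genome
        library = [mutagenize(best, rng) for _ in range(library_size)]
        # c) select two library members; the two highest scorers stand in for up-mutants
        top_two = sorted(library, key=score, reverse=True)[:2]
        # d) create combinations of the mutations of the selected organisms
        combos = [combine(top_two[0], top_two[1], rng) for _ in range(library_size)]
        # e) select one organism; keeping the previous best means the score never drops
        best = max(combos + top_two + [best], key=score)
        history.append(score(best))
    return best, history

start = (0,) * 20  # hypothetical starting organism with no favourable loci
best, history = evolve(start)
print(history)
```

Because the previous best organism is always kept as a candidate, the recorded score cannot decrease from one round to the next; this mirrors using an up-mutant as the starting organism for the next repetition.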
  • this invention is directed to a method of producing an organism having a desirable trait or a desirable improvement in a trait by: a) obtaining a starting population of organisms comprised of at least one starting organism, b) mutagenizing the population such that mutations occur throughout a substantial part of the genome of at least one starting organism, c) selecting at least one mutagenized organism having a desirable trait or a desirable improvement in a trait, and d) optionally repeating the method by subjecting one or more mutagenized organisms to a repetition of the method.
  • Mutagenizing a starting population such that mutations occur throughout a substantial part of the genome of at least one starting organism refers to mutagenizing at least approximately 1% of the genes of a genome, or at least approximately 10% of the genes of a genome, or at least approximately 20% of the genes of a genome, or at least approximately 30% of the genes of a genome, or at least approximately 40% of the genes of a genome, or at least approximately 50% of the genes of a genome, or at least approximately 60% of the genes of a genome, or at least approximately 70% of the genes of a genome, or at least approximately 80% of the genes of a genome, or at least approximately 90% of the genes of a genome, or at least approximately 95% of the genes of a genome, or at least approximately 98% of the genes of a genome.
  • this invention provides a method of producing an organism having a desirable trait or a desirable improvement in a trait by: a) obtaining sequence information of a genome; b) annotating the genomic sequence obtained; c) mutagenizing a substantial part of the genome; d) selecting at least one mutagenized genome having a desirable trait or a desirable improvement in a trait; and e) optionally repeating the method by subjecting one or more mutagenized genomes to a repetition of the method.
  • this invention provides a process comprised of: 1.) Subjecting a working cell or organism to holistic monitoring (which can include the detection and/or measurement of all detectable functions and physical parameters). Examples of such parameters include morphology, behavior, growth, responsiveness to stimuli (e.g., antibiotics, different environment, etc.). Additional examples include all measurable molecules, including molecules that are chemically at least in part nucleic acids, proteins, carbohydrates, proteoglycans, glycoproteins, or lipids.
  • performing holistic monitoring is comprised of using a microarray-based method.
  • performing holistic monitoring is comprised of sequencing a substantial portion of the genome, for example at least approximately 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 98% of the genome.
  • This invention provides that molecules serviceable for introducing transgenic traits into a plant include all known genes and nucleic acids.
  • this invention specifically names any number &/or combination of genes listed herein or listed in any reference incorporated herein by reference.
  • this invention specifically names any number &/or combination of genes & gene pathways listed herein as well as in any reference incorporated by reference herein.
  • molecules serviceable as detectable parameters include any molecule, any enzyme, substrate thereof, product thereof, and any gene or gene pathway listed herein including in any figure or table herein as well as in any reference incorporated by reference herein.
  • This invention also relates generally to the field of nucleic acid engineering and correspondingly encoded recombinant protein engineering. More particularly, the invention relates to the directed evolution of nucleic acids and screening of clones containing the evolved nucleic acids for resultant activity(ies) of interest, such as nucleic acid activity(ies) &/or specified protein, particularly enzyme, activity(ies) of interest.
  • Mutagenized molecules provided by this invention may have chimeric molecules and molecules with point mutations, including biological molecules that contain a carbohydrate, a lipid, a nucleic acid, &/or a protein component, and specific but non-limiting examples of these include antibiotics, antibodies, enzymes, and steroidal and non-steroidal hormones.
  • This invention relates generally to a method of: 1) preparing a progeny generation of molecule(s) (including a molecule that is comprised of a polynucleotide sequence, a molecule that is comprised of a polypeptide sequence, and a molecule that is comprised in part of a polynucleotide sequence and in part of a polypeptide sequence), that is mutagenized to achieve at least one point mutation, addition, deletion, &/or chimerization, from one or more ancestral or parental generation template(s); 2) screening the progeny generation molecule(s) - preferably using a high throughput method - for at least one property of interest (such as an improvement in an enzyme activity or an increase in stability or a novel chemotherapeutic effect); 3) optionally obtaining &/or cataloguing structural &/or functional information regarding the parental &/or progeny generation molecules; and 4) optionally repeating any of steps 1) to 3).
  • a progeny generation of molecule(s) including
  • amino acid site-saturation mutagenesis one such mutant polypeptide for each of the 19 naturally encoded polypeptide-forming alpha-amino acid substitutions at each and every amino acid position along the polypeptide.
  • this approach is also serviceable for generating mutants containing - in addition to &/or in combination with the 20 naturally encoded polypeptide-forming alpha-amino acids - other rare &/or not naturally-encoded amino acids and amino acid derivatives.
  • this approach is also serviceable for generating mutants by the use of - in addition to &/or in combination with natural or unaltered codon recognition systems of suitable hosts - altered, mutagenized, &/or designer codon recognition systems (such as in a host cell with one or more altered tRNA molecules).
  • this invention relates to recombination and more specifically to a method for preparing polynucleotides encoding a polypeptide by a method of in vivo re-assortment of polynucleotide sequences containing regions of partial homology, assembling the polynucleotides to form at least one polynucleotide and screening the polynucleotides for the production of polypeptide(s) having a useful property.
  • this invention is serviceable for analyzing and cataloguing - with respect to any molecular property (e.g. an enzymatic activity) or combination of properties allowed by current technology - the effects of any mutational change achieved (including particularly saturation mutagenesis).
  • a comprehensive method for determining the effect of changing each amino acid in a parental polypeptide into each of at least 19 possible substitutions. This allows each amino acid in a parental polypeptide to be characterized and catalogued according to its spectrum of potential effects on a measurable property of the polypeptide.
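The size of an amino acid site-saturation set can be illustrated with a short enumeration. This is a minimal sketch, assuming a polypeptide is represented as a string of one-letter codes; the function name `site_saturation_mutants` and the three-residue parent sequence are hypothetical and chosen only for illustration.

```python
# The twenty naturally encoded amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def site_saturation_mutants(parent):
    """Yield (position, original, substitute, mutant) for every
    single-residue substitution of the parental polypeptide."""
    for index, original in enumerate(parent):
        for substitute in AMINO_ACIDS:
            if substitute == original:
                continue  # skip the parental residue, leaving 19 substitutions
            mutant = parent[:index] + substitute + parent[index + 1:]
            yield index + 1, original, substitute, mutant  # 1-based position

parent = "MKT"  # hypothetical three-residue parental polypeptide
mutants = list(site_saturation_mutants(parent))
print(len(mutants))  # 19 substitutions x 3 positions = 57
```

Each tuple could then be paired with a measured property, so that every position is catalogued by the effects of its 19 substitutions.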
  • the method of the present invention utilizes the natural property of cells to recombine molecules and/or to mediate reductive processes that reduce the complexity of sequences and extent of repeated or consecutive sequences possessing regions of homology.
  • a method for introducing polynucleotides into a suitable host cell and growing the host cell under conditions that produce a hybrid polynucleotide is provided, in accordance with one aspect of the invention.
  • the invention provides a method for screening for biologically active hybrid polypeptides encoded by hybrid polynucleotides.
  • the present method allows for the identification of biologically active hybrid polypeptides with enhanced biological activities.
  • this invention relates to a method of discovering which phenotype corresponds to a gene by disrupting every gene in the organism.
  • this invention provides a method for determining a gene that alters a characteristic of an organism, comprising: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, c) detecting the presence of an organism having an altered trait, and d) determining the nucleotide sequence of a gene that has been mutagenized in the organism having the altered trait.
  • this invention relates to a method of improving a trait in an organism by functionally knocking out a particular gene in the organism, and then transferring a library of genes, which only vary from the wild-type at one codon position, into the organism.
  • this invention provides a method for producing an organism with an improved trait, comprising: a) functionally knocking out an endogenous gene in a substantially clonal population of organisms; b) transferring a set of altered genes into the clonal population of organisms, wherein each altered gene differs from the endogenous gene at only one codon; c) detecting a mutagenized organism having an improved trait; and d) determining the nucleotide sequence of a gene that has been transferred into the detected organism.
  • Figure 1 shows the activity of the enzyme exonuclease
  • Figure 2 illustrates a method of generating a double-stranded nucleic acid building block with two overhangs using a polymerase-based amplification reaction (e.g., PCR).
  • a first polymerase-based amplification reaction using a first set of primers, F2 and R1, is used to generate a blunt-ended product (labeled Reaction 1, Product 1), which is essentially identical to Product A.
  • a second polymerase-based amplification reaction using a second set of primers, F1 and R2, is used to generate a blunt-ended product (labeled Reaction 2, Product 2), which is essentially identical to Product B.
  • the product with the 3' overhangs is selected for by nuclease-based degradation of the other 3 products using a 3' acting exonuclease, such as exonuclease III.
  • Alternate primers are shown in parentheses to illustrate that serviceable primers may overlap, and additionally that serviceable primers may be of different lengths, as shown.
  • FIGURE 3 Unique Overhangs And Unique Couplings.
  • Figure 3 illustrates the point that the number of unique overhangs of each size (e.g. the total number of unique overhangs composed of 1 or 2 or 3, etc. nucleotides) exceeds the number of unique couplings that can result from the use of all the unique overhangs of that size. For example, there are 4 unique 3' overhangs composed of a single nucleotide, and 4 unique 5' overhangs composed of a single nucleotide. Yet the total number of unique couplings that can be made using all the 8 unique single-nucleotide 3' overhangs and single-nucleotide 5' overhangs is 4.
  • FIGURE 4 Unique Overall Assembly Order Achieved by Sequentially Coupling the Building Blocks
  • Figure 4 illustrates the fact that in order to assemble a total of "n" nucleic acid building blocks, "n-1" couplings are needed. Yet it is sometimes the case that the number of unique couplings available for use is fewer than the "n-1" value. Under these, and other, circumstances a stringent non-stochastic overall assembly order can still be achieved by performing the assembly process in sequential steps. In this example, 2 sequential steps are used to achieve a designed overall assembly order for five nucleic acid building blocks. In this illustration the designed overall assembly order for the five nucleic acid building blocks is: 5'-(#1-#2-#3-#4-#5)-3', where #1 represents building block number 1, etc.
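The relationship between building blocks, couplings, and sequential steps can be sketched numerically. This is a simplified model and not the procedure of Figure 4 itself: it assumes that each reaction may use every available unique coupling at most once (so it joins at most one more fragment than there are couplings), and that separate reactions within the same step may reuse the same couplings. The function names are hypothetical.

```python
import math

def couplings_needed(n_blocks):
    """Joining n building blocks end to end requires n - 1 couplings."""
    return n_blocks - 1

def sequential_steps(n_blocks, unique_couplings):
    """Estimate the number of sequential assembly steps under the assumed model:
    one reaction joins at most (unique_couplings + 1) fragments in a fixed order."""
    if n_blocks < 1 or unique_couplings < 1:
        raise ValueError("need at least one block and one unique coupling")
    fragments = n_blocks
    steps = 0
    while fragments > 1:
        # Split the fragments into groups; each group becomes one joined product.
        fragments = math.ceil(fragments / (unique_couplings + 1))
        steps += 1
    return steps

print(couplings_needed(5))       # 4 couplings for five building blocks
print(sequential_steps(5, 4))    # enough unique couplings: a single step
print(sequential_steps(5, 2))    # only two unique couplings: two steps
```

Under this model, five building blocks with only two unique couplings available need two sequential steps, which matches the count given for the example.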
  • FIGURE 5 Unique Couplings Available Using a Two-Nucleotide 3' Overhang.
  • Figure 5 further illustrates the point that the number of unique overhangs of each size (here, e.g. the total number of unique overhangs composed of 2 nucleotides) exceeds the number of unique couplings that can result from the use of all the unique overhangs of that size. For example, there are 16 unique 3' overhangs composed of two nucleotides, and another 16 unique 5' overhangs composed of two nucleotides, for a total of 32 as shown. Yet the total number of couplings that are unique and not self-binding that can be made using all the 32 unique double-nucleotide 3' overhangs and double-nucleotide 5' overhangs is 12.
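The overhang and coupling counts quoted for Figures 3 and 5 can be reproduced by enumeration. This sketch assumes that an overhang couples only with the overhang of the same polarity (3' with 3', 5' with 5') whose sequence is its reverse complement, and that an overhang equal to its own reverse complement is "self-binding"; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def count_couplings(length):
    """For one overhang polarity, return (number of unique overhangs,
    number of non-self-binding couplings, number of self-binding overhangs)."""
    overhangs = ["".join(p) for p in product("ACGT", repeat=length)]
    pairs = set()
    palindromes = set()
    for overhang in overhangs:
        partner = reverse_complement(overhang)
        if partner == overhang:
            palindromes.add(overhang)  # would ligate to another copy of itself
        else:
            pairs.add(frozenset((overhang, partner)))  # unordered coupling
    return len(overhangs), len(pairs), len(palindromes)

for length in (1, 2):
    n, non_self, self_binding = count_couplings(length)
    # Doubling covers both the 3' and the 5' overhang sets.
    print(length, 2 * n, 2 * non_self, 2 * self_binding)
```

For single-nucleotide overhangs this gives 8 overhangs and 4 couplings; for two-nucleotide overhangs it gives 32 overhangs and 12 couplings that are not self-binding, in agreement with the figures cited above.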
  • Figure 6 Generation of an Exhaustive Set of Chimeric Combinations by Synthetic Ligation Reassembly.
  • Figure 6 showcases the power of this invention in its ability to generate exhaustively and systematically all possible combinations of the nucleic acid building blocks designed in this example. Particularly large sets (or libraries) of progeny chimeric molecules can be generated. Because this method can be performed exhaustively and systematically, the method application can be repeated by choosing new demarcation points and with correspondingly newly designed nucleic acid building blocks, bypassing the burden of re-generating and re-screening previously examined and rejected molecular species. It is appreciated that codon wobble can be used to advantage to increase the frequency of a demarcation point.
  • a particular base can often be substituted into a nucleic acid building block without altering the amino acid encoded by the progenitor codon (that is now an altered codon) because of codon degeneracy.
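Codon degeneracy can be checked directly against the standard genetic code. The sketch below is an illustration only: it builds the standard codon table from its conventional compact form and lists, for a given codon, the single-base substitutions that leave the encoded amino acid unchanged. The function name is hypothetical, and the choice of the standard code (not an organism-specific one) is an assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def synonymous_single_base_changes(codon):
    """Return codons that differ from `codon` at exactly one base
    yet encode the same amino acid."""
    results = []
    for index, base in enumerate(codon):
        for alternative in BASES:
            if alternative == base:
                continue
            candidate = codon[:index] + alternative + codon[index + 1:]
            if CODON_TABLE[candidate] == CODON_TABLE[codon]:
                results.append(candidate)
    return results

print(synonymous_single_base_changes("CTG"))  # leucine has several options
print(synonymous_single_base_changes("ATG"))  # methionine has none
```

A codon with several synonymous alternatives gives more freedom to place a demarcation point without changing the encoded polypeptide.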
  • demarcation points are chosen upon alignment of 8 progenitor templates.
  • Nucleic acid building blocks including their overhangs are then designed and synthesized.
  • 18 nucleic acid building blocks are generated based on the sequence of each of the 8 progenitor templates, for a total of 144 nucleic acid building blocks (or double-stranded oligos). Performing the ligation synthesis procedure will then produce a library of progeny molecules comprised of 8^18 (or over 1.8 x 10^16) chimeras.
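The arithmetic behind these figures is short enough to verify directly. The sketch assumes that each of the 18 building-block positions can be filled independently by the corresponding block from any of the 8 progenitor templates; the variable names are illustrative.

```python
templates = 8              # progenitor templates aligned in the example
blocks_per_template = 18   # building blocks designed per template

# Every template contributes one block per position.
total_building_blocks = templates * blocks_per_template

# Each position is filled independently from any of the templates.
library_size = templates ** blocks_per_template

print(total_building_blocks)   # 144
print(library_size)            # 18014398509481984
print(f"{library_size:.1e}")   # 1.8e+16
```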
  • double-stranded nucleic acid building blocks are designed by aligning a plurality of progenitor nucleic acid templates. Preferably these templates contain some homology and some heterology.
  • the nucleic acids may encode related proteins, such as related enzymes, which relationship may be based on function or structure or both.
  • Figure 7 shows the alignment of three polynucleotide progenitor templates and the selection of demarcation points (boxed) shared by all the progenitor molecules.
  • the nucleic acid building blocks derived from each of the progenitor templates were chosen to be approximately 30 to 50 nucleotides in length.
  • Figure 8 Nucleic acid building blocks for synthetic ligation gene reassembly.
  • Figure 8 shows the nucleic acid building blocks from the example in Figure 7.
  • the nucleic acid building blocks are shown here in generic cartoon form, with their compatible overhangs, including both 5' and 3' overhangs.
  • Figure 9 Addition of Introns by Synthetic Ligation Reassembly.
  • Figure 9 shows in generic cartoon form that an intron may be introduced into a chimeric progeny molecule by way of a nucleic acid building block. It is appreciated that introns often have consensus sequences at both termini in order to render them operational. It is also appreciated that, in addition to enabling gene splicing, introns may serve an additional purpose by providing sites of homology to other nucleic acids to enable homologous recombination. For this purpose, and potentially others, it may be sometimes desirable to generate a large nucleic acid building block for introducing an intron.
  • such a specialized nucleic acid building block may also be generated by direct chemical synthesis of more than two single stranded oligos or by using a polymerase-based amplification reaction as shown in Figure 2.
  • Figure 10 Ligation Reassembly Using Fewer Than All The Nucleotides Of An Overhang.
  • Figure 10 shows that coupling can occur in a manner that does not make use of every nucleotide in a participating overhang. The coupling is particularly likely to survive (e.g. in a transformed host) if the coupling is reinforced by treatment with a ligase enzyme to form what may be referred to as a "gap ligation" or a "gapped ligation". It is appreciated that, as shown, this type of coupling can contribute to generation of unwanted background product(s), but it can also be used advantageously to increase the diversity of the progeny library generated by the designed ligation reassembly.
  • nucleic acid building blocks can be chemically made (or ordered) that lack a 5' phosphate group (or alternatively the group can be removed, e.g. by treatment with a phosphatase enzyme such as a calf intestinal alkaline phosphatase (CIAP)) in order to prevent palindromic self-ligations in ligation reassembly processes.
  • Figure 12 Pathway Engineering. It is a goal of this invention to provide ways of making new gene pathways using ligation reassembly, optionally with other directed evolution methods such as saturation mutagenesis.
  • Figure 12 illustrates a preferred approach that may be taken to achieve this goal.
  • naturally-occurring microbial gene pathways are linked more often than naturally-occurring eukaryotic (e.g. plant) gene pathways, which are sometimes only partially linked.
  • this invention provides that regulatory gene sequences (including promoters) can be introduced in the form of nucleic acid building blocks into progeny gene pathways generated by ligation reassembly processes.
  • originally linked microbial gene pathways, as well as originally unlinked genes and gene pathways can be thus converted to acquire operability in plants and other eukaryotes.
  • Figure 13 Avoidance of unwanted self-ligation in palindromic couplings.
  • Figure 13 illustrates that another goal of this invention, in addition to the generation of novel gene pathways, is the subjection of gene pathways - both naturally occurring and man-made - to mutagenesis and selection in order to achieve improved progeny molecules using the instantly disclosed methods of directed evolution (including saturation mutagenesis and synthetic ligation reassembly).
  • both microbial and plant pathways can be improved by directed evolution, and as shown, the directed evolution process can be performed both on genes prior to linking them into pathways, and on gene pathways themselves.
  • Figure 14 Conversion of Microbial Pathways to Eukaryotic Pathways.
  • this invention provides that microbial pathways can be converted to pathways operable in plants and other eukaryotic species by the introduction of regulatory sequences that function in those species.
  • Preferred regulatory sequences include promoters, operators, and activator binding sites.
  • a preferred method of achieving the introduction of such serviceable regulatory sequences is in the form of nucleic acid building blocks, particularly through the use of couplings in ligation reassembly processes. These couplings in Fig. 14 are marked with the letters A, B, C, D and F.
  • Fig. 15 Holistic engineering of differentially activatable stacked traits in novel transgenic plants using directed evolution and whole cell monitoring.
  • Figure 21 Starting population comprised of an organism strain to be subjected to improvement or evolution in order to produce a resultant population comprised of an improved organism strain that has a desired trait.
  • Figure 22 Starting population comprised of a genomic sequence to be subjected to improvement or evolution in order to produce a resultant population comprised of an improved genomic sequence that has a desired trait.
  • agent is used herein to denote a chemical compound, a mixture of chemical compounds, an array of spatially localized compounds (e.g., a VLSIPS peptide array, polynucleotide array, and/or combinatorial small molecule array), biological macromolecule, a bacteriophage peptide display library, a bacteriophage antibody (e.g., scFv) display library, a polysome peptide display library, or an extract made from biological materials such as bacteria, plants, fungi, or animal (particularly mammalian) cells or tissues.
  • Agents are evaluated for potential activity as anti-neoplastics, anti-inflammatories or apoptosis modulators by inclusion in screening assays described hereinbelow.
  • Agents are evaluated for potential activity as specific protein interaction inhibitors (i.e., an agent which selectively inhibits a binding interaction between two predetermined polypeptides but which does not substantially interfere with cell viability) by inclusion in screening assays described hereinbelow.
  • An "ambiguous base requirement" in a restriction site refers to a nucleotide base requirement that is not specified to the fullest extent, i.e.
  • R = G or A
  • Y = C or T
  • M = A or C
  • K = G or T
  • S = G or C
  • W = A or T
  • H = A or C or T
  • V = G or C or A
  • D = G or A or T
  • N = A or C or G or T
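Ambiguous base requirements of this kind can be handled with a lookup table. The sketch below uses the standard IUPAC nucleotide codes, which include the abbreviations listed above (the code B, for C or G or T, is part of the standard set and is added here as an assumption). The recognition pattern `GTYRAC` and the function names are used purely for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they permit.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def site_matches(pattern, sequence):
    """Return True if `sequence` satisfies every base requirement in `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, sequence))

def expand(pattern):
    """List every fully specified sequence that an ambiguous pattern permits."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in pattern))]

pattern = "GTYRAC"  # illustrative site with two ambiguous positions
print(expand(pattern))
print(site_matches(pattern, "GTTAAC"))
print(site_matches(pattern, "GTAAAC"))
```

A pattern with two two-fold ambiguous positions, as here, expands to four fully specified sequences.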
  • amino acid refers to any organic compound that contains an amino group (-NH2) and a carboxyl group (-COOH); preferably either as free groups or alternatively after condensation as part of peptide bonds.
  • the "twenty naturally encoded polypeptide-forming alpha-amino acids” are understood in the art and refer to: alanine (ala or A), arginine (arg or R), asparagine (asn or N), aspartic acid (asp or D), cysteine (cys or C), glutamic acid (glu or E), glutamine (gln or Q), glycine (gly or G), histidine (his or H), isoleucine (ile or I), leucine (leu or L), lysine (lys or K), methionine (met or M), phenylalanine (phe or F), proline (pro or P), serine (ser or S), threonine (thr or T), tryptophan (trp or W), tyrosine (tyr or Y), and valine (val or V).
  • amplification means that the number of copies of a polynucleotide is increased.
  • antibody refers to intact immunoglobulin molecules, as well as fragments of immunoglobulin molecules, such as Fab, Fab', (Fab')2, Fv, and SCA fragments, that are capable of binding to an epitope of an antigen.
  • Fab fragments of immunoglobulin molecules
  • Fab' fragments of immunoglobulin molecules
  • Fv fragments of immunoglobulin molecules
  • SCA fragments that are capable of binding to an epitope of an antigen.
  • These antibody fragments, which retain some ability to selectively bind to an antigen (e.g., a polypeptide antigen) of the antibody from which they are derived, can be made using methods well known in the art (see, e.g., Harlow and Lane, supra), and are described further as follows.
  • An Fab fragment consists of a monovalent antigen-binding fragment of an antibody molecule, and can be produced by digestion of a whole antibody molecule with the enzyme papain, to yield a fragment consisting of an intact light chain and a portion of a heavy chain.
  • An Fab' fragment of an antibody molecule can be obtained by treating a whole antibody molecule with pepsin, followed by reduction, to yield a molecule consisting of an intact light chain and a portion of a heavy chain. Two Fab' fragments are obtained per antibody molecule treated in this manner.
  • An (Fab')2 fragment of an antibody can be obtained by treating a whole antibody molecule with the enzyme pepsin, without subsequent reduction.
  • a (Fab')2 fragment is a dimer of two Fab' fragments, held together by two disulfide bonds.
  • An Fv fragment is defined as a genetically engineered fragment containing the variable region of a light chain and the variable region of a heavy chain expressed as two chains.
  • SCA single chain antibody
  • AME Applied Molecular Evolution
  • a molecule that has a "chimeric property" is a molecule that is: 1) in part homologous and in part heterologous to a first reference molecule; while 2) at the same time being in part homologous and in part heterologous to a second reference molecule; without 3) precluding the possibility of being at the same time in part homologous and in part heterologous to still one or more additional reference molecules.
  • a chimeric molecule may be prepared by assembling a reassortment of partial molecular sequences.
  • a chimeric polynucleotide molecule may be prepared by synthesizing the chimeric polynucleotide using a plurality of molecular templates, such that the resultant chimeric polynucleotide has properties of a plurality of templates.
  • cognate refers to a gene sequence that is evolutionarily and functionally related between species.
  • human CD4 gene is the cognate gene to the mouse 3d4 gene, since the sequences and structures of these two genes indicate that they are highly homologous and both genes encode a protein which functions in signaling T cell activation through MHC class II-restricted antigen recognition.
  • a “comparison window,” as used herein, refers to a conceptual segment of at least 20 contiguous nucleotide positions wherein a polynucleotide sequence may be compared to a reference sequence of at least 20 contiguous nucleotides and wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) of 20 percent or less as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences.
  • Optimal alignment of sequences for aligning a comparison window may be conducted by the local homology algorithm of Smith (Smith and Waterman, Adv Appl Math, 1981; Smith and Waterman, J Theor Biol, 1981; Smith and Waterman, J Mol Biol, 1981; Smith et al, J Mol Evol, 1981), by the homology alignment algorithm of Needleman (Needleman and Wunsch, 1970), by the search for similarity method of Pearson (Pearson and Lipman, 1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, WI), or by inspection, and the best alignment (i.e., resulting in the highest percentage of homology over the comparison window) generated by the various methods is selected.
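As a rough illustration of the local homology approach mentioned above, the following minimal sketch computes a Smith-Waterman local alignment score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; the source does not specify scoring parameters, and packaged implementations such as BESTFIT use their own schemes.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so the score never drops below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment ACGT contributes four matches.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only the score is returned here; recovering the aligned segment itself would require keeping the full matrix and tracing back from the best cell.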
  • complementarity-determining region and "CDR” refer to the art-recognized term as exemplified by the Kabat and Chothia CDR definitions also generally known as supervariable regions or hypervariable loops (Chothia and Lesk, 1987; Chothia et al, 1989; Kabat et al, 1987; and Tramontano et al, 1990).
  • Variable region domains typically comprise the amino-terminal approximately 105-115 amino acids of a naturally-occurring immunoglobulin chain (e.g., amino acids 1-110), although variable domains somewhat shorter or longer are also suitable for forming single-chain antibodies.
  • Conservative amino acid substitutions refer to the interchangeability of residues having similar side chains.
  • a group of amino acids having aliphatic side chains is glycine, alanine, valine, leucine, and isoleucine
  • a group of amino acids having aliphatic-hydroxyl side chains is serine and threonine
  • a group of amino acids having amide-containing side chains is asparagine and glutamine
  • a group of amino acids having aromatic side chains is phenylalanine, tyrosine, and tryptophan
  • a group of amino acids having basic side chains is lysine, arginine, and histidine
  • a group of amino acids having sulfur-containing side chains is cysteine and methionine.
  • Preferred conservative amino acid substitution groups are: valine-leucine-isoleucine, phenylalanine-tyrosine, lysine-arginine, alanine-valine, and asparagine-glutamine.
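The side-chain groups and preferred substitution groups listed above can be encoded as simple lookup sets. This is a minimal sketch using one-letter amino acid codes; the names `SIDE_CHAIN_GROUPS`, `PREFERRED_GROUPS`, `shared_groups` and `is_preferred_conservative` are hypothetical helpers, not part of the source.

```python
# Side-chain groups as listed in the text (one-letter codes).
SIDE_CHAIN_GROUPS = {
    "aliphatic": set("GAVLI"),
    "aliphatic-hydroxyl": set("ST"),
    "amide-containing": set("NQ"),
    "aromatic": set("FYW"),
    "basic": set("KRH"),
    "sulfur-containing": set("CM"),
}

# Preferred conservative substitution groups as listed in the text.
PREFERRED_GROUPS = [set("VLI"), set("FY"), set("KR"), set("AV"), set("NQ")]


def shared_groups(a: str, b: str) -> list[str]:
    """Names of the side-chain groups that contain both residues."""
    return sorted(name for name, members in SIDE_CHAIN_GROUPS.items()
                  if a in members and b in members)


def is_preferred_conservative(a: str, b: str) -> bool:
    """True if replacing a with b falls within a preferred group."""
    return a != b and any(a in g and b in g for g in PREFERRED_GROUPS)


print(is_preferred_conservative("L", "I"), shared_groups("K", "H"))
```

Note that lysine and histidine share the basic side-chain group but are not among the preferred pairs, so the two checks can disagree.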
  • a polynucleotide sequence is homologous (i.e., is identical, not strictly evolutionarily related) to all or a portion of a reference polynucleotide sequence, or that a polypeptide sequence is identical to a reference polypeptide sequence.
  • the term “complementary to” is used herein to mean that the complementary sequence is homologous to all or a portion of a reference polynucleotide sequence.
  • TATAC corresponds to a reference "TATAC” and is complementary to a reference sequence "GTATA.”
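The TATAC/GTATA illustration can be checked with a small reverse-complement sketch. It assumes plain DNA sequences over A, C, G and T written 5' to 3'; the helper names are hypothetical.

```python
# Translation table for Watson-Crick base pairing in DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse to read the other strand 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_complementary_to(seq: str, reference: str) -> bool:
    """True if seq is the reverse complement of the whole reference."""
    return seq.upper() == reverse_complement(reference)


print(reverse_complement("GTATA"))
print(is_complementary_to("TATAC", "GTATA"))
```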
  • degradation effective amount refers to the amount of enzyme which is required to process at least 50% of the substrate, as compared to substrate not contacted with the enzyme. Preferably, at least 80% of the substrate is degraded.
  • defined sequence framework refers to a set of defined sequences that are selected on a non-random basis, generally on the basis of experimental data or structural data; for example, a defined sequence framework may comprise a set of amino acid sequences that are predicted to form a β-sheet structure or may comprise a leucine zipper heptad repeat motif, a zinc-finger domain, among other variations.
  • a “defined sequence kernel” is a set of sequences which encompass a limited scope of variability.
  • a completely random 10-mer sequence of the 20 conventional amino acids can be any of 20^10 sequences
  • a pseudorandom 10-mer sequence of the 20 conventional amino acids can be any of 20^10 sequences but will exhibit a bias for certain residues at certain positions and/or overall
  • a defined sequence kernel is a subset of the sequences that would be possible if each residue position was allowed to be any of the allowable 20 conventional amino acids (and/or allowable unconventional amino/imino acids).
  • a defined sequence kernel generally comprises variant and invariant residue positions and/or comprises variant residue positions which can comprise a residue selected from a defined subset of amino acid residues, and the like, either segmentally or over the entire length of the individual selected library member sequence.
  • sequence kernels can refer to either amino acid sequences or polynucleotide sequences.
  • sequences (NNK)10 and (NNM)10, wherein N represents A, T, G, or C; K represents G or T; and M represents A or C, are defined sequence kernels.
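To give a sense of scale for these sequence spaces, the sketch below counts the distinct sequences in a fully random amino acid 10-mer and in a degenerate nucleotide unit repeated ten times. The repeat count of ten for the NNK unit and the helper names are assumptions made for illustration.

```python
from math import prod

# Number of bases each symbol can stand for (N, K, M as defined in the text).
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "K": 2, "M": 2, "N": 4}


def nucleotide_variants(unit: str, repeats: int) -> int:
    """Count the distinct nucleotide sequences in (unit) repeated `repeats` times."""
    return prod(DEGENERACY[base] for base in unit.upper()) ** repeats


def peptide_variants(length: int, alphabet_size: int = 20) -> int:
    """Count the distinct peptides of a given length over the alphabet."""
    return alphabet_size ** length


print(peptide_variants(10))            # fully random amino acid 10-mer
print(nucleotide_variants("NNK", 10))  # NNK unit repeated ten times
```

A defined sequence kernel restricts some positions, so its size is the product of the per-position choices rather than the full 20^10.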
  • “Digestion” of DNA refers to catalytic cleavage of the DNA with a restriction enzyme that acts only at certain sequences in the DNA.
  • the various restriction enzymes used herein are commercially available and their reaction conditions, cofactors and other requirements were used as would be known to the ordinarily skilled artisan.
  • For analytical purposes typically 1 µg of plasmid or DNA fragment is used with about 2 units of enzyme in about 20 µl of buffer solution.
  • Directional ligation refers to a ligation in which a 5' end and a 3' end of a polynucleotide are different enough to specify a preferred ligation orientation.
  • an otherwise untreated and undigested PCR product that has two blunt ends will typically not have a preferred ligation orientation when ligated into a cloning vector digested to produce blunt ends in its multiple cloning site; thus, directional ligation will typically not be displayed under these circumstances.
  • directional ligation will typically be displayed when a digested PCR product having a 5' EcoR I-treated end and a 3' BamH I-treated end is ligated into a cloning vector that has a multiple cloning site digested with EcoR I and BamH I.
  • DNA shuffling is used herein to indicate recombination between substantially homologous but non-identical sequences; in some embodiments DNA shuffling may involve crossover via non-homologous recombination, such as via cre/lox and/or flp/frt systems and the like.
  • epitope refers to an antigenic determinant on an antigen, such as a phytase polypeptide, to which the paratope of an antibody, such as a phytase-specific antibody, binds.
  • Antigenic determinants usually consist of chemically active surface groupings of molecules, such as amino acids or sugar side chains, and can have specific three-dimensional structural characteristics, as well as specific charge characteristics.
  • epitope refers to that portion of an antigen or other macromolecule capable of forming a binding interaction that interacts with the variable region binding body of an antibody. Typically, such binding interaction is manifested as an intermolecular contact with one or more amino acid residues of a CDR.
  • fragment, when referring to a reference polypeptide, comprises a polypeptide which retains at least one biological function or activity that is at least essentially the same as that of the reference polypeptide. Furthermore, the terms “fragment”, “derivative” or “analog” are exemplified by a "pro-form” molecule, such as a low activity proprotein that can be modified by cleavage to produce a mature enzyme with significantly higher activity.
  • a method for producing from a template polypeptide a set of progeny polypeptides in which a "full range of single amino acid substitutions" is represented at each amino acid position.
  • “full range of single amino acid substitutions” is in reference to the 20 naturally encoded polypeptide-forming alpha-amino acids, as described herein.
  • the term "gene” means the segment of DNA involved in producing a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) between individual coding segments (exons).
  • Genetic instability refers to the natural tendency of highly repetitive sequences to be lost through a process of reductive events generally involving sequence simplification through the loss of repeated sequences. Deletions tend to involve the loss of one copy of a repeat and everything between the repeats.
  • heterologous means that one single-stranded nucleic acid sequence is unable to hybridize to another single-stranded nucleic acid sequence or its complement.
  • areas of heterology means that areas of polynucleotides or polynucleotides have areas or regions within their sequence which are unable to hybridize to another nucleic acid or polynucleotide. Such regions or areas are for example areas of mutations.
  • homologous or “homeologous” means that one single-stranded nucleic acid sequence may hybridize to a complementary single-stranded nucleic acid sequence.
  • the degree of hybridization may depend on a number of factors including the amount of identity between the sequences and the hybridization conditions such as temperature and salt concentrations as discussed later.
  • the region of identity is greater than about 5 bp, more preferably the region of identity is greater than 10 bp.
  • An immunoglobulin light or heavy chain variable region consists of a "framework" region interrupted by three hypervariable regions, also called CDR's.
  • the extent of the framework region and CDR's have been precisely defined; see “Sequences of Proteins of Immunological Interest” (Kabat et al, 1987).
  • the sequences of the framework regions of different light or heavy chains are relatively conserved within a species.
  • a "human framework region” is a framework region that is substantially identical (about 85% or more, usually 90-95% or more) to the framework region of a naturally occurring human immunoglobulin.
  • the framework region of an antibody, that is, the combined framework regions of the constituent light and heavy chains, serves to position and align the CDR's.
  • the CDR's are primarily responsible for binding to an epitope of an antigen.
  • the benefits of this invention extend to "commercial applications" (or commercial processes), which term is used to include applications in commercial industry proper (or simply industry) as well as non-commercial commercial applications (e.g. biomedical research at a non-profit institution). Relevant applications include those in areas of diagnosis, medicine, agriculture, manufacturing, and academia.
  • identity means that two nucleic acid sequences have the same sequence or a complementary sequence.
  • areas of identity means that regions or areas of a polynucleotide or the overall polynucleotide are identical or complementary to areas of another polynucleotide or the polynucleotide.
  • isolated means that the material is removed from its original environment (e.g., the natural environment if it is naturally occurring).
  • a naturally-occurring polynucleotide or enzyme present in a living animal is not isolated, but the same polynucleotide or enzyme, separated from some or all of the coexisting materials in the natural system, is isolated.
  • Such polynucleotides could be part of a vector and/or such polynucleotides or enzymes could be part of a composition, and still be isolated in that such vector or composition is not part of its natural environment.
  • isolated nucleic acid is meant a nucleic acid, e.g., a DNA or RNA molecule, that is not immediately contiguous with the 5' and 3' flanking sequences with which it normally is immediately contiguous when present in the naturally occurring genome of the organism from which it is derived.
  • the term thus describes, for example, a nucleic acid that is incorporated into a vector, such as a plasmid or viral vector; a nucleic acid that is incorporated into the genome of a heterologous cell (or the genome of a homologous cell, but at a site different from that at which it naturally occurs); and a nucleic acid that exists as a separate molecule, e.g., a DNA fragment produced by PCR amplification or restriction enzyme digestion, or an RNA molecule produced by in vitro transcription.
  • the term also describes a recombinant nucleic acid that forms part of a hybrid gene encoding additional polypeptide sequences that can be used, for example, in the production of a fusion protein.
  • ligand refers to a molecule, such as a random peptide or variable segment sequence, that is recognized by a particular receptor.
  • a molecule or macromolecular complex
  • the binding partner having a smaller molecular weight is referred to as the ligand and the binding partner having a greater molecular weight is referred to as a receptor.
  • Ligation refers to the process of forming phosphodiester bonds between two double stranded nucleic acid fragments (Sambrook et al, 1982, p. 146; Sambrook, 1989). Unless otherwise provided, ligation may be accomplished using known buffers and conditions with 10 units of T4 DNA ligase ("ligase”) per 0.5 µg of approximately equimolar amounts of the DNA fragments to be ligated.
  • ligase T4 DNA ligase
  • linker refers to a molecule or group of molecules that connects two molecules, such as a DNA binding protein and a random peptide, and serves to place the two molecules in a preferred configuration, e.g., so that the random peptide can bind to a receptor with minimal steric hindrance from the DNA binding protein.
  • a "molecular property to be evolved” includes reference to molecules comprised of a polynucleotide sequence, molecules comprised of a polypeptide sequence, and molecules comprised in part of a polynucleotide sequence and in part of a polypeptide sequence.
  • Particularly relevant - but by no means limiting - examples of molecular properties to be evolved include enzymatic activities at specified conditions, such as related to temperature; salinity; pressure; pH; and concentration of glycerol, DMSO, detergent, and/or any other molecular species with which contact is made in a reaction environment.
  • Additional particularly relevant - but by no means limiting - examples of molecular properties to be evolved include stabilities - e.g. the amount of a residual molecular property that is present after a specified exposure time to a specified environment, such as may be encountered during storage.
  • mutations includes changes in the sequence of a wild-type or parental nucleic acid sequence or changes in the sequence of a peptide. Such mutations may be point mutations such as transitions or transversions. The mutations may be deletions, insertions or duplications. A mutation can also be a "chimerization", which is exemplified in a progeny molecule that is generated to contain part or all of a sequence of one parental molecule as well as part or all of a sequence of at least one other parental molecule. This invention provides for both chimeric polynucleotides and chimeric polypeptides.
  • N,N,G/T nucleotide sequence represents 32 possible triplets, where "N” can be A, C, G or T.
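The count of 32 triplets can be reproduced by enumeration. The sketch below also translates each triplet with the standard genetic code to show which amino acids an N,N,G/T position can encode; the use of the standard code and the variable names are assumptions of this illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AMINO[i]
               for i, c in enumerate(product(BASES, repeat=3))}


def nng_t_triplets() -> list[str]:
    """Enumerate N,N,G/T triplets: any base, any base, then G or T."""
    return ["".join(c) for c in product("ACGT", "ACGT", "GT")]


triplets = nng_t_triplets()
amino_acids = {CODON_TABLE[t] for t in triplets} - {"*"}
stop_triplets = [t for t in triplets if CODON_TABLE[t] == "*"]

print(len(triplets), len(amino_acids), stop_triplets)
```

Under the standard code, the 32 triplets cover all 20 amino acids while including only one of the three stop codons, which is why this degeneracy is convenient for saturation mutagenesis.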
  • naturally-occurring refers to the fact that an object can be found in nature.
  • a polypeptide or polynucleotide sequence that is present in an organism (including viruses) that can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory is naturally occurring.
  • naturally occurring refers to an object as present in a non-pathological (un-diseased) individual, such as would be typical for the species.
  • nucleic acid molecule is comprised of at least one base or one base pair, depending on whether it is single-stranded or double-stranded, respectively.
  • a nucleic acid molecule may belong exclusively or chimerically to any group of nucleotide-containing molecules, as exemplified by, but not limited to, the following groups of nucleic acid molecules: RNA, DNA, genomic nucleic acids, non-genomic nucleic acids, naturally occurring and not naturally occurring nucleic acids, and synthetic nucleic acids. This includes, by way of non-limiting example, nucleic acids associated with any organelle, such as the mitochondria, ribosomal RNA, and nucleic acid molecules comprised chimerically of one or more components that are not naturally occurring along with naturally occurring components.
  • nucleic acid molecule may contain in part one or more non-nucleotide-based components as exemplified by, but not limited to, amino acids and sugars.
  • a ribozyme that is in part nucleotide-based and in part protein-based is considered a "nucleic acid molecule”.
  • a nucleic acid molecule that is labeled with a detectable moiety such as a radioactive or alternatively a non-radioactive label, is likewise considered a "nucleic acid molecule”.
  • "nucleic acid sequence coding for" or a "DNA coding sequence of" or a “nucleotide sequence encoding” a particular enzyme, as well as other synonymous terms, refers to a DNA sequence which is transcribed and translated into an enzyme when placed under the control of appropriate regulatory sequences.
  • a "promoter sequence” is a DNA regulatory region capable of binding RNA polymerase in a cell and initiating transcription of a downstream (3' direction) coding sequence. The promoter is part of the DNA sequence. This sequence region has a start codon at its 3' terminus. The promoter sequence includes the minimum number of bases or elements necessary to initiate transcription at levels detectable above background.
  • RNA polymerase binds the sequence and transcription is initiated at the start codon (3' terminus with a promoter)
  • transcription proceeds downstream in the 3' direction.
  • a transcription initiation site (conveniently defined by mapping with nuclease S1) as well as protein binding domains (consensus sequences) responsible for the binding of RNA polymerase.
  • nucleic acid encoding an enzyme (protein) or “DNA encoding an enzyme (protein)” or “polynucleotide encoding an enzyme (protein)” and other synonymous terms encompasses a polynucleotide which includes only coding sequence for the enzyme as well as a polynucleotide which includes additional coding and/or non-coding sequence.
  • a "specific nucleic acid molecule species” is defined by its chemical structure, as exemplified by, but not limited to, its primary sequence.
  • a specific "nucleic acid molecule species” is defined by a function of the nucleic acid species or by a function of a product derived from the nucleic acid species.
  • a “specific nucleic acid molecule species” may be defined by one or more activities or properties attributable to it, including activities or properties attributable to its expressed product.
  • the instant definition of "assembling a working nucleic acid sample into a nucleic acid library” includes the process of incorporating a nucleic acid sample into a vector-based collection, such as by ligation into a vector and transformation of a host. A description of relevant vectors, hosts, and other reagents as well as specific non-limiting examples thereof are provided hereinafter.
  • the instant definition of "assembling a working nucleic acid sample into a nucleic acid library” also includes the process of incorporating a nucleic acid sample into a non-vector-based collection, such as by ligation to adaptors.
  • the adaptors can anneal to PCR primers to facilitate amplification by PCR.
  • a "nucleic acid library” is comprised of a vector-based collection of one or more nucleic acid molecules.
  • a "nucleic acid library” is comprised of a non-vector-based collection of nucleic acid molecules.
  • a "nucleic acid library” is comprised of a combined collection of nucleic acid molecules that is in part vector-based and in part non-vector-based.
  • the collection of molecules comprising a library is searchable and separable according to individual nucleic acid molecule species.
  • the present invention provides a "nucleic acid construct” or alternatively a “nucleotide construct” or alternatively a "DNA construct”.
  • construct is used herein to describe a molecule, such as a polynucleotide (e.g., a phytase polynucleotide), that may optionally be chemically bonded to one or more additional molecular moieties, such as a vector, or parts of a vector.
  • a nucleotide construct is exemplified by DNA expression constructs suitable for the transformation of a host cell.
  • oligonucleotide refers to either a single stranded polydeoxynucleotide or two complementary polydeoxynucleotide strands which may be chemically synthesized. Such synthetic oligonucleotides may or may not have a 5' phosphate. Those that do not will not ligate to another oligonucleotide without adding a phosphate with an ATP in the presence of a kinase. A synthetic oligonucleotide will ligate to a fragment that has not been dephosphorylated.
  • a "32-fold degenerate oligonucleotide that is comprised of, in series, at least a first homologous sequence, a degenerate N,N,G/T sequence, and a second homologous sequence" is mentioned.
  • homologous is in reference to homology between the oligo and the parental polynucleotide that is subjected to the polymerase-based amplification.
  • operably linked refers to a linkage of polynucleotide elements in a functional relationship.
  • a nucleic acid is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence.
  • a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the coding sequence.
  • Operably linked means that the DNA sequences being linked are typically contiguous and, where necessary to join two protein coding regions, contiguous and in reading frame.
  • a coding sequence is "operably linked to" another coding sequence when RNA polymerase will transcribe the two coding sequences into a single mRNA, which is then translated into a single polypeptide having amino acids derived from both coding sequences.
  • the coding sequences need not be contiguous to one another so long as the expressed sequences are ultimately processed to produce the desired protein.
  • parental polynucleotide set is a set comprised of one or more distinct polynucleotide species. Usually this term is used in reference to a progeny polynucleotide set which is preferably obtained by mutagenization of the parental set, in which case the terms “parental”, “starting” and “template” are used interchangeably.
  • physiological conditions refers to temperature, pH, ionic strength, viscosity, and like biochemical parameters which are compatible with a viable organism, and/or which typically exist intracellularly in a viable cultured yeast cell or mammalian cell.
  • intracellular conditions in a yeast cell grown under typical laboratory culture conditions are physiological conditions.
  • Suitable in vitro reaction conditions for in vitro transcription cocktails are generally physiological conditions.
  • in vitro physiological conditions comprise 50-200 mM NaCl or KCl, pH 6.5-8.5, 20-45 °C and 0.001-10 mM divalent cation (e.g., Mg++, Ca++); preferably about 150 mM NaCl or KCl, pH 7.2-7.6, 5 mM divalent cation, and often include 0.01-1.0 percent nonspecific protein (e.g., BSA).
  • a non-ionic detergent can often be present, usually at about 0.001 to 2%, typically 0.05-0.2% (v/v).
  • Particular aqueous conditions may be selected by the practitioner according to conventional methods.
  • buffered aqueous conditions may be applicable: 10-250 mM NaCl, 5-50 mM Tris HCl, pH 5-8, with optional addition of divalent cation(s) and/or metal chelators and/or non-ionic detergents and/or membrane fractions and/or anti-foam agents and/or scintillants.
  • population means a collection of components such as polynucleotides, portions of polynucleotides or proteins.
  • a molecule having a "pro-form” refers to a molecule that undergoes any combination of one or more covalent and noncovalent chemical modifications (e.g. glycosylation, proteolytic cleavage, dimerization or oligomerization, temperature-induced or pH-induced conformational change, association with a co-factor, etc.) en route to attain a more mature molecular form having a property difference (e.g. an increase in activity) in comparison with the reference pro-form molecule.
  • the reference precursor molecule may be termed a "pre-pro-form" molecule.
  • the term "pseudorandom” refers to a set of sequences that have limited variability, such that, for example, the degree of residue variability at one position is different than the degree of residue variability at another position, but any pseudorandom position is allowed some degree of residue variation, however circumscribed.
  • “Quasi-repeated units”, as used herein, refers to the repeats to be re-assorted and are by definition not identical. Indeed the method is proposed not only for practically identical encoding units produced by mutagenesis of the identical starting sequence, but also the reassortment of similar or related sequences which may diverge significantly in some regions. Nevertheless, if the sequences contain sufficient homologies to be reassorted by this approach, they can be referred to as "quasi-repeated" units.
  • random peptide library refers to a set of polynucleotide sequences that encodes a set of random peptides, and to the set of random peptides encoded by those polynucleotide sequences, as well as the fusion proteins containing those random peptides.
  • random peptide sequence refers to an amino acid sequence composed of two or more amino acid monomers and constructed by a stochastic or random process.
  • a random peptide can include framework or scaffolding motifs, which may comprise invariant sequences.
  • receptor refers to a molecule that has an affinity for a given ligand. Receptors can be naturally occurring or synthetic molecules. Receptors can be employed in an unaltered state or as aggregates with other species. Receptors can be attached, covalently or non-covalently, to a binding member, either directly or via a specific binding substance. Examples of receptors include, but are not limited to, antibodies, including monoclonal antibodies and antisera reactive with specific antigenic determinants (such as on viruses, cells, or other materials), cell membrane receptors, complex carbohydrates and glycoproteins, enzymes, and hormone receptors.
  • Recombinant enzymes refer to enzymes produced by recombinant DNA techniques, i.e., produced from cells transformed by an exogenous DNA construct encoding the desired enzyme.
  • Synthetic enzymes are those prepared by chemical synthesis.
  • the following terms are used to describe the sequence relationships between two or more polynucleotides: “reference sequence,” “comparison window,” “sequence identity,” “percentage of sequence identity,” and “substantial identity.”
  • a "reference sequence” is a defined sequence used as a basis for a sequence comparison; a reference sequence may be a subset of a larger sequence, for example, as a segment of a full-length cDNA or gene sequence given in a sequence listing, or may comprise a complete cDNA or gene sequence. Generally, a reference sequence is at least 20 nucleotides in length, frequently at least 25 nucleotides in length, and often at least 50 nucleotides in length.
  • two polynucleotides may each (1) comprise a sequence (i.e., a portion of the complete polynucleotide sequence) that is similar between the two polynucleotides and (2) may further comprise a sequence that is divergent between the two polynucleotides
  • sequence comparisons between two (or more) polynucleotides are typically performed by comparing sequences of the two polynucleotides over a "comparison window" to identify and compare local regions of sequence similarity.
  • Repetitive Index (RI) is the average number of copies of the quasi-repeated units contained in the cloning vector.
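Since the Repetitive Index is defined as an average, it can be computed directly from per-clone copy counts. The counts in this sketch are hypothetical values chosen for illustration, not data from the source.

```python
from statistics import mean


def repetitive_index(copies_per_clone: list[int]) -> float:
    """Average number of quasi-repeated units per cloning vector."""
    return mean(copies_per_clone)


# Hypothetical copy counts observed in four clones.
print(repetitive_index([3, 4, 5, 4]))
```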
  • restriction site refers to a recognition sequence that is necessary for the manifestation of the action of a restriction enzyme, and includes a site of catalytic cleavage. It is appreciated that a site of cleavage may or may not be contained within a portion of a restriction site that comprises a low ambiguity sequence (i.e. a sequence containing the principal determinant of the frequency of occurrence of the restriction site). Thus, in many cases, relevant restriction sites contain only a low ambiguity sequence with an internal cleavage site (e.g. G/AATTC in the EcoR I site) or an immediately adjacent cleavage site (e.g. /CCWGG in the EcoR II site). In other cases, relevant restriction sites [e.g. the Eco57 I site or CTGAAG(16/14)] contain a low ambiguity sequence (e.g. the CTGAAG sequence in the Eco57 I site) with an external cleavage site (e.g. in the N16 portion of the Eco57 I site).
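The three cleavage geometries named above (internal, immediately adjacent, external) can be illustrated with a minimal Python sketch. It is an illustration only: the function name `top_strand_cuts`, the `ENZYMES` table layout, and the restriction to the top strand in the forward orientation are assumptions made here, not part of the disclosure.

```python
import re

# IUPAC ambiguity codes needed for the examples (W = A or T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "N": "[ACGT]"}

# Recognition sequence and top-strand cut offset, counted from the first base of the site.
# EcoRI   G/AATTC        internal cut, offset 1
# EcoRII  /CCWGG         immediately adjacent cut, offset 0
# Eco57I  CTGAAG(16/14)  external cut, 16 nt past the 6-nt site, offset 6 + 16
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "EcoRII": ("CCWGG", 0),
    "Eco57I": ("CTGAAG", 22),
}

def top_strand_cuts(seq, enzyme):
    """Return 0-based indices p such that the top strand is cut between seq[p-1] and seq[p]."""
    site, offset = ENZYMES[enzyme]
    pattern = "".join(IUPAC[b] for b in site)
    cuts = []
    # A lookahead lets overlapping occurrences of the site be reported as well.
    for match in re.finditer(f"(?=({pattern}))", seq.upper()):
        cut = match.start() + offset
        if cut <= len(seq):  # an external cut may fall beyond a short molecule
            cuts.append(cut)
    return cuts
```

For example, `top_strand_cuts("TTGAATTCAA", "EcoRI")` reports a cut inside the site, whereas an Eco57 I cut is reported only when at least 16 nucleotides follow the CTGAAG sequence.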
  • when an enzyme (e.g. a restriction enzyme) is said to "cleave" a polynucleotide, it is understood to mean that the restriction enzyme catalyzes or facilitates a cleavage of a polynucleotide.
  • a "selectable polynucleotide” is comprised of a 5' terminal region (or end region), an intermediate region (i.e. an internal or central region), and a 3' terminal region (or end region).
  • a 5' terminal region is a region that is located towards a 5' polynucleotide terminus (or a 5' polynucleotide end); thus it is either partially or entirely in a 5' half of a polynucleotide.
  • a 3' terminal region is a region that is located towards a 3' polynucleotide terminus (or a 3' polynucleotide end); thus it is either partially or entirely in a 3' half of a polynucleotide.
  • sequence identity means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) over the window of comparison.
  • percentage of sequence identity is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity.
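The calculation just described can be written out as a short Python sketch. It assumes the two sequences have already been optimally aligned over the comparison window and that alignment gaps are written as "-"; both the gap character and the function name `percent_identity` are assumptions for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of window positions at which both aligned sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must span the same comparison window")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    window_size = len(aligned_a)
    # A gap never counts as a matched position, but it still counts toward the window size.
    matched = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matched / window_size
```

With a 10-position window and one mismatch, `percent_identity("ACGTACGTAC", "ACGTTCGTAC")` gives 9 matched positions out of 10.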
  • substantially identical denotes a characteristic of a polynucleotide sequence, wherein the polynucleotide comprises a sequence having at least 80 percent sequence identity, preferably at least 85 percent identity, often 90 to 95 percent sequence identity, and most commonly at least 99 percent sequence identity as compared to a reference sequence of a comparison window of at least 25-50 nucleotides, wherein the percentage of sequence identity is calculated by comparing the reference sequence to the polynucleotide sequence which may include deletions or additions which total 20 percent or less of the reference sequence over the window of comparison.
  • similarity between two enzymes is determined by comparing the amino acid sequence and its conserved amino acid substitutes of one enzyme to the sequence of a second enzyme. Similarity may be determined by procedures which are well-known in the art, for example, a BLAST program (Basic Local Alignment Search Tool at the National Center for Biotechnology Information).
  • single-chain antibody refers to a polypeptide comprising a VH domain and a VL domain in polypeptide linkage, generally linked via a spacer peptide (e.g., [Gly-Gly-Gly-Gly-Ser]x), and which may comprise additional amino acid sequences at the amino- and/or carboxy- termini.
  • a single-chain antibody may comprise a tether segment for linking to the encoding polynucleotide.
  • a scFv is a single-chain antibody.
  • Single-chain antibodies are generally proteins consisting of one or more polypeptide segments of at least 10 contiguous amino acids substantially encoded by genes of the immunoglobulin superfamily (e.g., see Williams and Barclay, 1989, pp. 361-368, which is incorporated herein by reference), most frequently encoded by a rodent, non-human primate, avian, porcine, bovine, ovine, goat, or human heavy chain or light chain gene sequence.
  • a functional single-chain antibody generally contains a sufficient portion of an immunoglobulin superfamily gene product so as to retain the property of binding to a specific target molecule, typically a receptor or antigen (epitope).
  • the members of a pair of molecules are said to "specifically bind" to each other if they bind to each other with greater affinity than to other, non-specific molecules.
  • an antibody raised against an antigen to which it binds more efficiently than to a non-specific protein can be described as specifically binding to the antigen.
  • a nucleic acid probe can be described as specifically binding to a nucleic acid target if it forms a specific duplex with the target by base pairing interactions (see above).
  • Specific hybridization is defined herein as the formation of hybrids between a first polynucleotide and a second polynucleotide (e.g., a polynucleotide having a distinct but substantially identical sequence to the first polynucleotide), wherein substantially unrelated polynucleotide sequences do not form hybrids in the mixture.
  • the term "specific polynucleotide" means a polynucleotide having certain end points and having a certain nucleic acid sequence. Two polynucleotides wherein one polynucleotide has the identical sequence as a portion of the second polynucleotide but different ends comprise two different specific polynucleotides.
  • "Stringent hybridization conditions" means hybridization will occur only if there is at least 90% identity, preferably at least 95% identity and most preferably at least 97% identity between the sequences. See Sambrook et al, 1989, which is hereby incorporated by reference in its entirety.
  • a "substantially identical" amino acid sequence is a sequence that differs from a reference sequence only by conservative amino acid substitutions, for example, substitutions of one amino acid for another of the same class (e.g., substitution of one hydrophobic amino acid, such as isoleucine, valine, leucine, or methionine, for another, or substitution of one polar amino acid for another, such as substitution of arginine for lysine, glutamic acid for aspartic acid, or glutamine for asparagine).
  • in addition, a substantially identical amino acid sequence is a sequence that differs from a reference sequence by one or more non-conservative substitutions, deletions, or insertions, particularly when such a substitution occurs at a site that is not the active site of the molecule, and provided that the polypeptide essentially retains its behavioural properties.
  • one or more amino acids can be deleted from a phytase polypeptide, resulting in modification of the structure of the polypeptide, without significantly altering its biological activity.
  • amino- or carboxyl-terminal amino acids that are not required for phytase biological activity can be removed. Such modifications can result in the development of smaller active phytase polypeptides.
  • the present invention provides a "substantially pure enzyme".
  • the term "substantially pure enzyme” is used herein to describe a molecule, such as a polypeptide (e.g., a phytase polypeptide, or a fragment thereof) that is substantially free of other proteins, lipids, carbohydrates, nucleic acids, and other biological materials with which it is naturally associated.
  • a substantially pure molecule, such as a polypeptide can be at least 60%, by dry weight, the molecule of interest.
  • the purity of the polypeptides can be determined using standard methods including, e.g., polyacrylamide gel electrophoresis (e.g., SDS-PAGE), column chromatography (e.g., high performance liquid chromatography (HPLC)), and amino-terminal amino acid sequence analysis.
  • substantially pure means an object species is the predominant species present (i.e., on a molar basis it is more abundant than any other individual macromolecular species in the composition), and preferably a substantially purified fraction is a composition wherein the object species comprises at least about 50 percent (on a molar basis) of all macromolecular species present. Generally, a substantially pure composition will comprise more than about 80 to 90 percent of all macromolecular species present in the composition. Most preferably, the object species is purified to essential homogeneity (contaminant species cannot be detected in the composition by conventional detection methods) wherein the composition consists essentially of a single macromolecular species. Solvent species, small molecules (<500 Daltons), and elemental ion species are not considered macromolecular species.
  • variable segment refers to a portion of a nascent peptide which comprises a random, pseudorandom, or defined kernel sequence.
  • a variable segment can comprise both variant and invariant residue positions, and the degree of residue variation at a variant residue position may be limited: both options are selected at the discretion of the practitioner.
  • variable segments are about 5 to 20 amino acid residues in length (e.g., 8 to 10), although variable segments may be longer and may comprise antibody portions or receptor proteins, such as an antibody fragment, a nucleic acid binding protein, a receptor protein, and the like.
  • wild-type means that the polynucleotide does not comprise any mutations.
  • a wild type protein means that the protein will be active at a level of activity found in nature and will comprise the amino acid sequence found in nature.
  • working as in “working sample”, for example, is simply a sample with which one is working.
  • a “working molecule” for example is a molecule with which one is working.
  • this invention describes a new method to sequence DNA.
  • the improvements over the existing DNA sequencing technologies are high speed, high throughput, no electrophoresis and gel reading artifacts due to the complete absence of an electrophoretic step, and no costly reagents involving various substitutions with stable isotopes.
  • the invention utilizes the Sanger sequencing strategy and assembles the sequence information by analysis of the nested fragments obtained by base-specific chain termination via their different molecular masses using mass spectrometry, as for example, MALDI or ES mass spectrometry.
  • a further increase in throughput can be obtained by introducing mass modifications in the oligonucleotide primer, chain-terminating nucleoside triphosphates and/or in the chain-elongating nucleoside triphosphates, as well as using integrated tag sequences which allow multiplexing by hybridization of tag specific probes with mass differentiated molecular weights.
  • the present invention pertains to a method for sequencing genomes.
  • the method comprises the steps of obtaining nucleic acid material from a genome. Then there is the step of constructing a clone library and one or more probe libraries from the nucleic acid material. Next there is the step of comparing the libraries to form comparisons. Then there is the step of combining the comparisons to construct a map of the clones relative to the genome. Next there is the step of determining the sequence of the genome by means of the map.
  • the present invention also pertains to a system for sequencing a genome.
  • the system comprises a mechanism for obtaining nucleic acid material from a genome.
  • the system also comprises a mechanism for constructing a clone library and one or more probe libraries.
  • the constructing mechanism is in communication with the nucleic acid material from a genome.
  • the system comprises a mechanism for comparing said libraries to form comparisons.
  • the comparing mechanism is in communication with the said libraries.
  • the system also comprises a mechanism for combining the comparisons to construct a map of the clones relative to the genome.
  • the said combining mechanism is in communication with the comparisons.
  • the system comprises a mechanism for determining the sequence of the genome by means of said map.
  • the said determining mechanism is in communication with said map.
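The comparing and combining steps can be pictured with a small Python sketch. It rests on a strong simplifying assumption that is not taken from the text: two clones are treated as overlapping whenever at least one probe hybridizes to both, and the "map" is reduced to groups of mutually linked clones (contigs). The function name `build_contigs` and the input layout are hypothetical.

```python
from collections import defaultdict

def build_contigs(hybridization):
    """Group clones into contigs; clones sharing at least one probe are taken to overlap.

    hybridization maps a clone name to the set of probe names that hybridize to it.
    """
    # Comparison step: index the clones by the probes that hit them.
    clones_by_probe = defaultdict(set)
    for clone, probes in hybridization.items():
        for probe in probes:
            clones_by_probe[probe].add(clone)

    # Combination step: union-find over clones linked by a shared probe.
    parent = {clone: clone for clone in hybridization}

    def find(clone):
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path halving
            clone = parent[clone]
        return clone

    for clones in clones_by_probe.values():
        first, *rest = sorted(clones)
        for other in rest:
            parent[find(other)] = find(first)

    contigs = defaultdict(list)
    for clone in hybridization:
        contigs[find(clone)].append(clone)
    return sorted(sorted(members) for members in contigs.values())
```

A real map would also have to order the clones within each contig and tolerate false-positive and false-negative hybridizations, which this sketch does not attempt.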
  • the present invention additionally pertains to a method for producing a gene of a genome.
  • a subclone path through the fragment is first identified; the collection of subclones that define this path is then sequenced using transposon-mediated direct sequencing techniques to an extent sufficient to provide the complete sequence of the fragment.
  • Improved techniques are provided for DNA sequencing, and particularly for sequencing of the entire human genome.
  • Different base-specific reactions are utilized to produce different sets of DNA fragments from a piece of DNA of unknown sequence.
  • Each of the different sets of DNA fragments has a common origin and terminates at a particular base along the unknown sequence.
  • the molecular weight of the DNA fragments in each of the different sets is detected by a matrix assisted laser absorption mass spectrometer to determine the sequence of the different bases in the DNA.
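How the measured molecular weights translate into a base sequence can be sketched in Python as follows. Because all fragments share a common origin, a heavier fragment is a longer one, so merging the four base-specific sets and sorting by mass yields the order of the terminal bases. The residue masses are approximate average values used only to fabricate illustrative input, and `primer_mass`, `simulate_fragment_masses`, and `read_sequence_from_masses` are hypothetical names; the small mass difference between a terminating and a normal nucleotide is ignored.

```python
# Approximate average masses (Da) of nucleotide residues within a DNA chain.
# They serve only to generate illustrative input for the sketch.
RESIDUE_MASS = {"A": 313.2, "C": 289.2, "G": 329.2, "T": 304.2}

def simulate_fragment_masses(sequence, primer_mass=5000.0):
    """Return, per terminal base, the masses of the nested fragments ending at that base."""
    masses_by_base = {base: [] for base in "ACGT"}
    running = primer_mass  # hypothetical mass of the common origin
    for base in sequence:
        running += RESIDUE_MASS[base]
        masses_by_base[base].append(round(running, 1))
    return masses_by_base

def read_sequence_from_masses(masses_by_base):
    """Merge the four base-specific sets and read the bases in order of increasing mass."""
    labelled = [
        (mass, base)
        for base, masses in masses_by_base.items()
        for mass in masses
    ]
    labelled.sort()  # lightest (shortest) fragment first
    return "".join(base for _, base in labelled)
```

In this idealized setting, reading back the simulated masses for a short sequence recovers that sequence; real spectra would add noise, limited resolution, and missing peaks.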
  • the methods and apparatus of the present invention provide a relatively simple and low cost technique which may be automated to sequence thousands of gene bases per hour, and eliminates the tedious and time consuming gel electrophoresis separation technique conventionally used to determine the masses of DNA fragments.
  • a new contiguous genome sequencing method is described which allows the contiguous sequencing of a very long DNA without the need for subcloning. It uses the basic PCR technique but circumvents the usual need of this technique for knowledge of two primers for contiguous sequencing, making knowledge of only one primer sufficient.
  • the present invention makes it possible to PCR amplify a DNA adjacent to a known sequence with which one primer can be made without the knowledge of the second primer binding site present in the unknown sequence.
  • the present invention could thus be used to contiguously sequence a very long DNA such as that contained in a YAC clone or a cosmid clone, without the need for subcloning smaller fragments, using the standard PCR technique. It can also be used to sequence a whole chromosome or genome without any need to subclone it.
  • Methods and means are provided for the massively parallel characterization of complex molecules and of molecular recognition phenomena with parallelism and redundancy attained through single molecule examination methods.
  • Applications include ultra-rapid genome sequencing, affinity characterization, pathogen characterization and detection means for clinical use and use in the development and construction of cybernetic immune systems.
  • Novel methods for single molecule examination and manipulation are provided, including scanned beam light microscopic means and methods, and detection means availing of optoelectronic array devices.
  • Various apparatus for rate control including stepping control for various reactions are combined with molecular recognition, signal amplification and single molecule examination methods. Inclusion of internal control in samples, algorithm-based dynamically responsive manipulation controls, and sample redundancy, are availed to provide an arbitrarily high degree of accuracy in final data.
  • the present invention relates to sequencing of DNA and is in the field of determining the nucleotide sequence of large segments of DNA. More specifically, the invention provides an improved method to obtain the complete nucleotide sequence of genomic DNA provided in fragments of over 30 kb.
  • the present invention pertains to a process for determining the DNA sequence of the genome of an organism. And more particularly, the invention relates to the sequencing of the entire human genome.
  • the present invention is related to constructing clone maps of organisms, and then using these maps to direct the sequencing effort.
  • the invention also pertains to systems that can effectively use this sequence and map information.
  • the invention relates to the massively parallel single molecule examination of associations or reactions between large numbers of first complex molecules, which may be diverse, and second single or plural probing molecules, which may or may not be diverse, with applications to biology, biotechnology, pharmacology, immunology, the novel field of cybernetic immunology, molecular evolution, cybernetic molecular evolution, genomics, comparative genomics, enzymology, clinical enzymology, pathology, medical research, and clinical medicine.
  • the present invention has applications in the area of polynucleotide sequence determination, including DNA sequencing.
  • DNA sequencing is still critically important in research and for genetic therapies and diagnostics (e.g., to verify recombinant clones and mutations).
  • DNA, a polymer of deoxyribonucleotides, is found in all living cells and some viruses.
  • DNA is the carrier of genetic information, which is passed from one generation to the next by homologous replication of the DNA molecule. Information for the synthesis of all proteins is encoded in the sequence of bases in the DNA.
  • DNA sequence information represents the information required for gene organization and regulation of most life forms. Accordingly, the development of reliable methodology for sequencing DNA has contributed significantly to an understanding of gene structure and function.
  • DNA sequencing is one of the most fundamental technologies in molecular biology and the life sciences in general. The ease and the rate by which DNA sequences can be obtained greatly affects related technologies such as development and production of new therapeutic agents and new and useful varieties of plants and microorganisms via recombinant DNA technology. In particular, unraveling the DNA sequence helps in understanding human pathological conditions including genetic disorders, cancer and AIDS.
  • the present invention relates to the field of nucleic acid analysis, detection, and sequencing. More specifically, in one embodiment the invention provides improved techniques for synthesizing arrays of nucleic acids, hybridizing nucleic acids, detecting mismatches in a double-stranded nucleic acid composed of a single-stranded probe and a target nucleic acid, and determining the sequence of DNA or RNA or other polymers.
  • a human being has 23 pairs of chromosomes consisting of a total of about 100,000 genes.
  • the human genome consists of those genes.
  • a single gene which is defective may cause an inheritable disease, such as Huntington's disease, Tay-Sachs disease or cystic fibrosis.
  • the human chromosomes consist of large organic linear molecules of double-strand DNA (deoxyribonucleic acid) with a total length of about 3.3 billion "base pairs".
  • the base pairs are the chemicals that encode information along DNA.
  • a typical gene may have about 30,000 base pairs.
  • the goal of mapping the human genome is to accurately determine the location and composition of each of the 3.3 billion bases.
  • the complexity and large scale of such a mapping has placed it, in terms of cost, effort and scientific potential of such projects, as one of the largest and most important projects of the 1990's and beyond.
  • DNA sequencing is a technique by which the order of the four DNA nucleotides (characters) in a linear DNA sequence is determined by chemical and biochemical means.
  • strategies for determining the nucleotide sequence of DNA involve the generation of a DNA substrate i.e., DNA fragments suitable for sequencing a region of the DNA, enzymatic or chemical reactions, and analysis of DNA fragments that have been separated according to their lengths to yield sequence information. More specifically, to sequence a given region of DNA, labeled DNA fragments are typically generated in four separate reactions.
  • the DNA fragments typically have one fixed end and one end that terminates sequentially at each of the four nucleotide bases, respectively.
  • the products of each reaction are fractionated by gel electrophoresis on adjacent lanes of a polyacrylamide gel.
  • the sequence of a given region of DNA can be determined from the four "ladders" of DNA fragments. The present status of techniques for determining such sequences is described in some detail in an article by Lloyd M. Smith published in the American Biotechnology Laboratory, Volume 7, Number 5, May 1989, pp 10-17.
  • the DNA strand is isotopically labeled on one end, broken down into smaller fragments at sequence locations ending with a particular nucleotide (A, T, C, or G) by chemical means, and the fragments ordered based on this information.
  • Base specific modifications result in a base specific cleavage of the radioactive or fluorescently labeled DNA fragment.
  • after the DNA substrate is end labeled, it is subjected to chemical reactions designed to cleave the DNA at positions adjacent to a given base or bases.
  • the labeled DNA fragments will, therefore, have a common labeled terminus while the unlabeled termini will be defined by the positions of chemical cleavage.
  • the cleavage reactions yield four sets of nested DNA fragments, which are separated by polyacrylamide gel electrophoresis (PAGE).
  • unlabeled DNA fragments can be separated after complete restriction digestion and partial chemical cleavage of the DNA, and hybridized with probes homologous to a region near the region of the DNA to be sequenced. See, Church et al., Proc. Natl. Acad. Sci., 81:1991 (1984). After autoradiography, the sequence can be read directly since each band (fragment) in the gel originates from a base specific cleavage event. Thus, the fragment lengths in the four "ladders" directly translate into a specific position in the DNA sequence.
2.2.2 Enzymatic/Sanger method for sequencing:
  • the four base specific sets of DNA fragments are formed by starting with a primer/template system elongating the primer into the unknown DNA sequence area and thereby copying the template and synthesizing complementary strands using a DNA polymerase in the presence of chain-terminating reagents.
  • the chain-terminating event is achieved by incorporating into the four separate reaction mixtures, in addition to the four normal deoxynucleoside triphosphates, dATP, dGTP, dTTP and dCTP, only one of the chain-terminating dideoxynucleoside triphosphates, ddATP, ddGTP, ddTTP or ddCTP, respectively, in a limiting small concentration.
  • incorporation of a ddNTP, which lacks a 3' hydroxyl group, into the growing strand leads to chain termination by preventing DNA polymerase from forming the next 3'-5' phosphodiester bond. Due to the random incorporation of the ddNTPs, each reaction leads to a population of base specific terminated fragments of different lengths, which all together represent the sequenced DNA molecule. The four sets of resulting fragments produce, after electrophoresis, four base specific ladders from which the DNA sequence can be determined.
  • the following basic steps are involved: (i) annealing an oligonucleotide primer to a suitable single or denatured double stranded DNA template; (ii) extending the primer with DNA polymerase in four separate reactions, each containing one labeled dNTP or ddNTP (alternatively a labeled primer can be used), a mixture of unlabeled dNTPs, and one chain-terminating dideoxynucleoside-5'-triphosphate (ddNTP); (iii) resolving the four sets of reaction products, which include a distribution of DNA fragments having primer-defined 5' termini and differing dideoxynucleotides at the 3' termini, on a high resolution polyacrylamide-urea gel; and (iv) producing an autoradiographic image of the gel that can be examined to infer the DNA sequence.
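The chain-termination steps can be pictured with a deterministic toy model in Python. Instead of simulating random ddNTP incorporation, the sketch simply enumerates every extension product that could terminate in each of the four reactions and then reads the four ladders from shortest to longest. The names `sanger_ladders` and `read_ladders`, and the convention of passing the template in the 3' to 5' direction, are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_ladders(template_3to5):
    """Return, per ddNTP reaction, the lengths of all possible terminated extension products.

    template_3to5 is the template read 3'->5' starting just past the primer,
    so the polymerase copies it from left to right.
    """
    ladders = {base: [] for base in "ACGT"}
    for length, template_base in enumerate(template_3to5, start=1):
        incorporated = COMPLEMENT[template_base]
        # In the reaction holding the matching ddNTP, some chains stop at this length.
        ladders[incorporated].append(length)
    return ladders

def read_ladders(ladders):
    """Read the synthesized strand 5'->3' by walking the four ladders from shortest to longest."""
    bands = sorted(
        (length, base)
        for base, lengths in ladders.items()
        for length in lengths
    )
    return "".join(base for _, base in bands)
```

Reading the ladders gives the sequence of the newly synthesized strand, which is the complement of the template supplied.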
  • fluorescently labeled primers or nucleotides can be used to identify the reaction products.
  • Known dideoxy sequencing methods utilize a DNA polymerase such as the Klenow fragment of E. coli DNA polymerase, a DNA polymerase from Thermus aquaticus, reverse transcriptase, a modified T7 DNA polymerase, or the Taq polymerase.
2.2.3 Similarities, differences and other details of the two methods:
  • the two sequencing methods differ in the techniques employed to produce the DNA fragments, but are otherwise similar.
  • in the Maxam-Gilbert method, four different base-specific reactions are performed on portions of the DNA molecules to be sequenced, to produce four sets of radiolabeled DNA fragments. These four fragment sets are each loaded in adjacent lanes of a polyacrylamide slab gel, and are separated by electrophoresis. Autoradiographic imaging of the pattern of the radiolabeled DNA bands in the gel reveals the relative size, corresponding to band mobilities, of the fragments in each lane, and the DNA sequence is deduced from this pattern.
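Deducing the sequence from the four-lane band pattern can be sketched in Python. The text does not name the four reactions; the sketch assumes the commonly used set (G, A+G, C+T, and C), and it further assumes that each band has already been converted to the position of the base whose cleavage produced it. The function name `read_maxam_gilbert` is hypothetical.

```python
def read_maxam_gilbert(lanes):
    """Deduce the base at each position from which lanes show a band there.

    lanes maps "G", "A+G", "C+T" and "C" to the set of band positions
    (counted from the labeled end) observed in that lane.
    """
    positions = sorted(set().union(*lanes.values()))
    calls = []
    for pos in positions:
        in_g, in_ag = pos in lanes["G"], pos in lanes["A+G"]
        in_c, in_ct = pos in lanes["C"], pos in lanes["C+T"]
        if in_g and in_ag:
            calls.append("G")   # band in both purine lanes
        elif in_ag:
            calls.append("A")   # band in A+G only
        elif in_c and in_ct:
            calls.append("C")   # band in both pyrimidine lanes
        elif in_ct:
            calls.append("T")   # band in C+T only
        else:
            calls.append("N")   # pattern not explained by the assumed reactions
    return "".join(calls)
```

The two-lane logic is what distinguishes this readout from the enzymatic method, where each lane corresponds to exactly one base.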
  • Both of these methods yield a population of molecules comprising a nested set which together may be analyzed to determine the base sequence of the sample. At least one of these two techniques is employed in essentially every laboratory concerned with molecular biology, and together they have been employed to sequence more than 26 million bases of DNA. Currently a skilled biologist can produce about 30,000 bases of finished DNA sequence per year under ideal conditions.
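The scale implied by these figures can be made concrete with a short back-of-the-envelope calculation in Python. It uses only numbers stated in the text (3.3 billion bases, 30,000 bases per biologist per year, 26 million bases sequenced to date) and assumes a single pass over the genome with no redundancy, which is an idealization.

```python
GENOME_BASES = 3.3e9               # length of the human genome as stated in the text
BASES_PER_BIOLOGIST_YEAR = 30_000  # finished sequence per skilled biologist per year
SEQUENCED_TO_DATE = 26e6           # bases sequenced so far with both methods

# Effort for one pass over the genome at the stated manual rate.
person_years = GENOME_BASES / BASES_PER_BIOLOGIST_YEAR

# Size of everything sequenced so far relative to one human genome.
fraction_of_genome = SEQUENCED_TO_DATE / GENOME_BASES

print(f"{person_years:,.0f} person-years for one pass over the genome")
print(f"{fraction_of_genome:.2%} of one genome length sequenced to date")
```

At the stated rate a single pass would take on the order of 110,000 person-years, and all sequence produced to date amounts to less than 1 percent of one genome length, which is the motivation for faster methods.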
  • the DNA to be sequenced has to be fragmented into sequencable pieces of currently not more than 500 to 1000 nucleotides.
  • this is a multi-step process involving cloning and subcloning steps using different and appropriate cloning vectors such as YAC, cosmids, plasmids and M13 vectors (Sambrook et al., Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, 1989).
  • the fragments of about 500 to 1000 base pairs are integrated into a specific restriction site of the replicative form I (RF I) of a derivative of the M13 bacteriophage (Vieira and Messing, Gene 19, 259 (1982)) and then the double-stranded form is transformed to the single-stranded circular form to serve as a template for the Sanger sequencing process having a binding site for a universal primer obtained by chemical DNA synthesis (Sinha, Biernat, McManus and Köster, Nucleic Acids Res. 12, 4539-57 (1984); U.S. Patent No. 4725677) upstream of the restriction site into which the unknown DNA fragment has been inserted.
  • a modified nucleotide compound possessing two properties particularly useful for purposes of the present invention has been described by N. Williams and P.S. Coleman.
  • This compound is 3'-O-(4-benzoyl)benzoyl adenosine 5'-triphosphate.
  • This nucleotide bears a 3' protecting group linked via an ester function which should be susceptible to hydrolysis by appropriate chemical treatments.
  • the protecting moiety is suitable for photoactivation, and this property was utilized by those investigators to probe the structure of mitochondrial F1-ATPase, indicating that this analog will interact properly with at least some enzymes. Under appropriate circumstances, the protecting moiety may also serve as a label.
  • in order to be able to read the sequence from PAGE, detectable labels have to be used in either the primer (very often at the 5'-end) or in one of the deoxynucleoside triphosphates, dNTP. Using radioisotopes such as 32P, 33P, or 35S is still the most frequently used technique. After PAGE, the gels are exposed to X-ray films and silver grain exposure is analyzed. The use of radioisotopic labeling creates several problems. Most labels useful for autoradiographic detection of sequencing fragments have relatively short half-lives which can limit the useful time of the labels. The emission of high energy beta radiation, particularly from 32P, can lead to breakdown of the products via radiolysis so that the sample should be used very quickly after labeling.
  • the fluorescent label can be tagged to the primer (Smith et al., Nature 321, 674-679 (1986) and EPO Patent No. 873 00998.9; Du Pont De Nemours EPO Application No. 03 59225; Ansorge et al., J. Biochem. Biophys. Methods 13, 325-32 (1986)) or to the chain-terminating dideoxynucleoside triphosphates (Prober et al., Science 238, 336-41 (1987); Applied Biosystems, PCT Application WO 91/05060).
  • Of particular interest in DNA sequencing are methods of automated sequencing, in which fluorescent labels are employed to label the size separated fragments or primer extension products of the enzymatic method. Currently, three different methods are used for automated DNA sequencing. In the first method, the DNA fragments are labeled with one fluorophore and then run in adjacent sequencing lanes, one lane for each base. See Ansorge et al., Nucleic Acids Res. (1987) 15:4593-4602. In the second method, the DNA fragments are labeled with oligonucleotide primers tagged with four fluorophores and all of the fragments are run in one lane. See Smith et al., Nature (1986) 321:674-679.
  • in the third method, each of the different chain terminating dideoxynucleotides is labeled with a different fluorophore and all of the fragments are run in one lane. See Prober et al., Science (1987) 238:336-341.
  • the first method has the potential problems of lane-to-lane variations as well as a low throughput.
  • the second and third methods require that the four dyes be well excited by one laser source, and that they have distinctly different emission spectra. Otherwise, multiple lasers have to be used, increasing the complexity and the cost of the detection instrument.
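Why distinct emission spectra matter for the one-lane methods can be illustrated with a minimal base-calling sketch in Python. Each peak is represented by four channel intensities, one per dye, and a base is called only when the strongest channel clearly dominates. The channel order in `DYE_TO_BASE`, the `min_ratio` threshold, and the function name `call_bases` are all hypothetical choices for illustration, not details from the cited instruments.

```python
DYE_TO_BASE = ("A", "C", "G", "T")  # hypothetical assignment of dye channels to bases

def call_bases(peaks, min_ratio=1.5):
    """Call one base per peak from four emission-channel intensities.

    peaks is a list of 4-tuples, one intensity per dye channel. A peak is called
    only when its strongest channel is at least min_ratio times the runner-up;
    otherwise "N" is emitted, mimicking dyes whose emission spectra overlap.
    """
    calls = []
    for intensities in peaks:
        ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
        best, second = ranked[0], ranked[1]
        if intensities[best] > 0 and intensities[best] >= min_ratio * intensities[second]:
            calls.append(DYE_TO_BASE[best])
        else:
            calls.append("N")
    return "".join(calls)
```

When two dyes emit at similar wavelengths, their channels rise together and the ratio test fails, which is the situation the requirement for distinctly different emission spectra is meant to avoid.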
  • the second method produces robust sequencing data in currently commercially available sequencers.
  • the second method is not entirely satisfactory. In the second method, all of the false terminated or false stop fragments are detected resulting in high backgrounds. Furthermore, with the second method it is difficult to obtain accurate sequences for DNA templates with long repetitive sequences. See Robbins et al., Biotechniques (1996) 20: 862-868.
  • the third method has the advantage of only detecting DNA fragments incorporated with a terminator. Therefore, backgrounds caused by the detection of false stops are not detected. However, the fluorescence signals offered by the dye-labeled terminators are not very bright and it is still tedious to completely clear up the excess of dye-terminators even with AmpliTaq DNA Polymerase (FS enzyme). Furthermore, non-sequencing fragments are detected, which contributes to background signal. Applied Biosystems Model 373 A DNA Sequencing System User Bulletin, November 17, P3, August 1990.
  • Such methodology would ideally include a means for isolating the DNA sequencing fragments from the remaining components of the sequencing reaction mixtures such as salts, enzymes, excess primers, template and the like, as well as false stopped sequencing fragments and non-sequencing fragments resulting from contaminated RNA and nicked DNA templates.
  • the primer extension products synthesized on the immobilized template strand are purified of enzymes, other sequencing reagents and by-products by a washing step and then released under denaturing conditions by breaking the hydrogen bonds between the Watson-Crick base pairs and subjected to PAGE separation.
  • the primer extension products (not the template) from a DNA sequencing reaction are bound to a solid support via biotin/avidin (Du Pont De Nemours, PCT Application WO 91/11533).
  • the interaction between biotin and avidin is overcome by employing denaturing conditions (formamide EDTA) to release the primer extension products of the sequencing reaction from the solid support for PAGE separation.
  • beads e. g., magnetic beads (Dynabeads) and Sepharose beads
  • filters e.g., capillaries, plastic dipsticks (e.g., polystyrene strips) and microt
  • CZE capillary zone electrophoresis
  • PAGE slab gel electrophoresis
  • the enzymes used and the DNA are held in place by solid phases (DEAE-Sepharose and Sepharose) either by ionic interactions or by covalent attachment.
  • the amount of pyrophosphate is determined via bioluminescence (luciferase).
  • a synthesis approach to DNA sequencing is also used by Tsien et al. (PCT Application No. WO 91/06678).
  • the incoming dNTP's are protected at the 3'-end by various blocking groups such as acetyl or phosphate groups, which are removed before the next elongation step; this makes the process very slow compared to standard sequencing methods.
  • the template DNA is immobilized on a polymer support.
  • a fluorescent or radioactive label is additionally incorporated into the modified dNTP's.
  • PCT Application No. WO 91/06678 also describes an apparatus designed to automate the sequencing process.
  • Mass spectrometry is a well known analytical technique which can provide fast and accurate molecular weight information on relatively complex mixtures of organic molecules. Mass spectrometry has historically had neither the sensitivity nor resolution to be useful for analyzing mixtures at high mass. A series of articles in 1988 by Hillenkamp and Karas do suggest that large organic molecules of about 10,000 to 100,000 Daltons may be analyzed in a time of flight mass spectrometer, although resolution at lower molecular weights is not as sharp as conventional magnetic field mass spectrometry. Moreover, the Hillenkamp and Karas technique is very time-consuming, and requires complex and costly instrumentation.
  • Mass spectrometry, in general, provides a means of "weighing" individual molecules by ionizing the molecules in vacuo and making them "fly" by volatilization.
  • Under the influence of combinations of electric and magnetic fields, the ions follow trajectories depending on their individual mass (m) and charge (z). In the range of molecules with low molecular weight, mass spectrometry has long been part of the routine physical-organic repertoire for analysis and characterization of organic molecules by the determination of the mass of the parent molecular ion. In addition, by arranging collisions of this parent molecular ion with other particles (e.g., argon atoms), the molecular ion is fragmented forming secondary ions by the so-called collision induced dissociation (CID). The fragmentation pattern/pathway very often allows the derivation of detailed structural information.
  • CID collision induced dissociation
  • "sequencing" has been limited to low molecular weight synthetic oligonucleotides by determining the mass of the parent molecular ion and, through this, confirming the already known sequence, or alternatively, confirming the known sequence through the generation of secondary ions (fragment ions) via CID in an MS/MS configuration utilizing, in particular, for the ionization and volatilization, the method of fast atom bombardment (FAB mass spectrometry) or plasma desorption (PD mass spectrometry).
  • FAB mass spectrometry fast atom bombardment
  • PD mass spectrometry plasma desorption
  • ES mass spectrometry was introduced by Fenn et al. (J. Phys. Chem. 88, 4451-59 (1984); PCT Application No. WO 90/14148) and current applications are summarized in recent review articles (R.D. Smith et al., Anal. Chem. 62, 882-89 (1990) and B. Ardrey, Electrospray Mass Spectrometry, Spectroscopy Europe 4, 10-18 (1992)).
  • MALDI mass spectrometry, in contrast, can be particularly attractive when a time-of-flight (TOF) configuration is used as a mass analyzer.
  • TOF time-of-flight
  • the MALDI-TOF mass spectrometry has been introduced by Hillenkamp et al. ("Matrix Assisted UV-Laser Desorption/Ionization: A New Approach to Mass Spectrometry of Large Biomolecules," Biological Mass Spectrometry (Burlingame and McCloskey, editors), Elsevier Science Publishers, Amsterdam, pp. 49-60, 1990). Since, in most cases, no multiple molecular ion peaks are produced with this technique, the mass spectra, in principle, look simpler compared to ES mass spectrometry.
  • RNA transcripts extended by DNA, both of which are complementary to the DNA to be sequenced, are prepared by incorporating NTP's, dNTP's and, as terminating nucleotides, ddNTP's which are substituted at the 5'-position of the sugar moiety with one or a combination of the isotopes 12C, 13C, 14C, 1H, 2H, 3H, 16O, 17O and 18O.
  • the polynucleotides obtained are degraded to 3'-nucleotides, cleaved at the N-glycosidic linkage and the isotopically labeled 5'-functionality removed by periodate oxidation and the resulting formaldehyde species determined by mass spectrometry.
  • a specific combination of isotopes serves to discriminate base-specifically between internal nucleotides originating from the incorporation of NTPs and dNTP's and terminal nucleotides caused by linking ddNTP's to the end of the polynucleotide chain.
  • RNA/DNA fragments A series of RNA/DNA fragments is produced, and in one embodiment, separated by electrophoresis, and, with the aid of the so-called matrix method of analysis, the sequence is deduced.
1.2.8.4 Mass spectrometry using atoms which normally do not occur in DNA
  • the sequencing reaction mixtures are separated by an electrophoretic technique such as CZE, transferred to a combustion unit in which the sulfur isotopes of the incorporated ddNTP's are transformed at about 900°C in an oxygen atmosphere.
  • the SO2 generated, with masses of 64, 65, 66 or 68, is determined on-line by mass spectrometry using, e.g., as mass analyzer, a quadrupole with a single ion-multiplier to detect the ion current.
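The four SO2 masses quoted above follow from simple arithmetic on nominal isotope masses. The sketch below is only an illustration and assumes that the combustion oxygen is 16O and that the sulfur labels are the four stable isotopes 32S, 33S, 34S and 36S; the function names are hypothetical.

```python
# Nominal (integer) mass numbers. Assumption: the oxygen atmosphere supplies only 16O.
SULFUR_ISOTOPES = (32, 33, 34, 36)
OXYGEN_16 = 16


def so2_nominal_mass(sulfur_mass_number: int) -> int:
    """Nominal mass of SO2 built from one sulfur isotope and two 16O atoms."""
    return sulfur_mass_number + 2 * OXYGEN_16


def so2_mass_table() -> dict:
    """Map each stable sulfur isotope to the nominal mass of its SO2."""
    return {s: so2_nominal_mass(s) for s in SULFUR_ISOTOPES}


if __name__ == "__main__":
    for sulfur, mass in so2_mass_table().items():
        print(f"{sulfur}S gives SO2 of nominal mass {mass}")
```

Each sulfur isotope therefore maps to a distinct detector channel, which is what allows one isotope to be assigned per base-specific terminator.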
  • EPO Patent Applications No. 0360676 A1 and 0360677 A1 also describe Sanger sequencing using stable isotope substitutions in the ddNTP's such as D, 13C, 15N, 17O, 18O, 32S, 33S, 34S, 36S, 19F, 35Cl, 37Cl, 79Br, 81Br and 127I or functional groups such as CF3 or Si(CH3)3 at the base, the sugar or the alpha position of the triphosphate moiety according to chemical functionality.
  • the Sanger sequencing reaction mixtures are separated by tube gel electrophoresis.
  • the effluent is converted into an aerosol by the electrospray/thermospray nebulizer method and then atomized and ionized by a hot plasma (7000 to 8000 K) and analyzed by a simple mass analyzer.
  • An instrument is proposed which enables one to automate the analysis of the Sanger sequencing reaction mixture consisting of tube electrophoresis, a nebulizer and a mass analyzer.
  • PCT patent Publication No. 92/10588, incorporated herein by reference for all purposes, describes one improved technique in which the sequence of a labeled, target nucleic acid is determined by hybridization to an array of nucleic acid probes on a substrate. Each probe is located at a positionally distinguishable location on the substrate. When the labeled target is exposed to the substrate, it binds at locations that contain complementary nucleotide sequences. Through knowledge of the sequence of the probes at the binding locations, one can determine the nucleotide sequence of the target nucleic acid. The technique is particularly efficient when very large arrays of nucleic acid probes are utilized.
  • nucleic acid probes are of a length shorter than the target
  • a reconstruction technique to determine the sequence of the larger target based on affinity data from the shorter probes.
  • One technique for overcoming this difficulty has been termed sequencing by hybridization or SBH. For example, assume that a 12-mer target DNA 5'-AGCCTAGCTGAA is mixed with an array of all octanucleotide probes.
  • If the target binds only to those probes having an exactly complementary nucleotide sequence, only five of the 65,536 octamer probes (3'-TCGGATCG, CGGATCGA, GGATCGAC, GATCGACT, and ATCGACTT) will hybridize to the target. Alignment of the overlapping sequences from the hybridizing probes reconstructs the complement of the original 12-mer target: 3'-TCGGATCGACTT.
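The 12-mer example can be checked with a short sketch. It assumes perfect-match hybridization only and a unique overlap chain, which holds for this target but not for sequences with repeats. The function names are hypothetical, and all sequences are written left to right in the 3' orientation used for the probes above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)


def hybridizing_probes(target: str, k: int = 8) -> list:
    """All k-mers of the complementary strand, i.e. the exact-match probes."""
    comp = complement(target)
    return [comp[i:i + k] for i in range(len(comp) - k + 1)]


def reconstruct(probes: list) -> str:
    """Chain probes overlapping by k-1 bases (assumes the chain is unique)."""
    k = len(probes[0])
    suffixes = {p[1:] for p in probes}
    by_prefix = {p[:k - 1]: p for p in probes}
    # The first probe is the one whose prefix is not the suffix of any probe.
    current = next(p for p in probes if p[:k - 1] not in suffixes)
    sequence = current
    while current[1:] in by_prefix and len(sequence) < len(probes) + k - 1:
        current = by_prefix[current[1:]]
        sequence += current[-1]
    return sequence


if __name__ == "__main__":
    probes = hybridizing_probes("AGCCTAGCTGAA")
    print(len(probes), "of", 4 ** 8, "octamers hybridize")
    print(reconstruct(probes))
```

The count of 65,536 is simply 4^8, the number of distinct octanucleotides.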
  • DNA can be amplified by a variety of procedures including cloning (Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1989), polymerase chain reaction (PCR) (C.R. Newton and A. Graham, PCR, BIOS Publishers, 1994), ligase chain reaction (LCR) (F. Barany, Proc. Natl. Acad. Sci. USA 88, 189-93 (1991)), strand displacement amplification (SDA) (G. Terrance Walker et al., Nucleic Acids Res. 22, 2670-77 (1994)) and variations such as RT-PCR, allele-specific amplification (ASA) etc.
  • PCR polymerase chain reaction
  • LCR ligase chain reaction
  • SDA strand displacement amplification
  • the polymerase chain reaction (Mullis, K. et al., Methods Enzymol., 155:335-350, 1987) permits the selective in vitro amplification of a particular DNA region by mimicking the phenomena of in vivo DNA replication.
  • Required reaction components are single stranded DNA, primers (oligonucleotide sequences complementary to the 5' and 3' ends of a defined sequence of the DNA template), deoxynucleotidetriphosphates and a DNA polymerase enzyme.
  • the single stranded DNA is generated by heat denaturation of provided double strand DNA.
  • the reaction buffers contain magnesium ions and co-solvents for optimum enzyme stability and activity.
  • the amplification results from a repetition of such cycles in the following manner:
  • the two different primers which bind selectively each to one of the complementary strands, are extended in the first cycle of amplification.
  • Each newly synthesized DNA then contains a binding site for the other primer. Therefore each new DNA strand becomes a template for any further cycle of amplification enlarging the template pool from cycle to cycle.
  • Repeated cycles theoretically lead to exponential synthesis of a DNA fragment with a length defined by the 5' termini of the primers.
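The theoretical exponential growth described in the cycles above can be put in numbers. This is an idealized sketch only: the per-cycle `efficiency` parameter is an assumption introduced here for illustration (1.0 means perfect doubling), and real reactions plateau as reagents are consumed.

```python
def copies_after_cycles(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized template count: each cycle multiplies the pool by (1 + efficiency)."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, "cycles:", copies_after_cycles(1, n))
```

With perfect doubling, a single starting molecule yields 2^20 (about one million) copies after 20 cycles.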
  • the PCR amplification procedure has been used to sequence the DNA being amplified (e.g. "Introduction to the AmpliTaq Cycle Sequencing Kit Protocol", a booklet from Perkin Elmer Cetus Corporation).
  • the DNA could be first amplified and then it could be sequenced using the two conventional DNA sequencing techniques.
  • Modified methods for sequencing PCR-amplified DNA have also been developed (e.g. Bevan et al., "Sequencing of PCR-Amplified DNA" PCR Meth. App. 4:222 (1992)).
1.2.11 Additional Sequencing Methods
  • a recent modification of the Sanger sequencing strategy involves the degradation of phosphorothioate-containing DNA fragments obtained by using alpha-thio dNTP instead of the normally used ddNTPs during the primer extension reaction mediated by DNA polymerase (Labeit et al., DNA 5, 173-177 (1986); Amersham, PCT-Application GB86/00349; Eckstein et al., Nucleic Acids Res. 16, 9947 (1988)).
  • the four sets of base-specific sequencing ladders are obtained by limited digestion with exonuclease III or snake venom phosphodiesterase, subsequent separation on PAGE and visualization by radioisotopic labeling of either the primer or one of the dNTPs.
  • the base-specific cleavage is achieved by alkylating the sulphur atom in the modified phosphodiester bond followed by a heat treatment (Max-Planck-Gesellschaft, DE 3930312 A1). Both methods can be combined with the amplification of the DNA via the Polymerase Chain Reaction (PCR).
  • thermolabile DNA polymerase must be continually added to the reaction mixture after each denaturation cycle.
  • Major advances in PCR practice were the development of a polymerase which is stable at near-boiling temperature (Saiki, R. et al., Science 239:487-491, 1988) and the development of automated thermal cyclers.
  • thermostable polymerases also allowed modification of the Sanger sequencing reaction with significant advantages.
  • the polymerization reaction could be carried out at high temperature with the use of thermostable DNA polymerase in a cyclic manner (cycle sequencing).
  • the conditions of the cycles are similar to those of the PCR technique and comprise denaturation, annealing, and extension steps. Depending on the length of the primers only one annealing step at the beginning of the reaction may be sufficient.
  • Carrying out a sequencing reaction at high temperature in a cyclic manner provides the advantage that each DNA strand can serve as template in every new cycle of extension which reduces the amount of DNA necessary for sequencing, thereby providing access to minimal volumes of DNA, as well as resulting in improved specificity of primer hybridization at higher temperature and the reduction of secondary structures of the template strand.
  • amplification of the terminated fragments is linear in conventional cycle sequencing approaches.
  • a recently developed method, called semi-exponential cycle sequencing, shortens the time required and increases the extent of amplification obtained from conventional cycle sequencing by using a second reverse primer in the sequencing reaction.
  • the reverse primer only generates additional template strands if it avoids being terminated prior to reaching the sequencing primer binding site. Terminated fragments generated by the reverse primer cannot serve as a sufficient template. Therefore, in practice, amplification by the semi-exponential approach is not entirely exponential (Sarkar, G. and Bolander, Mark E., Semi Exponential Cycle Sequencing, Nucleic Acids Research, 1995, Vol. 23, No. 7, p. 1269-1270).
1.2.11.4 Need to facilitate high-throughput sequencing
  • current nucleic acid sequencing methods require relatively large amounts (typically about 1 µg) of highly purified DNA template. Often, however, only a small amount of template DNA is available. Although amplifications may be performed, amplification procedures are typically time consuming, can be limited in the amount of amplified template produced, and the amplified DNA must be purified prior to sequencing. A streamlined process for amplifying and sequencing DNA is needed, particularly to facilitate high-throughput nucleic acid sequencing.
1.2.12 Strategies for obtaining the initial sequence
  • the DNA is fragmented into smaller, overlapping fragments, and subcloned to produce numerous clones containing overlapping DNA sequences. These clones are sequenced randomly and the sequences assembled by "overlap sequence-matching" to produce the contiguous sequence.
  • in the shotgun sequencing method, approximately ten times more sequencing than the length of the DNA being sequenced is required to assemble the contiguous sequence. Shotgun sequencing is reasonably appropriate for generating the initial sequences of the genomic clone.
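The tenfold redundancy translates directly into a read count. The sketch below is a rough planning aid, not part of the described method: the read length is a hypothetical input, and the uncovered-fraction estimate assumes reads land uniformly at random (a Poisson approximation), which real subclone libraries only approximate.

```python
import math


def reads_needed(target_length_bp: int, read_length_bp: int, redundancy: float = 10.0) -> int:
    """Number of reads whose total length is `redundancy` times the target length."""
    if target_length_bp <= 0 or read_length_bp <= 0:
        raise ValueError("lengths must be positive")
    return math.ceil(redundancy * target_length_bp / read_length_bp)


def uncovered_fraction(redundancy: float) -> float:
    """Expected fraction of bases hit by no read, under the Poisson approximation."""
    return math.exp(-redundancy)


if __name__ == "__main__":
    # Hypothetical example: a 40 kb genomic clone sequenced with 500-base reads.
    print(reads_needed(40_000, 500), "reads")
    print(f"expected uncovered fraction: {uncovered_fraction(10.0):.2e}")
```

Under these assumptions, lower redundancy leaves more of the target uncovered, which is one way to see why gap filling becomes necessary.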
  • the clone is digested with a multiplicity of restriction enzymes and the individual fragments are sequenced.
  • the shotgun strategy relies on assembly algorithms to piece together a final sequence by determining relationships between a selected set of random templates. Although this assembly process is semiautomated, it remains labor-intensive, especially in complex regions that contain highly related tandem repeats. In addition, since the selection of subclones is not random, gaps of unknown distance are included between islands of known sequence. Linking up the islands requires either sequencing additional subclones or ordering custom oligonucleotides to generate sequence into the gaps. The weaknesses of shotgun sequencing performed on substantial lengths of nucleotide sequence are thus 1) the difficulties involved in sequence assembly and 2) the need for hole-filling.
  • a non-ordered approach to sequencing would require the generation of 100 to 200 million DNA templates.
  • DNA substrate generation e.g., restriction mapping, preparation of subfragments for subcloning, identification of subclones, growing bacterial cultures, and purifying nucleic acids.
  • Current approaches, therefore, are less than optimal for the large scale sequencing of DNA, particularly sequencing the human genome.
  • the transposon-mediated sequencing method described by Strathmann, M. et al. Proc Natl Acad Sci USA (1991) 88:1247- 1250 provides an orderly approach to generating subclones for sequencing.
  • the method uses a γδ bacterial transposable element bracketed by sequencing primers.
  • the primer-flanked transposon permits the introduction of evenly spaced priming sites across a fragment with an unknown DNA sequence.
  • the number of template sequences required to obtain the complete sequence information can be calculated from the length of the fragment.
  • the linear order of the DNA clones has to be first determined by "physical mapping" of the clones.
  • the positions of the insertions are mapped, for example, using the polymerase chain reaction (PCR) using primers that amplify the intervening sequence between the transposon insertion site and the vector sequences at each end of the inserted fragment to be sequenced.
  • the lengths of the amplified products thus define a map position for the transposon.
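The mapping step can be sketched as converting each PCR product length into a coordinate and sorting the insertions. This is an illustration under stated assumptions: the flank offsets (distance from the vector primer to the insert boundary, and from the transposon primer to the transposon end) are hypothetical parameters, and gel-based length estimates are treated as exact.

```python
def map_position(product_length_bp: int, vector_flank_bp: int = 0, transposon_flank_bp: int = 0) -> int:
    """Insertion coordinate within the fragment, measured from one vector end."""
    return product_length_bp - vector_flank_bp - transposon_flank_bp


def order_insertions(product_lengths: dict, vector_flank_bp: int = 0, transposon_flank_bp: int = 0) -> list:
    """Return (clone, position) pairs sorted along the fragment."""
    positions = {
        clone: map_position(length, vector_flank_bp, transposon_flank_bp)
        for clone, length in product_lengths.items()
    }
    return sorted(positions.items(), key=lambda item: item[1])


def spacings(ordered: list) -> list:
    """Distances between neighbouring insertions, useful for spotting uncovered stretches."""
    return [b[1] - a[1] for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    # Hypothetical product lengths (bp) for three transposon-bearing clones.
    ordered = order_insertions({"c1": 2100, "c2": 600, "c3": 1350}, vector_flank_bp=100)
    print(ordered)
    print(spacings(ordered))
```

Choosing clones whose spacing stays below the usable read length gives the ordered, gap-free template set that the method relies on.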
  • Sequencing can be conducted based on the sequencing primers flanking the transposon, and since the position of the transposon has been mapped prior to sequencing, a fully automated assembly process is possible. There are no gaps since an ordered set of sequencing templates which cover the DNA fragment is produced.
  • transposon sequencing can only be used on fragments containing 2-5 kb; preferably 3-4 kb.
  • smaller subclones of the original fragment must be generated and organized into an ordered overlapping set.
  • the shotgun strategy is not completely appropriate for this purpose.
  • Dog-tagging is a "walking" process, a contiguous DNA sequencing method called the "primer-walking" method, which uses Sanger's DNA polymerase enzymatic sequencing procedure and scans through a 30-hit subclone library for sequences that are near the end of the last walking step. It is labor-intensive and does not always succeed.
  • the DNA copying has to occur always from the template DNA during DNA sequencing.
  • the target DNA amplified in the first rounds from the original input template DNA will function as the template DNA in subsequent cycles of amplification.
  • the DNA sequencing reaction will be started by adding the sequencing "cocktail".
  • in the PCR reaction, only one copy of template DNA is theoretically sufficient to amplify into millions of copies, and therefore very little genomic (or template) DNA is sufficient for sequencing.
  • the advantage of DNA amplification that exists in PCR is lacking in the conventional Sanger procedure. Thus, this primer-walking method will require a larger amount of template DNA compared to the PCR sequencing method.
  • the sequencing gel pattern may not be as clean as in a PCR procedure when a very long DNA is being sequenced. This may limit the length of the DNA that could be contiguously sequenced without breaking the DNA, using the primer-walking procedure.
  • the PCR method also enables the reduction of non-specific binding of the primers to the template DNA because the enzymes used in these protocols function at high-temperatures, and thus allow "stringent" reaction conditions to be used to improve sequencing.
  • the present method of contiguous DNA sequencing using the basic PCR technique has thus many advantages over the primer walking method. Also, so far no method exists for contiguously sequencing a very long DNA using PCR technique.
  • the present invention thus offers a unique and very advantageous procedure for contiguous DNA sequencing.
  • the present invention provides a method for contiguous sequencing of very long DNA using a modification of the standard PCR technique without the need for breaking down and subcloning the long DNA.
  • PCR technique enables the amplification of DNA which lies between two regions of known sequence (K. B. Mullis et al., U.S. Pat. Nos. 4,683,202; 7/1987; 435/91; and 4,683,195, 7/1987; 435/6). Oligonucleotides complementary to these known sequences at both ends serve as "primers" in the PCR procedure. Double stranded target DNA is first melted to separate the DNA strands, and then oligonucleotide (oligo) primers complementary to the ends of the segment which is desired to be amplified are annealed to the template DNA.
  • oligo oligonucleotide
  • the oligos serve as primers for the synthesis of new complementary DNA strands, using a DNA polymerase enzyme and a process known as primer extension.
  • the orientation of the primers with respect to one another is such that the 5' to 3' extension product from each primer contains, when extended far enough, the sequence which is complementary to the other oligo.
  • each newly synthesized DNA strand becomes a template for synthesis of another DNA strand beginning with the other oligo as primer.
  • Repeated cycles of melting, annealing of oligo primers, and primer extension lead to a (near) doubling, with each cycle, of DNA strands containing the sequence of the template beginning with the sequence of one oligo and ending with the sequence of the other oligo.
  • the key requirement for this exponential increase of template DNA is the two oligo primers complementary to the ends of the sequence desired to be amplified, and oriented such that their 3' extension products proceed toward each other. If the sequence at both ends of the segment to be amplified is not known, complementary oligos cannot be made and standard PCR cannot be performed.
  • the object of the present invention is to overcome the need for sequence information at both ends of the segment to be amplified, i.e. to provide a method which allows PCR to be performed when sequence is known for only a single region, and to provide a method for the contiguous sequencing of a very long DNA without the need for subcloning of the DNA.
  • Amplifying and sequencing using the PCR procedure requires that the sequences at the ends of the DNA (the two primer sequences) be known in advance. Thus, this procedure is limited in utility, and cannot be extended to contiguously sequence a long DNA strand. If knowledge of only one primer were sufficient, with nothing known about the other primer, it would be greatly advantageous for sequencing very long DNA molecules using the PCR procedure. It would then be possible to use such a method for contiguously sequencing a long genomic DNA without the need for subcloning it into smaller fragments, knowing only the very first, beginning primer in the whole long DNA.
1.2.12.5 Large-scale sequencing through the generation of a subclone path
  • the present invention provides a large-scale sequencing method which combines an efficient method to generate a subclone path through the large original fragment, such as a genomic clone, wherein the subclones are accessible to transposon sequencing, with sequencing of these subclones using the transposon method.
  • a primary goal of the human genome project is to determine the entire DNA sequence for the genomes of human, model, and other useful organisms.
  • a related goal is to construct ordered clone maps of DNA sequences at 100 kilobase (kb) resolution for these organisms (D. R. Cox, E. D. Green, E. S. Lander, D. Cohen, and R. M. Myers, "Assessing mapping progress in the Human Genome Project," Science, vol. 265, no. 5181, pp. 2031-2, 1994), incorporated by reference.
  • Integrated maps that localize clones together with polymorphic genetic markers J. Weber and P.
  • Mapping techniques include restriction enzyme analysis of genetic material, and the hybridization and detection of specific oligonucleotides which test for the presence or absence of particular alleles or loci, and may further be used to gain spatial information about the occurrence of their targets when appropriate analytic techniques are subsequently applied. Note that such characterizations presently are methodologically and operationally distinct from other processes comprehended within the biotechnological and related arts.
  • Human DNA sequences now exist as genomic libraries in a variety of small- and large-insert capacity cloning vectors, with yeast artificial chromosomes (YACs) (D. T. Burke, G. F. Carle, and M. V.
  • the starting point for an effective sequencing method is a complete ordered clone map of a genome.
  • Current strategies for ordering clones build contiguous sequences (contigs) using short-range comparison data.
  • Sequence-tagged site (STS) M. Olson, L. Hood, C. Cantor, and D. Botstein, "A common language for physical mapping of the human genome," Science, vol. 245, pp. 1434-35, 1989
  • SCM STS-content mapping
  • E. D. Green and P. Green "Sequence-tagged site (STS) content mapping of human chromosomes: theoretical considerations and early experiences," PCR Methods and Applications, vol. 1, pp.
  • RH mapping (D. R. Cox, M. Burmeister, E. R. Price, S. Kim, and R. M. Myers, "Radiation hybrid mapping: a somatic cell genetic method for constructing high-resolution maps of mammalian chromosomes," Science, vol. 250, pp. 245-250, 1990), incorporated by reference, has been used to localize small DNA sequences (though not clones) into high-resolution bins. Relatively few PCR experiments with one 96-well plate library of RHs generally suffice for mapping STSs or genes to unique bins having 250 kb to 1 Mb average resolution.
  • Inner product mapping is a hybridization-based method for achieving high-throughput, high-resolution RH mapping of clones (M. W. Perlin and A. Chakravarti, "Efficient construction of high-resolution physical maps from yeast artificial chromosomes using radiation hybrids: inner product mapping," Genomics, vol. 18, pp. 283-289, 1993), incorporated by reference, that overcomes this barrier.
  • Experimental data have established that IPM is a highly rapid, inexpensive, accurate, and precise large-scale long-range mapping method, particularly when preexisting RH maps are available, and that IPM can replace or complement more conventional short-range mapping methods.
1.2.13.6 Obtaining improved mapping results
  • mapping results can be obtained incrementally by gradually enlarging the data tables, a process which provides useful feedback to both experimentation and analysis. With additional RHs, the signal-to-noise characteristics of the clone profiles improve. This incremental process, and the relatively few RHs required for accurate mapping, follow from the logarithmic number of probes needed for IPM. For best mapping results, as many STS-typed RHs as feasible are used: with currently available high-throughput, robotically-assisted hybridization methods, the localization benefits of performing many filter hybridizations outweigh the relatively low experimentation costs.
  • IPM builds accurate maps from low-confidence data.
  • IPM's partitioning of the experiments into two data tables of (A) clones vs. RHs and (B) RHs vs. STSs also partitions the data noise.
  • Table B is formed from relatively noiseless PCR-based comparisons of STSs against RH DNA, and can thus accurately order and position the STS bins using combinatorial mapping procedures (M. Boehnke, "Radiation hybrid mapping by minimization of the number of obligate chromosome breaks," Genetic Analysis Workshop 7: Issues in Gene Mapping and the Detection of Major Genes. Cytogenet Cell Genet, vol. 59, pp. 96-98, 1992; M. Boehnke, K. Lange, and D.
  • Table A is formed from inherently unreliable and inconsistently replicated hybridizations of complex RH probes against gridded filters.
  • Inner product mapping uses the table B data matrix to ameliorate these data errors and robustly translate a clone's noisy RH signature vector (a row of table A) into a chromosomal profile, whose peak bins the clone.
1.2.13.8 Mapping YACs using IPM
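The inner-product step described for IPM can be sketched with two small tables. This is a minimal, unnormalized illustration: the 0/1 scoring, the table sizes and the function names are assumptions made here, and a practical implementation would likely weight or normalize the profile.

```python
def chromosomal_profile(clone_signature: list, table_b: list) -> list:
    """Inner product of one row of table A (clone vs. RHs) with table B (RHs vs. STS bins)."""
    if len(clone_signature) != len(table_b):
        raise ValueError("signature length must equal the number of RHs in table B")
    n_bins = len(table_b[0])
    return [
        sum(clone_signature[rh] * table_b[rh][sts] for rh in range(len(table_b)))
        for sts in range(n_bins)
    ]


def bin_clone(clone_signature: list, table_b: list) -> int:
    """Index of the STS bin at which the profile peaks (first bin on ties)."""
    profile = chromosomal_profile(clone_signature, table_b)
    return max(range(len(profile)), key=profile.__getitem__)


if __name__ == "__main__":
    # Hypothetical table B: 4 radiation hybrids (rows) scored against 3 ordered STS bins.
    table_b = [
        [1, 0, 0],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 1],
    ]
    # Hypothetical row of table A: hybridization of one clone against the 4 RHs.
    signature = [0, 1, 1, 0]
    print(chromosomal_profile(signature, table_b), "-> bin", bin_clone(signature, table_b))
```

Because every RH contributes to the sum, a single spurious hybridization shifts the profile only slightly, which is the sense in which the peak can remain stable under noisy table A data.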
  • IPM is a proven approach for mapping YACs (C. W. Richard III, D. J. Duggan, K. Davis, J. E. Fan, M. J. Higgins, S. Qin, L. Zhang, T. B. Shows, M. R. James, and M. W. Perlin, "Rapid construction of physical maps using inner product mapping: YAC coverage of chromosome 11," in Fourth International Conference on Human Chromosome 11, Sep. 22-24, Oxford, England, 1994), incorporated by reference, and is a candidate method for mapping PACs (P. A. Ioannou, C. T. Amemiya, J. Games, P. M. Kroisel, H. Shizuya, C. Chen, M. A.
  • IRE-bubble PCR a rapid method for efficient and representative amplification of human genomic DNA sequences from complex sources
  • Genomics vol. 19, no. 3, pp. 506-14, 1994
  • incorporated by reference to reduce false negative errors
  • providing controls and redundant DNA spotting for internal calibration and directly acquiring signals (e.g., via a phosphorimager, Molecular Dynamics, Sunnyvale, Calif.) to facilitate automated scoring.
  • Current robotic technologies enable the high-throughput construction of gridded filters (A. Copeland and G. Lennon, "Rapid arrayed filter production using the 'ORCA' robot," Nature, vol. 369, no. 6479, pp.
  • Robots similarly provide high-throughput PCR comparisons for constructing table B.
  • existing RH mapping data can be rapidly extended (at low cost) into inner product maps of libraries (U. Francke, E. Chang, K. Comeau, E.-M. Geigl, J. Giacalone, X. Li, J. Luna, A. Moon, S. Welch, and P. Wilgenbus, "A radiation hybrid map of human chromosome 18," Cytogenet. Cell Genet., vol. 66, pp. 196-213, 1994), incorporated by reference.
1.2.13.9 Whole genome RH libraries
  • IPM is therefore useful in verifying and extending current mega YAC mapping projects, and in multiplexed experimental designs that pool sequences from well-separated bins.
1.2.13.10 Using short-range data to determine the orders and distances of clone subsets in proximate bins
  • IPM provides long-range mapping information for DNA sequences relative to RH bins through DNA hybridization. This binning information can be complemented with short-range mapping data, such as oligonucleotide fingerprint hybridizations (H. Lehrach, A. Drmanac, J. Hoheisel, Z. Larin, G. Lennon, A. P. Monaco, D. Nizetic, G. Zehetner, and A. Poustka, "Hybridization fingerprinting in genome mapping and sequencing," in Genetic and Physical Mapping I: Genome Analysis, K. E. Davies and S. M. Tilghman, ed. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory, 1990, pp.
  • this invention pertains to determining the sequence of the genome of an organism or species through the use of a novel, unobvious, and highly effective clone mapping strategy.
  • sequence information can be used for finding genes of known utility, determining structure/function properties of genes and their products, elucidating metabolic networks, understanding the growth and development of humans and other organisms, and making comparisons of genetic information between species. From these studies, diagnostic tests and pharmacological agents can be developed of great utility for preventing and treating human and other disease.
  • the present invention provides an improved method of determining the nucleotide base sequence of DNA.
  • the method of the invention involves the preparation of a DNA substrate comprising a set of molecules, each having a template strand and a primer strand, wherein the 3' ends of the primer strands of the molecules terminate at about the same nucleotide position on the template strands of the molecules within each set.
  • the template and primer strands of the molecules are of unequal lengths wherein the 3' ends of the primer strands of the molecules terminate at about the same nucleotide position on the template strands of the molecules within each set.
  • DNA synthesis is induced to obtain labeled reaction products comprising newly synthesized DNA complementary to the template strands using the 3' ends of the primer strands to prime DNA synthesis, labeled nucleoside triphosphates, at least one modified nucleoside triphosphate, and preferably, a suitable chain terminator, wherein the modified nucleoside triphosphate is selected to substantially protect newly synthesized DNA from cleavage.
  • the labeled reaction products are cleaved at one or more selected sites to obtain labeled DNA fragments wherein newly synthesized DNA is substantially protected from cleavage by the incorporation of the modified nucleotide.
  • the labeled DNA fragments obtained in the preceding step are separated and their nucleotide base sequence is identified by suitable means.
  • a combined amplification and termination reaction is performed using at least two different polymerase enzymes, each having a different affinity for the chain terminating nucleotide, so that polymerization by an enzyme with relatively low affinity for the chain terminating nucleotide leads to exponential amplification whereas an enzyme with relatively high affinity for the chain terminating nucleotide terminates the polymerization and yields sequencing products.
  • the invention features kits for directly amplifying nucleic acid templates and generating base specifically terminated fragments.
  • the kit can comprise an appropriate amount of: (i) a complete set of chain-elongating nucleotides; (ii) at least one chain-terminating nucleotide; (iii) a first DNA polymerase, which has a relatively low affinity towards the chain terminating nucleotide; and (iv) a second DNA polymerase, which has a relatively high affinity towards the chain terminating nucleotide.
  • the kit can also optionally include an appropriate primer or primers, appropriate buffers as well as instructions for use.
  • the instant invention allows DNA amplification and termination to be performed in one reaction vessel. Due to the use of two polymerases with different affinities for dideoxy nucleotide triphosphates, exponential amplification of the target sequence can be accomplished in combination with a termination reaction. In addition, the process obviates the purification procedures, which are required when amplification is performed separately from base terminated fragment generation. Further, the instant process requires less time to accomplish than separate amplification and base specific termination reactions.
  • the process can be used to detect and/or quantitate a particular nucleic acid sequence where only small amounts of template are available and fast and accurate sequence data acquisition is desirable.
  • the process is useful for sequencing unknown genes or other nucleic acid sequences and for diagnosing or monitoring certain diseases or conditions, such as genetic diseases, chromosomal abnormalities, genetic predispositions to certain diseases (e.g. cancer, obesity, atherosclerosis) and pathogenic (e.g. bacterial, viral, fungal, protistal) infections.
  • the instant process provides an opportunity to simultaneously sequence both strands, thereby providing greater certainty of the sequence data obtained or acquiring sequence information from both ends of a longer template.
  • a method and apparatus for determining the sequence of the bases in DNA by measuring the molecular mass of each of the DNA fragments in mixtures prepared by either the Maxam-Gilbert or Sanger-Coulson techniques.
  • the fragments are preferably prepared as in these standard techniques, although the fragments need not be tagged with radioactive tracers.
  • These standard procedures produce from each section of DNA to be sequenced four separate collections of DNA fragments, each set containing fragments terminating at only one or two of the four bases. In the Maxam-Gilbert method, the four separated collections contain fragments terminating at G, both G and A, both C and T, or C positions, respectively.
  • Each of these collections is sequentially loaded into an ultraviolet laser desorption mass spectrometer, and the mass spectrum of each collection is recorded and stored in the memory of a computer.
  • These spectra are recorded under conditions such that essentially no fragmentation occurs in the mass spectrometer, so that the mass of each ion measured corresponds to the molecular weight of one of the DNA fragments in the collection, plus a proton in the positive ion spectrum, and minus a proton in the negative ion spectrum.
  • The spectra obtained from the four collections are compared using a computer algorithm, and the location of each of the four bases in the sequence is unambiguously determined.
  • the DNA fragments to be analyzed are dissolved in a liquid solvent containing a matrix material.
  • Each sample is irradiated with a UV laser beam at a wavelength of between 260 nm and 560 nm, with pulses of from 1 to 20 ns pulse width.
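The comparison of the four Maxam-Gilbert collections can be illustrated with a minimal Python sketch. It assumes, for simplicity, that each measured mass has already been converted to a fragment length in nucleotides; the function name `call_bases` and the input layout are hypothetical and are not taken from the source.

```python
def call_bases(g, g_and_a, c_and_t, c):
    """Infer one base per fragment length from the four collections.

    Each argument is an iterable of fragment lengths observed in the
    collection terminating at G, at G or A, at C or T, and at C.
    """
    g, g_and_a, c_and_t, c = map(set, (g, g_and_a, c_and_t, c))
    sequence = []
    for length in sorted(g | g_and_a | c_and_t | c):
        if length in g:
            base = "G"      # seen in the G-only collection
        elif length in g_and_a:
            base = "A"      # in G+A but not in G, so it must be A
        elif length in c:
            base = "C"      # seen in the C-only collection
        else:
            base = "T"      # in C+T but not in C, so it must be T
        sequence.append(base)
    return "".join(sequence)


# Toy input: lengths 1..4 correspond to the bases G, A, T, C.
toy_sequence = call_bases({1}, {1, 2}, {3, 4}, {4})
```

A real implementation would work on measured masses and would need a tolerance when matching peaks between spectra; that step is left out here.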
  • a target oligonucleotide is exposed to a large number of immobilized probes of shorter length.
  • the probes are collectively referred to as an "array."
  • one identifies whether a target nucleic acid is complementary to a probe in the array by identifying first a core probe having high affinity to the target, and then evaluating the binding characteristics of all probes with a single base mismatch as compared to the core probe. If the single base mismatch probes exhibit a characteristic binding or affinity pattern, then the core probe is exactly complementary to at least a portion of the target nucleic acid.
  • the method can be extended to sequence a target nucleic acid larger than any probe in the array by evaluating the binding affinity of probes that can be termed "left" and "right" extensions of the core probe.
  • the correct left and right extensions of the core are those that exhibit the strongest binding affinity and/or a specific hybridization pattern of single base mismatch probes.
  • the binding affinity characteristics of single base mismatch probes follow a characteristic pattern in which probe/target complexes with mismatches on the 3' or 5' termini are more stable than probe/target complexes with internal mismatches. The process is then repeated to determine additional left and right extensions of the core probe to provide the sequence of a nucleic acid target.
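As a rough illustration of the probe sets involved, the following Python sketch enumerates the single base mismatch probes of a core probe and its candidate left and right extensions. The shift-by-one definition of an extension and all function names are assumptions made for the example; the source does not define them at this level of detail.

```python
BASES = "ACGT"


def single_mismatch_probes(core):
    """Return every probe differing from `core` at exactly one position."""
    probes = []
    for i, original in enumerate(core):
        for base in BASES:
            if base != original:
                probes.append(core[:i] + base + core[i + 1:])
    return probes


def right_extensions(core):
    """Probes shifted one base to the right: drop the first base, add one."""
    return [core[1:] + base for base in BASES]


def left_extensions(core):
    """Probes shifted one base to the left: add one base, drop the last."""
    return [base + core[:-1] for base in BASES]


mismatches = single_mismatch_probes("ACGT")  # 3 alternatives per position
```

In an actual experiment the measured binding signal of each of these probes would be compared; the sketch only shows which probes would be examined.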
  • a target is expected to have a particular sequence.
  • an array of probes is synthesized that includes a complementary probe and all or some subset of all single base mismatch probes. Through analysis of the hybridization pattern of the target to such probes, it can be determined if the target has the expected sequence and, if not, the sequence of the target may optionally be determined.
  • kits for analysis of nucleic acid targets are also provided by virtue of the present invention.
  • a kit includes an array of nucleic acid probes.
  • the probes may include a perfect complement to a target nucleic acid.
  • the probes also include probes that are single base substitutions of the perfect complement probe.
  • the kit may include one or more of the A, C, T, G, and/or U substitutions of the perfect complement. Such kits will have a variety of uses, including analysis of targets for a particular genetic sequence, such as in analysis for genetic diseases.
  • the present invention also enables the amplification of a DNA adjacent to a known sequence using the PCR, without the knowledge of the sequence for a second primer.
  • the present primary invention also provides a new method for sequencing a contiguously very long DNA sequence using the PCR technique, thereby enabling contiguous genomic sequencing. It will avoid the need for mapping or sub-cloning of shorter DNA fragments from haploid genomes such as the bacterial genomes.
  • This method can be used on very large DNA inserts into vectors such as the YAC. Thus, diploid genomes can be sequenced without any further need to sub-clone from the YAC clones.
  • the cloned inserts can be of any length, of several million nucleotides.
  • this method can be directly applied to sequence the whole chromosome without any need to fragment the chromosome or obtain YAC clones from the chromosome.
  • This method can also be used on whole unpurified genomes with appropriate modifications to account for the allelic variations of the two alleles present on the two chromosomes.
  • using the method of the present invention one can generate contiguous genomic sequence information in a manner not possible with any other known protocol using PCR.
  • the extended invention that enables the sequencing of an unknown region of very long DNA (e.g. genomic DNA) of totally unknown sequence would also find many applications in biology and medicine. For instance, it can be used to physically "map" a chromosome or genome. It would, for example, enable the production of an inventory of many sequences, each about 500 nucleotides long, and the exact primer associated with each of them. This method would also enable the cloning of the amplified DNA sequences from arbitrary regions from a genomic DNA without the need for breaking down the DNA. Using appropriately longer partly fixed primers (as the second primers), very long DNA pieces (several kilobases long) could be amplified and cloned by using this method.
  • 1.3.4.1 PCR Technique with 1 Primer
  • the present invention enables the amplification of a DNA stretch using the PCR procedure with the knowledge of only one primer.
  • the present invention describes a procedure by which a very long DNA of the order of millions of nucleotides can be sequenced contiguously, without the need for fragmenting and sub-cloning the DNA.
  • the general PCR technique is used, but the knowledge of only one primer is sufficient, and the knowledge of the other primer is derived from the statistics of the distributions of oligonucleotide sequences of specified lengths.
  • a primer is usually of length twelve nucleotides and longer. Suppose the sequence of one primer is known in a long DNA whose sequence is to be worked out. From this primer sequence, a specific sequence of four nucleotides occurs statistically at an average distance of 256 nucleotides. It has been worked out by Senapathy that a particular sequence of four characters would occur anywhere from zero distance up to about 1500 characters with a 99.9% probability (P. Senapathy, "Distribution and repetition of sequence elements in eukaryotic DNA: New insights by computer aided statistical analysis," Molecular Genetics (Life Sciences Advances), 7:53-65 (1988)). The mean distance for such an occurrence is 256 characters and the median is 180 characters.
  • a 5 nucleotide long specific sequence will occur at a mean distance of 1024 characters, with 99.99% of them occurring within 6000 characters from the first primer.
  • the median distance for the occurrence of a 5-nucleotide specific sequence is ~730 nucleotides.
  • a particular 6 nucleotide long sequence will occur at a mean distance of 4096 nucleotides and a median distance of ~2800 nucleotides.
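The mean and median distances quoted above can be approximated with a simple model. The Python sketch below assumes that the four bases are equally likely and independent at every position, so that the distance to the first occurrence of a given k-nucleotide sequence is roughly geometric; this idealization and the function name are assumptions, and the model need not reproduce the cited figures exactly, since those come from the referenced statistical analysis.

```python
import math


def kmer_distance_stats(k, within):
    """Approximate distance statistics for a specific k-nucleotide sequence.

    Returns (mean distance, median distance, probability that the first
    occurrence lies within `within` positions) under a geometric model.
    """
    p = 1 / 4 ** k                          # chance of a match at one position
    mean = 1 / p                            # 4**k
    median = math.log(0.5) / math.log(1 - p)
    prob_within = 1 - (1 - p) ** within
    return mean, median, prob_within


stats_4mer = kmer_distance_stats(4, 1500)
stats_5mer = kmer_distance_stats(5, 6000)
stats_6mer = kmer_distance_stats(6, 12000)
```

The median comes out at roughly 0.69 times the mean, which is the same order as the medians stated in the text for four, five and six fixed nucleotides.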
  • a primer of known length, say length 14, can be prepared with a known sequence of 6 characters, the rest of the sequence being random. It means that any of the four nucleotides can occur at the "random" sequence locations.
  • a primer of length 12-18 can be prepared with high specificity of binding.
  • 1.3.4.2 Non-Random Primer Partly Fixed Primer
  • Such a partially non-random primer is hereafter called the partly fixed primer, or partly non-random primer, meaning that part of its sequence is fixed.
  • the partly fixed primer will bind at an average distance of 1024 characters (for a fixed five nucleotide characters). This primer will bind specifically only at the location of the occurrence of the particular five nucleotide sequence with respect to the first primer.
  • the average distance between the first primer and the second non-random primer is ideal for DNA amplification and DNA sequencing.
  • the first primer is labeled.
  • the non-random primer would not affect the DNA sequencing, because the sequencing is dependent only upon the labeled primer.
  • because the partly fixed second primer has a random sequence component in it, a sub-population of the primer molecules will have the exact sequence that would bind with the exact target sequence.
  • the proportion of the molecules with exact sequence that would bind with the exact target sequence will vary depending on the number of random characters in the partly fixed second primer. For example, in a second primer 11 nucleotides long with 6 characters fixed and 5 characters random, one in ~1000 molecules will have the exact sequence complementary to the target sequence on the template.
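The "one in ~1000" figure follows from counting the possible random tails: five random positions give 4^5 = 1024 distinct molecules. A minimal Python sketch of this arithmetic is given below; it assumes that all four nucleotides are equally represented at each random position and, purely for illustration, places the random positions after the fixed part. The function names are hypothetical.

```python
from itertools import product


def exact_match_fraction(n_random):
    """Fraction of primer molecules carrying one particular random tail."""
    return 1 / 4 ** n_random


def degenerate_pool(fixed, n_random):
    """Enumerate every primer made of `fixed` plus `n_random` random bases."""
    return [fixed + "".join(tail) for tail in product("ACGT", repeat=n_random)]


fraction = exact_match_fraction(5)      # 6 fixed + 5 random positions
small_pool = degenerate_pool("GATTCA", 2)
```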
  • any non-specific binding by any population of the second primers to non-target sequences could be avoided by adjusting (increasing) the temperature of re-annealing appropriately during DNA amplification. It is well known that the change of even one nucleotide due to point-mutation in some cancer genes can be detected by DNA-hybridization. This technique is routinely used for diagnosing particular cancer genes (e.g. John Lyons, "Analysis of ras gene point mutations by PCR and oligonucleotide hybridization," in PCR Protocols: A guide to methods and applications, edited by Michael A Innis et al., (1990), Academic Press, New York).
  • non-specific binding sites for the partly fixed second primers could be expected to occur statistically on a long genomic DNA at many places other than the target site which is close to the first primer. Amplification of non-specific DNA between these primer binding sites that could occur on opposite strands of the template DNA could happen. However, this would not affect the objective of the present invention of specific DNA sequencing of the target sequence. Because only the first primer is labeled radioactively or fluorescently, only the reaction products of the target DNA will be visualized on the sequencing gel pattern. The presence of such non-specific amplification products in the reaction mixture will also not affect the DNA sequencing reaction.
  • Amplification of DNA will occur not only between the first primer and the partly fixed second primer that occurs closest downstream from the first primer, but also between the first primer and one or two subsequently occurring second primers, depending upon the distance at which they occur. However, these amplification products will all start from the first primer and will proceed up to these second primers. Since the DNA sequencing products are visualized by labeling the first primer, and since the DNA synthesis during the sequencing reaction proceeds from the first primer, the presence of two or three amplification products that start from the first primer will not affect the DNA sequencing products and their visualization on gels. At the most, the intensity of the bands that are subsets of different amplification products will vary slightly on the gel, but not affect the gel pattern. In fact, it is expected that this phenomenon will enable the sequencing of a longer DNA strand where the closest downstream primer is too close to the first primer, thereby avoiding the need for sequencing from the first primer again using another partly fixed second primer.
  • the minimum length of primer for highly specific amplification between primers on a template DNA is usually considered to be about 15 nucleotides. However, in the present invention, this length can be reduced to 12-14 nucleotides by increasing the G/C content of the fixed sequence. In essence, the basic procedure of the present invention is fully viable and feasible, and any non-specificity can be avoided by fine-tuning the reaction conditions such as adjusting the annealing temperature and reaction temperature during amplification, and/or adjusting the length and G/C content of the primers, which are routinely done in the standard PCR amplification protocol.
  • 1.3.4.4 Sequence DNA of 2nd Primer
  • the primary advantage of the present invention is to provide an extremely specific second primer that would bind precisely to a sequence at an appropriate distance from the first primer resulting in the ability to sequence a DNA without the prior knowledge of the second primer.
  • a primer sequence can be made complementary to a sequence located close to the downstream end. This can be used as the first primer in the next DNA amplification- sequencing reaction, and the unknown sequence downstream from it can be obtained by again using the same partly fixed primer that was used in the first round of sequencing as the second primer.
  • knowing only one short sequence in a contiguously long DNA molecule, the entire sequence can be worked out using the present invention.
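The repeated cycle of sequencing up to the next partly fixed primer site and then deriving a new first primer from the end of the known sequence can be sketched as a simple string walk. The Python below is only a toy model under stated assumptions: the template is a plain string read on one strand, primer binding is modeled as an exact text match, read-length limits are ignored, and the names `walk_template`, `fixed_motif` and `primer_len` are hypothetical.

```python
def walk_template(template, first_primer, fixed_motif, primer_len=12,
                  max_rounds=1000):
    """Extend the known sequence round by round from a single known primer.

    Returns the sequence recovered and the number of rounds performed.
    """
    offset = template.find(first_primer)
    if offset < 0:
        raise ValueError("first primer not found in template")
    start = offset
    primer = first_primer
    known = first_primer
    rounds = 0
    while rounds < max_rounds:
        primer_end = offset + len(primer)
        # Nearest downstream site matching the fixed part of the second primer.
        hit = template.find(fixed_motif, primer_end)
        if hit < 0:
            break
        # "Sequence" the stretch between the first primer and that site.
        known += template[primer_end:hit + len(fixed_motif)]
        # New first primer: the last bases of what is now known.
        primer = known[-primer_len:]
        offset = start + len(known) - len(primer)
        rounds += 1
    return known, rounds


toy_template = ("TTGACCTGAAGGCAT" "CCCCC" "ATATAT" "GATTC"
                "ACACAC" "GATTC" "TTTT")
recovered, n_rounds = walk_template(toy_template, "TTGACCTGAAGG", "GATTC")
```

The walk stops when no further site for the fixed motif exists downstream, which in this toy template leaves the last four bases unread.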
  • the distance from the first primer at which the second primer will bind on the template will also be conespondingly increased.
  • even if the length of amplified DNA is several thousand nucleotides, this DNA can still be used in DNA sequencing procedures.
  • the present invention can be used to amplify a DNA of length which is limited only by the inherent ability of PCR amplification.
  • a technique known as "long PCR" is used to amplify long DNA sequences (Kainz et al., "In vitro amplification of DNA Fragments > 10 kb," Anal Biochem., 202:46 (1992); Ponce & Micol, "PCR amplification of long DNA fragments," Nucleic Acids Research, 20:623 (1992)).
  • YAC Yeast Artificial Chromosome
  • This extended invention would enable the sequencing of a ~500 nucleotide long sequence somewhere within a given long DNA with no prior information of any sequence at all within the long DNA.
  • the probability that any specific primer of length 10 nucleotides would occur somewhere in a DNA of about one million nucleotides is approximately 1.
  • the probability that any primer of length 15 nucleotides occurs somewhere in a genome of about one billion nucleotides is approximately 1.
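One way to see where a value near 1 comes from in both cases is to compute the expected number of occurrences of a specific primer sequence in a random sequence of the given length. The Python sketch below does only that; it assumes equally likely, independent bases and a single strand, and the function name is hypothetical.

```python
def expected_occurrences(primer_len, dna_len):
    """Expected number of sites matching one specific primer sequence."""
    return dna_len / 4 ** primer_len


ten_mer_in_megabase = expected_occurrences(10, 1_000_000)
fifteen_mer_in_gigabase = expected_occurrences(15, 1_000_000_000)
```

Both values are slightly below 1, because 4^10 is about one million and 4^15 is about one billion.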
  • use of any exact primer of about 15 nucleotide sequence on a genomic DNA in the present invention as the first primer, and the use of the second partly fixed primer will enable the sequencing of the DNA sequence bracketed by the two primers somewhere in the genome.
  • this procedure can be used to obtain an exact sequence of about 500 characters somewhere from a genome without the prior knowledge of any of its sequence at all.
  • this procedure can be used to obtain many ~500-nucleotide sequences at random locations within a genome.
  • using these sequences as the starting points for contiguous genome sequencing in the present invention, the whole genomic sequence can be closed and completed.
  • an advantage of the present invention is that without any prior knowledge of any sequence in a genome, the whole sequence of a genome can be obtained.
  • every 15-nucleotide arbitrary primer may not always have a complementary sequence in a genome (about one billion nucleotides long). However, most often it would be present and would be useful in performing the above-mentioned sequencing. In some cases, there may be more than one occurrence of the primer sequence in the genome, and so it may not be useful in obtaining the sequence. However, the frequency of successful single-hits can be extremely high (~90%) and can be further refined by using an appropriate length of the arbitrary primer. For genomes (or long DNAs) that are shorter than a billion nucleotides, shorter exact sequences in the first primers (say 10 characters) could be used, and the rest could be random or "degenerate" nucleotides.
  • the longer primer will aid in avoiding non-specific DNA amplification.
  • the length of the first primer can thus be increased using degenerate nucleotides at the ends to a desired extent, without affecting any specificity.
  • the present invention can also be useful to amplify the DNA between the first primer and the partly fixed second primer, with an aim to using this amplified DNA for purposes other than DNA sequencing, such as cloning.
  • the reaction products will, however, contain the population of non-specific DNA amplified between the non-specifically occurring second primer binding sites on opposite strands.
  • with a purification step applied to this reaction mixture, such as using an immobilized column containing only the first primer, the amplified target DNA can be purified and used for any other purposes.
  • the invention also provides a systematic and efficient way to sequence large fragments of DNA, in particular genomic DNA. It combines an end-sequencing-based method of subclone pathway generation through the fragment with efficient transposon-based sequencing of the identified subclones.
  • the invention is directed to a method to sequence a fragment of DNA, said fragment typically having a length of more than about 30 kb.
  • the method comprises the following steps.
  • the fragment is provided in a host cloning vector capable of accommodating it.
  • the size of the fragment that can be sequenced will depend on the nature of the host cloning vector. Cloning vectors are available that can accommodate large fragments of DNA; even the approximately 30-40 kb fragments that are suitable for insertion into cosmids are of sufficient length that the method of the invention is usefully applicable to them.
  • a composition comprising said vector containing the inserted fragment is then randomly sheared, such as by sonication, to obtain subfragments of approximately 3 kb.
  • the length of the subfragments is appropriate to the transposon-mediated directed sequencing method that will ultimately be applied.
  • the 3 kb length is an approximation; it is intended only as an order of magnitude.
  • subfragments of 2-5 kb are susceptible to this approach.
  • the subfragments are then inserted into host cloning vectors to obtain a library of subclones.
  • host cloning vectors are ideally of minimal size, containing only a selectable marker, an origin of replication, and appropriate insertion sites for the subfragments.
  • the desirability of minimizing the available plasmid DNA in the performance of transposon-mediated sequencing is described by Strathmann, et al. (supra).
  • Sufficient subclones that contain subfragments derived from the original fragment are then recovered to provide 1x coverage of the fragment when the end of each subfragment is sequenced.
  • a stretch of about 400-450 bases can be sequenced with assurance using available automated sequencing techniques.
  • the sequencing can be conducted using the sequencing primers based on the vector sequences adjacent the inserts to proceed into the insert to approximately this distance.
  • the number of subclones required can be calculated by dividing the length of the original fragment by the intended sequencing distance, i.e., by approximately 400-450.
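As a worked example of this calculation, the short Python sketch below divides the fragment length by the read length and rounds up. The 40 kb fragment is an illustrative cosmid-sized value chosen for the example, and the function name is hypothetical.

```python
import math


def subclones_for_1x(fragment_len, read_len):
    """Number of end reads needed to cover the fragment once."""
    return math.ceil(fragment_len / read_len)


low_estimate = subclones_for_1x(40_000, 450)
high_estimate = subclones_for_1x(40_000, 400)
```

For a 40 kb insert this gives on the order of one hundred subclones.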
  • each recovered subclone containing fragment-derived DNA is then sequenced and this sequence information is placed into a searchable database.
  • the database is searched for subclones that contain subfragments with nucleotide sequences matching those that characterize the host vector that accommodated the original fragment. To the extent that these subfragments also contain sequence from the original fragment, that sequence must be at one or the other end of the original fragment. This illustrates why the efficiency of the method is improved by introducing a prescreening step which eliminates any subclones which do not contain portions of the original fragment. If the prescreening has been done, these subclones contain oligonucleotide sequence from either end of the original fragment. The identified subclones are recovered.
  • 1.3.5.1 "Second End" Sequence
  • a partial sequence of each of the identified subclones is determined from the opposite end of the subfragment insert from that originally placed in the database. This provides "second end" sequence information concerning sequence further removed from the end of the original fragment. This information is then used to search the database in order to identify subclones containing nucleotide sequence that matches this second end sequence. Such subclones are likely to represent regions of the original fragment that are farther removed from the ends and provide further progress in constructing a path across the fragment. These subclones are recovered as well, and sequenced from the end opposite to that which was sequenced to provide the information for the database and this new information, in turn, used to search the database for a matching sequence.
  • the steps of second end sequencing, searching the database with the resulting sequence information, and recovery of subclones which contain a match are repeated sequentially until subclones have been identified that represent the complete original fragment.
  • the resulting collection of subclones consists of an ordered minimum set that collectively represent the original fragment. The appropriate sequence of such subclones to span the original fragment from end to end is also known.
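The iterative search described above can be sketched as follows. This Python example is a simplified illustration, not the method as claimed: it treats two reads as "matching" when they share any exact substring of length `k`, assumes every second-end read is available on request, and follows a single path from one end of the fragment. All names and the toy data are hypothetical.

```python
def shares_kmer(a, b, k=6):
    """True if reads `a` and `b` have an exact substring of length k in common."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))


def build_path(first_end_db, second_end_reads, vector_seq, k=6):
    """Order subclones from one end of the original fragment inward.

    first_end_db: subclone id -> end read stored in the database
    second_end_reads: subclone id -> read from the opposite end of the insert
    """
    starts = [sid for sid, read in first_end_db.items()
              if shares_kmer(read, vector_seq, k)]
    if not starts:
        return []
    path = [starts[0]]
    while True:
        query = second_end_reads[path[-1]]      # "second end" sequence
        match = next((sid for sid, read in first_end_db.items()
                      if sid not in path and shares_kmer(read, query, k)), None)
        if match is None:
            break
        path.append(match)
    return path


fragment = "ACGTTGCAAGCTAGGCTTAACCGGATCGATTACGCCATAG"
vector = "GGGGGGGGGG"
first_ends = {
    "s1": "GGGGGG" + fragment[0:8],   # contains vector sequence
    "s2": fragment[10:22],
    "s3": fragment[23:35],
    "decoy": "TTTTTTCCCCCC",          # unrelated clone, never selected
}
second_ends = {
    "s1": fragment[8:18],
    "s2": fragment[22:32],
    "s3": fragment[32:40],
    "decoy": "AAAAAAAATTTT",
}
ordered = build_path(first_ends, second_ends, vector)
```

A practical version would use a proper sequence search tool and would have to handle insert orientation and ambiguous matches, which this sketch ignores.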
  • the invention is directed to kits suitable for conducting the method of the invention.
  • the invention also describes a new method to sequence DNA.
  • the improvements over the existing DNA sequencing technologies include high speed, high throughput, no required electrophoresis (and, thus, no gel reading artifacts due to the complete absence of an electrophoretic step), and no costly reagents involving various substitutions with stable isotopes.
  • the invention utilizes the Sanger sequencing strategy and assembles the sequence information by analysis of the nested fragments obtained by base-specific chain termination via their different molecular masses using mass spectrometry, for example, MALDI or ES mass spectrometry.
  • a further increase in throughput can be obtained by introducing mass modifications in the oligonucleotide primer, the chain-terminating nucleoside triphosphates and/or the chain-elongating nucleoside triphosphates, as well as using integrated tag sequences which allow multiplexing by hybridization of tag specific probes with mass differentiated molecular weights.
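The read-out of a Sanger ladder by mass rather than by gel position can be illustrated with a minimal Python sketch. It assumes that the four base-specific termination reactions have been measured separately and that each fragment gives one resolved mass; the masses in the example are made-up numbers and the function name is hypothetical.

```python
def read_ladder(reactions):
    """Read the sequence from four base-specific sets of fragment masses.

    reactions: dict mapping a terminating base to the list of fragment
    masses observed in that reaction. Lighter fragments terminate closer
    to the primer, so sorting by mass gives the order of the bases.
    """
    ladder = sorted((mass, base)
                    for base, masses in reactions.items()
                    for mass in masses)
    return "".join(base for _, base in ladder)


toy_reactions = {
    "A": [1200.0, 2100.5],   # hypothetical masses
    "C": [1500.2],
    "G": [1810.3],
    "T": [2400.9],
}
toy_read = read_ladder(toy_reactions)
```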
  • the present invention pertains to a method for sequencing genomes.
  • the method comprises the steps of obtaining nucleic acid material from a genome. Then there is the step of constructing a clone library and one or more probe libraries from the nucleic acid material. Next there is the step of comparing the libraries to form comparisons. Then there is the step of combining the comparisons to construct a map of the clones relative to the genome. Next there is the step of determining the sequence of the genome by means of the map.
  • the present invention pertains to a system for sequencing a genome.
  • the system comprises a mechanism for obtaining nucleic acid material from a genome.
  • the system also comprises a mechanism for constructing a clone library and one or more probe libraries.
  • the constructing mechanism is in communication with the nucleic acid material from a genome.
  • the system comprises a mechanism for comparing said libraries to form comparisons.
  • the comparing mechanism is in communication with the said libraries.
  • the system also comprises a mechanism for combining the comparisons to construct a map of the clones relative to the genome.
  • the said combining mechanism is in communication with the comparisons.
  • the system comprises a mechanism for determining the sequence of the genome by means of said map.
  • the said determining mechanism is in communication with said map.
  • the present invention additionally pertains to a method for producing a gene of a genome.
  • the method comprises the steps of obtaining nucleic acid material from a genome. Then there is the step of constructing libraries from the nucleic acid material. Next there is the step of comparing the libraries to form comparisons. Then there is the step of combining the comparisons to construct a map of the clones relative to the genome. Next there is the step of localizing a gene on the map. Then there is the step of cloning the gene from the map.
  • 1.3.8 Methods and means for the massively parallel characterization of complex molecules and of molecular recognition phenomena with parallelism and redundancy attained through single molecule examination methods
  • the present invention approaches the vastness of biological complexity through massive parallelism, which may conveniently be attained through various single molecule examination (SME) methods variously referred to heretofore as single molecule detection (SMD), single molecule visualization (SMV) and single molecule spectroscopy (SMS) techniques.
  • SME single molecule examination
  • SMD single molecule detection
  • SMV single molecule visualization
  • SMS single molecule spectroscopy
  • Molecular parallelism may be applied to the examination of the composition of complex molecules (including co-polymers of natural or of synthetic origin) or to determinations of interactions between large numbers of molecules.
  • the former case may be applied to genome-scale sequencing methods.
  • the latter case may be applied to rapid determination of molecular complementarity, with applications in (biological or non-biological) affinity characterization, immunological study, clinical pathology, molecular evolution (e.g. in vitro evolution), and the construction of a cybernetic immune system as well as prostheses based thereupon.
  • molecular recognition phenomena are observed with molecular parallelism.
  • both the kinetics of binding association and dissociation and the binding equilibria may be examined.
  • Kinetics may be examined by observing the rates of occupation of appropriate sites or diverse populations thereof by some homogenous or heterogeneous sample, and the rates of vacancy formation from occupied sites.
  • Equilibrium constants may be determined by observing the proportion (number of occupied sites divided by number of total sites) of sites occupied under equilibrium conditions, with greater quantitative confidence yielded by, for example, examining more binding sites.
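The occupancy-based estimate can be made concrete with a small Python sketch. It assumes simple one-to-one binding at independent, identical sites with a known free ligand concentration, so that the dissociation constant follows from the occupied fraction; that binding model, the binomial error estimate and the function names are assumptions added for illustration.

```python
import math


def occupied_fraction(occupied, total):
    """Proportion of sites occupied at equilibrium."""
    return occupied / total


def dissociation_constant(occupied, total, free_ligand_conc):
    """K_d under 1:1 binding: K_d = [L] * (1 - theta) / theta."""
    theta = occupied_fraction(occupied, total)
    return free_ligand_conc * (1 - theta) / theta


def fraction_standard_error(occupied, total):
    """Binomial standard error of the occupied fraction."""
    theta = occupied_fraction(occupied, total)
    return math.sqrt(theta * (1 - theta) / total)


kd_half = dissociation_constant(500, 1000, 2e-9)
se_small = fraction_standard_error(50, 100)
se_large = fraction_standard_error(5000, 10000)
```

The standard error shrinks with the square root of the number of sites examined, which is one way to read the remark about greater confidence from more binding sites.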
  • Sequencing of polynucleotide molecules may be effected by the (preferably end-wise) immobilization of a library of such molecules to a surface at a density convenient for detection, which will vary according to the detection methodology availed.
  • Several methods capable of effecting such immobilization will be obvious to those skilled in the arts of recombinant DNA technology and molecular biology, among others.
  • Priming, which may be random or non-random, is effected by any of a variety of methods, most of which are obvious to those skilled in the relevant arts.
  • Genome sequencing applications that use enzymatic polymerization, and the corresponding embodiments of the present invention, rely upon control over polymerization rate and nucleotide incorporation specificity, consistent with the well-known Watson-Crick base pairing rules, which may be enforced (upon single nucleotides in a processive manner, as conditions permit) by the use of DNA polymerases or analogs thereof, in combination with repeatable single molecule detection applied to a large population of diverse molecules.
  • a sequencing cycle comprises the steps of: (1.) polymerizing one or fewer nucleotides, which carry some removable or neutralizable molecular label and may optionally be reversibly 3' protected (or otherwise protected in any manner which modulates polymerization rate), onto each sample molecule at the primer or at subsequent extensions thereof and in opposition to (and pairing with) a single, unique, base of the template polynucleotide strand; (2.) optionally washing away any unreacted labeled nucleotides; (3.) detecting, by either direct or indirect methods, said labeled nucleotides incorporated into said sample molecules, in a manner which repeatably associates information obtained about the type of label observed with the unique identity of the template molecule under observation, which may be uniquely distinguished by a variety of methods (which include: a mappable location of immobilization of the sample template molecule on a substrate surface; a mappable location of immobilization of the sample template molecule within some matrix volume element; microscopic labeling with some readily identifiable
  • Said sequencing cycle comprising an appropriate subset of steps 1-6 may be repeated as many times as convenient, but must be repeated a sufficient number of times to obtain sequence information of sufficient complexity from each individual molecule to permit unambiguous alignment of all such sequence information determined for all of the molecules of the sample. This minimum number of cycles will be approximately related to the complexity C of the sample to be treated as part of the same macroscopic reaction (i.e.
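The exact relation between cycle count and complexity C is cut off in the text above. As a rough illustration only, one common back-of-envelope assumption (not taken from the source) is that a read of n bases can take 4^n distinct values, so n must be at least log4(C) before reads can be told apart even in principle. Because each cycle adds at most one nucleotide, the same figure is a lower bound on cycles; real samples with repeats would need more. The function name is hypothetical.

```python
def min_read_length(complexity_bases: int) -> int:
    """Smallest n with 4**n >= complexity_bases (a lower bound, not a guarantee).

    Assumption for illustration: a random sample of the given complexity,
    with no repeats and no sequencing errors.
    """
    if complexity_bases < 1:
        raise ValueError("complexity must be a positive number of bases")
    n, distinct_words = 0, 1
    # Integer arithmetic avoids floating-point surprises from math.log.
    while distinct_words < complexity_bases:
        distinct_words *= 4
        n += 1
    return n


if __name__ == "__main__":
    for c in (4_600_000, 3_000_000_000):
        print(c, "bases ->", min_read_length(c), "cycles at minimum")
```

In practice a margin above this bound would be chosen, since the bound ignores repeated sequence and misincorporation.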
  • the sequence determination applications of the present invention enjoy substantial advantages deriving from sample manipulation in the single-molecule regime.
  • Working in the distinct single-molecule regime rather than with populations of identical molecules provides substantial advantages of parallelism, facility of use and implementation (including automated implementation), and operability.
  • unanticipated advantages (1) because a single molecule is necessarily monodisperse, failure of a molecule to undergo addition in a cycle does not cause a loss of sample monodispersion (i.e.
  • samples comprising multiple identical molecules may thus take on non-identical lengths, complicating data collection and analysis;
  • samples comprising a plurality of individually distinct single molecules (species) may be handled unitarily without requiring any handling measures to keep distinct molecules apart, providing a large reduction in manipulations required on a per-species basis and not requiring the use of many separate, parallel fluid handling steps or means;
  • inadvertent multiple base additions are more readily detected and their extent is more readily quantified because these changes in quantity are large compared to the signal expected from the incorporation of a single base (i.e.
  • Oversampling redundancy may be used to increase data confidence by providing the opportunity to score and match multiple occurrences of the same sequence segment and thus detect and eliminate erroneous sequence segment information by virtue of its less frequent occurrence.
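The idea of rejecting erroneous segments by their lower frequency can be sketched as a column-wise majority vote. This is a minimal illustration, assuming the redundant reads have already been aligned to the same length; the function name and the 50% threshold are assumptions, not part of the source.

```python
from collections import Counter


def consensus(reads: list[str], min_fraction: float = 0.5) -> str:
    """Column-wise majority vote over equal-length, pre-aligned reads.

    A position is reported as 'N' when no base is seen in more than
    min_fraction of the reads.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be pre-aligned to the same length")
    called = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(reads) > min_fraction else "N")
    return "".join(called)


if __name__ == "__main__":
    # The third read carries one misincorporation, which is outvoted.
    print(consensus(["ACGT", "ACGT", "ACTT"]))
```

Raising the oversampling depth lowers the chance that an error is shared by a majority of reads at the same position.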
  • Erroneous sequence segment information may arise, for instance, from nucleotide incorporation errors, which are an inevitable feature of polymerization with polymerases having a characteristic fidelity, i.e. displaying a characteristic nucleotide misincorporation rate. Such methods will be particularly useful where polynucleotide polymerase fidelity would otherwise be unacceptably low. It should be noted that an error rate of one percent or more has been deemed conventionally acceptable for genome informatic purposes.
  • 1.3.8.3 Controls/Data
  • known molecules having sequences that are highly unrelated to the sample may be included as internal controls to monitor the efficiency and accuracy of a particular sequence collection process; such internal control sequences will present negligibly small overhead because molecular parallelism may easily accommodate any such comparatively small increase in sample complexity, even though it might be considered large with respect to pre-existing methods.
  • data alignment may be performed in tandem or parallel with later cycles and may be monitored by appropriate computational algorithms for data quality and confidence of sequence information, and cycling may continue till desired criteria are satisfied.
  • Computer, microprocessor, electronic or other automated control of instrumentation, including fluidics and robotics for the manipulation of samples, and the automated effectuation of the various methods of the present invention, all according to parameterized algorithms, may be accomplished by means obvious from the present disclosure to those skilled in the relevant arts (e.g. fluidics, robotics, electronics, microelectronics, computer science and engineering, and mechanical engineering).
  • Concurrent data alignment and monitoring will permit modifications of the sequencing cycle described above, such as dynamic adjustment of polymerization reaction conditions and durations, label removal or neutralization procedure parameters, polymerization deprotection conditions, and any other desired parameter, so as to permit optimization of procedures and results.
  • Double/Single Stranded Polynucleotide Sequencing Method may be examined. Where single stranded polynucleotide molecules are preferred, second strands may be removed by performing said immobilization so as to involve only one strand in covalent linkage with said surface and then performing a denaturation of the sample with washing.
  • Priming means required by any particular enzyme must then be provided, usually by hybridization of a complementary oligo- or polynucleotide to the sample template molecules, though other means are possible.
  • Other methods which will be obvious to those skilled in the arts of recombinant DNA technology may also be employed to yield immobilized or otherwise uniquely identifiable single stranded polynucleotide samples.
  • said second strands may be treated with an appropriate exonuclease under appropriate conditions and for an appropriate length of time to provide a good distribution of lengths of said second strands such that the termini of the undegraded portions of said second strands provide convenient priming for enzymatic nucleotide polymerization (i.e. DNA directed DNA synthesis or DNA replication, DNA directed RNA synthesis or transcription, RNA directed DNA synthesis or reverse transcription, or RNA directed RNA synthesis or RNA replication).
  • polynucleotide sequencing methods of the present invention represent the converse of conventional enzymatic and chemical sequencing methods in that those conventional methods rely upon the production of multiple homogeneous sub-populations of DNA molecules which together comprise a nested set, and the detection of each such sub-population (with deviant chain terminator misincorporation molecules arising with significantly lower frequency and thus constituting a poorly detected population), while the present invention relies on alignment of information from a highly inhomogeneous population of molecules and repeatable detection of single molecules.
  • each species yields information about only one base at one position within the sample sequence, while with the methods of the present invention, each individual sample template molecule may yield information about the identity of several bases.
  • this invention describes an improved method of sequencing DNA.
  • this invention employs mass spectrometry to analyze the Sanger sequencing reaction mixtures.
  • the DNA sequence can be assigned via superposition (e.g., interpolation) of the molecular weight peaks of the four individual experiments.
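Superposition of the four terminator-specific spectra can be sketched as a merge of peak lists by mass: each peak is labelled with the terminator of the reaction it came from, and the labels are read in ascending mass order. The function name and the peak values in the usage example are hypothetical and chosen only for illustration.

```python
def read_ladders(peaks_by_terminator: dict[str, list[float]]) -> str:
    """Merge peaks from the four base-specific reactions and read the sequence.

    Keys are the terminating bases ('A', 'C', 'G', 'T'); values are the
    molecular weights observed in that reaction. The lightest fragment is
    the first base after the primer.
    """
    labelled = [
        (mass, base)
        for base, masses in peaks_by_terminator.items()
        for mass in masses
    ]
    labelled.sort()  # ascending molecular weight = ascending fragment length
    return "".join(base for _, base in labelled)


if __name__ == "__main__":
    spectra = {  # made-up masses in Da
        "A": [5313.2, 6560.0],
        "C": [5602.4],
        "G": [5931.6],
        "T": [6235.8],
    }
    print(read_ladders(spectra))
```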
  • the molecular weights of the four specifically terminated fragment families can be determined simultaneously by MS, either by mixing the products of all four reactions run in at least two separate reaction vessels (i.e., all run separately, or two together, or three together) or by running one reaction having all four chain-terminating nucleotides (e.g., a reaction mixture comprising dTTP, ddTTP, dATP, ddATP, dCTP, ddCTP, dGTP, ddGTP) in one reaction vessel.
  • chain-elongating nucleotides include 2'-deoxyribonucleotides and chain-terminating nucleotides include 2',3'-dideoxyribonucleotides.
  • chain-elongating nucleotides include ribonucleotides and chain-terminating nucleotides include 3'-deoxyribonucleotides.
  • the term nucleotide is also well known in the art.
  • nucleotides include nucleoside mono-, di-, and triphosphates. Nucleotides also include modified nucleotides such as phosphorothioate nucleotides.
  • mass spectrometry is a serial method, in contrast to currently used slab gel electrophoresis, which allows several samples to be processed in parallel.
  • a further improvement can be achieved by multiplex mass spectrometric DNA sequencing to allow simultaneous sequencing of more than one DNA or RNA fragment.
  • the range of about 300 mass units between one nucleotide addition can be utilized by employing either mass-modified nucleic acid sequencing primers or chain-elongating and/or terminating nucleoside triphosphates so as to shift the molecular weight of the base-specifically terminated fragments of a particular DNA or RNA species being sequenced in a predetermined manner.
  • several sequencing reactions can be mass spectrometrically analyzed in parallel.
  • multiplex mass spectrometric DNA sequencing can be performed by mass modifying the fragment families through specific oligonucleotides (tag probes) which hybridize to specific tag sequences within each of the fragment families.
  • tag probe can be covalently attached to the individual and specific tag sequence prior to mass spectrometry.
  • Preferred mass spectrometer formats for use in the invention are matrix assisted laser desorption ionization (MALDI), electrospray (ES), ion cyclotron resonance (ICR) and Fourier Transform.
  • API atmospheric pressure ionization interface
  • MS/MS quadrupole configuration
  • In MALDI mass spectrometry, various mass analyzers can be used, e.g., magnetic sector/magnetic deflection instruments in single or triple quadrupole mode (MS/MS), Fourier transform and time-of-flight (TOF) configurations, as is known in the art of mass spectrometry. For the desorption/ionization process, numerous matrix/laser combinations can be used. Ion-trap and reflectron configurations can also be employed.
  • the molecular weight values of at least two base-specifically terminated fragments are determined concurrently using mass spectrometry.
  • the molecular weight values of preferably at least five and more preferably at least ten base-specifically terminated fragments are determined by mass spectrometry.
  • the nested base-specifically terminated fragments in a specific set can be purified of all reactants and by-products but are not separated from one another. The entire set of nested base-specifically terminated fragments is analyzed concurrently and the molecular weight values are determined. At least two base-specifically terminated fragments are analyzed concurrently by mass spectrometry when the fragments are contained in the same sample.
  • the overall mass spectrometric DNA sequencing process will start with a library of small genomic fragments obtained after first randomly or specifically cutting the genomic DNA into large pieces which then, in several subcloning steps, are reduced in size and inserted into vectors like derivatives of M13 or pUC (e.g., M13mp18 or M13mp19).
  • the fragments inserted in vectors, such as M13, are obtained via subcloning starting with a cDNA library.
  • the DNA fragments to be sequenced are generated by the polymerase chain reaction (e.g., Higuchi et al., "A General Method of in vitro Preparation and Mutagenesis of DNA Fragments: Study of Protein and DNA Interactions," Nucleic Acids Res., 16, 7351-67 (1988)).
  • Sanger sequencing can start from one nucleic acid primer (UP) binding to the plus-strand or from another nucleic acid primer binding to the opposite minus-strand.
  • either the complementary sequence of both strands of a given unknown DNA sequence can be obtained (providing for reduction of ambiguity in the sequence determination) or the length of the sequence information obtainable from one clone can be extended by generating sequence information from both ends of the unknown vector-inserted DNA fragment.
  • the nucleic acid primer carries, preferentially at the 5'-end, a linking functionality, L, which can include a spacer of sufficient length and which can interact with a suitable functionality, L', on a solid support to form a reversible linkage such as a photocleavable bond. Since each of the four Sanger sequencing families starts with a nucleic acid primer, this fragment family can be bound to the solid support by reacting with functional groups, L', on the surface of a solid support and then intensively washed to remove all buffer salts, triphosphates, enzymes, reaction by-products, etc.
  • the temporary linkage can be such that it is cleaved under the conditions of mass spectrometry, i.e., a photocleavable bond such as a charge transfer complex or a stable organic radical.
  • the linkage can be formed with L' being a quaternary ammonium group.
  • the surface of the solid support carries negative charges which repel the negatively charged nucleic acid backbone and thus facilitate desorption.
  • Desorption will take place either by the heat created by the laser pulse and/or, depending on L', by specific absorption of laser energy which is in resonance with the L' chromophore.
  • the functionalities, L and L', can also form a charge transfer complex and thereby form the temporary L-L' linkage.
  • Various examples for appropriate functionalities with either acceptor or donator properties are depicted without limitation herein. Since in many cases the "charge-transfer band" can be determined by UV/vis spectrometry (see e.g. Organic Charge Transfer Complexes by R. Foster, Academic Press, 1969), the laser energy can be tuned to the corresponding energy of the charge-transfer wavelength and, thus, a specific desorption off the solid support can be initiated. Those skilled in the art will recognize that several combinations can serve this purpose and that the donor functionality can be either on the solid support or coupled to the nested Sanger DNA/RNA fragments or vice versa.
  • the temporary linkage L-L' can be generated by homolytically forming relatively stable radicals.
  • a combination of the approaches using charge-transfer complexes and stable organic radicals is shown.
  • the nested Sanger DNA/RNA fragments are captured via the formation of a charge transfer complex.
  • desorption as well as ionization will take place at the radical position.
  • under the influence of the laser pulse, the L-L' linkage will be cleaved and the nested Sanger DNA/RNA fragments desorbed and subsequently ionized at the radical position formed.
  • a corresponding laser wavelength can be selected (see e.g. Reactive Molecules by C. Wentrup, John Wiley & Sons, 1984).
  • the nested Sanger DNA/RNA fragments are captured via Watson-Crick base pairing to a solid support-bound oligonucleotide complementary to either the sequence of the nucleic acid primer or the tag oligonucleotide sequence. The duplex formed will be cleaved under the influence of the laser pulse and desorption can be initiated.
  • the solid support-bound base sequence can be presented through natural oligoribo- or oligodeoxyribonucleotide as well as analogs (e.g. thio-modified phosphodiester or phosphotriester backbone) or employing oligonucleotide mimetics such as PNA analogs (see e.g. Nielsen et al., Science, 254, 1497 (1991)) which render the base sequence less susceptible to enzymatic degradation and hence increase the overall stability of the solid support-bound capture base sequence.
  • cleavage of the L-L' linkage can be obtained directly with a laser tuned to the energy necessary for bond cleavage.
  • the immobilized nested Sanger fragments can be directly ablated during mass spectrometric analysis.
  • nucleic acids can be "conditioned" by adding positive or negative charges, i.e. charge tags (CTs).
  • CTs increase the mass spectrometer detection sensitivity by increasing the degree of ionization during the mass spectrometric (e.g. MALDI) process.
  • a CT can be linked either to the external 3' or 5' position or internally e.g. at the 2' position or at the base, e.g.
  • Charge tags (CTs) can be molecules with permanent (i.e. pH-independent) ionization, such as:
  • the trityl group is used to anchor the oligonucleotide to a solid support via the tertiary carbon and this bond is cleaved during mass spectrometry (e.g. MALDI), leaving a positive charge on the desorbing and high vacuum flying oligonucleotide.
  • conditioning is modification of the phosphodiester backbone of the nucleic acid molecule (e.g. cation exchange), which can be useful for eliminating peak broadening due to a heterogeneity in the cations bound per nucleotide unit.
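Why cation heterogeneity broadens a peak can be shown with a small calculation. The sketch below assumes, purely for illustration, that each backbone phosphate carries either a proton or a sodium ion, so one oligonucleotide appears as a series of species spaced by the Na/H mass difference. The function name and the example mass are hypothetical; the atomic masses are standard average values.

```python
# Average atomic masses in Da; replacing H+ by Na+ on one phosphate adds the difference.
NA_MINUS_H = 22.98977 - 1.00794


def sodium_adduct_masses(free_acid_mass: float, n_phosphates: int) -> list[float]:
    """Masses of the species carrying 0..n_phosphates sodium counter-ions."""
    if n_phosphates < 0:
        raise ValueError("n_phosphates must be non-negative")
    return [free_acid_mass + k * NA_MINUS_H for k in range(n_phosphates + 1)]


if __name__ == "__main__":
    series = sodium_adduct_masses(6000.0, 19)  # hypothetical 20-mer, 19 phosphates
    print(f"{len(series)} species spread over {series[-1] - series[0]:.1f} Da")
```

Exchanging all counter-ions for a single species collapses this series into one peak, which is the stated purpose of the cation exchange step.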
  • by contacting a nucleic acid molecule with an alkylating agent such as an alkyl iodide, iodoacetamide, β-iodoethanol, or 2,3-epoxy-1-propanol, the monothio phosphodiester bonds of the nucleic acid molecule can be transformed into phosphotriester bonds.
  • phosphodiester bonds may be transformed to uncharged
  • Further conditioning involves incorporating nucleotides which reduce sensitivity for depurination (fragmentation during MS) such as N7- or N9-deazapurine nucleotides, or RNA building blocks or using oligonucleotide triesters or incorporating phosphorothioate functions which are alkylated or employing oligonucleotide mimetics such as PNA.
  • Modification of the phosphodiester backbone can be accomplished by, for example, using alpha-thio modified nucleotides for chain elongation and termination.
  • alkylating agents such as alkyl iodides, iodoacetamide, β-iodoethanol, 2,3-epoxy-1-propanol
  • the monothio phosphodiester bonds of the nested Sanger fragments are transformed into phosphotriester bonds.
  • Multiplexing by mass modification in this case is obtained by mass-modifying the nucleic acid primer (UP) or the nucleoside triphosphates at the sugar or the base moiety.
  • the linking chemistry allows one to cleave off the so-purified nested DNA enzymatically, chemically or physically.
  • the L-L' chemistry can be of a type of disulfide bond (chemically cleavable, for example, by mercaptoethanol or dithioerythritol), a biotin/streptavidin system, a heterobifunctional derivative of a trityl ether group (Köster et al., "A Versatile Acid-Labile Linker for Modification of Synthetic Biomolecules," Tetrahedron Letters 31, 7095 (1990)) which can be cleaved under mildly acidic conditions, a levulinyl group cleavable under almost neutral conditions with a hydrazinium/acetate buffer, an arginine-arginine or lysine-lysine bond cleavable by an endopeptidase enzyme like trypsin or a pyrophosphate bond cleavable by a pyrophosphatase, a photocleavable bond which can be, for example, physically cle
  • another cation exchange can be performed prior to mass spectrometric analysis.
  • the enzyme used to cleave the bond can serve as an internal mass standard during MS analysis.
  • the purification process and/or ion exchange process can be carried out by a number of other methods instead of, or in conjunction with, immobilization on a solid support.
  • the base-specifically terminated products can be separated from the reactants by dialysis, filtration (including ultrafiltration), and chromatography.
  • these techniques can be used to exchange the cation of the phosphate backbone with a counter-ion which reduces peak broadening.
  • the base-specifically terminated fragment families can be generated by standard Sanger sequencing using the Large Klenow fragment of E. coli DNA polymerase I, by Sequenase, Taq DNA polymerase and other DNA polymerases suitable for this purpose, thus generating nested DNA fragments for the mass spectrometric analysis. It is, however, part of this invention that base-specifically terminated RNA transcripts of the DNA fragments to be sequenced can also be utilized for mass spectrometric sequence determination.
  • various RNA polymerases such as the SP6 or the T7 RNA polymerase can be used on appropriate vectors containing, for example, the SP6 or the T7 promoters (e.g.
  • nucleic acid primer Pitulle et al., "Initiator Oligonucleotides for the Combination of Chemical and Enzymatic RNA Synthesis," Gene 112, 101-105 (1992)
  • L linking functionalities
  • various solid supports can be used, e.g., beads (silica gel, controlled pore glass, magnetic beads, Sephadex/Sepharose beads, cellulose beads, etc.), capillaries, glass fiber filters, glass surfaces, metal surfaces or plastic material.
  • useful plastic materials include membranes in filter or microtiter plate formats, the latter allowing the automation of the purification process by employing microtiter plates which, as one embodiment of the invention, carry a permeable membrane in the bottom of the well functionalized with L'.
  • Membranes can be based on polyethylene, polypropylene, polyamide, polyvinylidenedifluoride and the like.
  • suitable metal surfaces include steel, gold, silver, aluminum, and copper.
  • after purification, cation exchange, and/or modification of the phosphodiester backbone of the L-L' bound nested Sanger fragments, they can be cleaved off the solid support chemically, enzymatically or physically.
  • the L-L' bound fragments can be cleaved from the support when they are subjected to mass spectrometric analysis by using appropriately chosen L-L' linkages and corresponding laser energies/intensities as described above and herein.
  • MALDI Data Analysis
  • the highly purified, four base-specifically terminated DNA or RNA fragment families are then analyzed with regard to their fragment lengths via determination of their respective molecular weights by MALDI or ES mass spectrometry.
  • the samples, dissolved in water or in a volatile buffer, are injected either continuously or discontinuously into an atmospheric pressure ionization interface (API) and then mass analyzed by a quadrupole.
  • the molecular weight peaks are searched for the known molecular weight of the nucleic acid primer (UP), and it is determined which of the four chain-terminating nucleotides has been added to the UP. This represents the first nucleotide of the unknown sequence.
  • the second, the third, and so on up to the nth extension product can be identified in a similar manner and, by this, the nucleotide sequence is assigned.
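The stepwise assignment can be sketched for a single spectrum that contains all four terminated families. The first peak above the primer differs from it by the mass of one dideoxynucleotide; each later peak differs from its predecessor by the mass of the newly added deoxynucleotide residue (the terminator deficit cancels between neighbouring fragments). The residue masses are approximate average values, and the function names, tolerance and primer mass are assumptions made for illustration.

```python
# Approximate average residue masses (Da) of 2'-deoxynucleoside monophosphates in a chain.
DNMP = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
DIDEOXY_DEFICIT = 16.00  # a 2',3'-dideoxy terminator has one oxygen fewer than its dNMP


def closest_base(delta: float, table: dict[str, float], tol: float) -> str:
    """Return the base whose mass is nearest to delta, or raise if none is within tol."""
    base = min(table, key=lambda b: abs(table[b] - delta))
    if abs(table[base] - delta) > tol:
        raise ValueError(f"mass step {delta:.2f} matches no nucleotide")
    return base


def decode_ladder(primer_mass: float, peaks: list[float], tol: float = 2.0) -> str:
    """Read a sequence from the masses of a complete nested, terminated ladder."""
    peaks = sorted(peaks)
    if not peaks:
        return ""
    ddnmp = {b: m - DIDEOXY_DEFICIT for b, m in DNMP.items()}
    sequence = [closest_base(peaks[0] - primer_mass, ddnmp, tol)]
    for lighter, heavier in zip(peaks, peaks[1:]):
        sequence.append(closest_base(heavier - lighter, DNMP, tol))
    return "".join(sequence)


def simulate_ladder(primer_mass: float, sequence: str) -> list[float]:
    """Build the ideal ladder for a sequence (used here only to exercise decode_ladder)."""
    peaks, chain = [], primer_mass
    for base in sequence:
        peaks.append(chain + DNMP[base] - DIDEOXY_DEFICIT)
        chain += DNMP[base]
    return peaks


if __name__ == "__main__":
    ladder = simulate_ladder(5000.0, "GATTACA")
    print(decode_ladder(5000.0, ladder))
```

The smallest gap between two residue masses is about 9 Da (A versus T), so the mass accuracy of the instrument must be comfortably better than that for the assignment to stay unambiguous.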
  • the generation of multiple ion peaks which can be obtained using ES mass spectrometry can increase the accuracy of the mass determination.
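How multiple ion peaks improve accuracy can be illustrated by charge-state deconvolution: every pair of neighbouring peaks gives an independent estimate of the neutral mass, and the estimates are averaged. This sketch assumes negative-ion peaks of the form [M - zH]z-, and that neighbouring peaks differ by exactly one charge; the function name is hypothetical.

```python
PROTON = 1.00728  # mass of a proton in Da


def neutral_mass_from_series(mz_peaks: list[float]) -> float:
    """Estimate the neutral mass M from a series of [M - zH]z- peaks.

    For neighbouring charge states z and z + 1:
        m/z(z) + PROTON = M / z   and   m/z(z+1) + PROTON = M / (z + 1),
    so z = (m/z(z+1) + PROTON) / (m/z(z) - m/z(z+1)).
    """
    peaks = sorted(mz_peaks, reverse=True)  # highest m/z is the lowest charge
    if len(peaks) < 2:
        raise ValueError("need at least two charge states")
    estimates = []
    for low_charge_mz, high_charge_mz in zip(peaks, peaks[1:]):
        z = round((high_charge_mz + PROTON) / (low_charge_mz - high_charge_mz))
        estimates.append(z * (low_charge_mz + PROTON))
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    true_mass = 6000.0
    series = [true_mass / z - PROTON for z in (3, 4, 5)]
    print(round(neutral_mass_from_series(series), 3))
```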
  • the process of multiplexing by mass-modified nucleic acid primers (UP) is illustrated by way of example herein for mass analyzing four different DNA clones simultaneously.
  • the first reaction mixture is obtained by standard Sanger DNA sequencing having unknown DNA fragment 1 (clone 1) integrated in an appropriate vector (e.g., M13mp18), employing an unmodified nucleic acid primer UP⁰, and a standard mixture of the four unmodified deoxynucleoside triphosphates, dNTP⁰, and with 1/10th of one of the four dideoxynucleoside triphosphates, ddNTP⁰.
  • a second reaction mixture for DNA fragment 2 (clone 2) is obtained by employing a mass-modified nucleic acid primer UP¹ and, as before, the four unmodified nucleoside triphosphates, dNTP⁰, containing in each separate Sanger reaction 1/10th of the chain-terminating unmodified dideoxynucleoside triphosphates ddNTP⁰.
  • the third and fourth sets of four Sanger reactions have the following compositions: DNA fragment 3 (clone 3), UP², dNTP⁰, ddNTP⁰ and DNA fragment 4 (clone 4), UP³, dNTP⁰, ddNTP⁰.
  • the first nucleotides of the four unknown DNA sequences of clones 1 to 4 are determined.
  • the process is repeated, having memorized the molecular masses of the four specific first extension products, until the four sequences are assigned.
  • Unambiguous mass/sequence assignments are possible even in the worst case scenario in which the four mass-modified nucleic acid primers are extended by the same dideoxynucleoside triphosphate, the extension products then being, for example, UP⁰-ddT, UP¹-ddT, UP
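The worst case described above can be checked numerically. In this sketch each clone has its own primer mass, every clone/terminator combination defines one candidate mass, and a pooled peak is accepted only if exactly one candidate lies within the tolerance. The primer masses (spaced 50 Da apart), the tolerance and the function name are hypothetical; the terminator masses are approximate average values.

```python
# Approximate average masses (Da) added by each 2',3'-dideoxynucleotide terminator.
DDNMP = {"A": 297.21, "C": 273.18, "G": 313.21, "T": 288.20}


def assign_first_extensions(
    primer_masses: dict[str, float], peaks: list[float], tol: float = 2.0
) -> dict[str, str]:
    """Map each pooled peak to exactly one (clone, first base) pair."""
    candidates = {
        (clone, base): mass + DDNMP[base]
        for clone, mass in primer_masses.items()
        for base in DDNMP
    }
    calls = {}
    for peak in peaks:
        hits = [key for key, m in candidates.items() if abs(m - peak) <= tol]
        if len(hits) != 1:
            raise ValueError(f"peak {peak:.2f} is ambiguous or unexplained")
        clone, base = hits[0]
        calls[clone] = base
    return calls


if __name__ == "__main__":
    primers = {"clone1": 5000.0, "clone2": 5050.0, "clone3": 5100.0, "clone4": 5150.0}
    # Worst case: all four primers are extended by ddT.
    pooled = [5288.2, 5338.2, 5388.2, 5438.2]
    print(assign_first_extensions(primers, pooled))
```

The mass increments must be chosen so that no two candidates fall within the tolerance of each other; the 50 Da spacing used here exceeds the roughly 40 Da spread between the lightest and heaviest terminator.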
  • an analogous technique is employed using different vectors containing, for example, the SP6 and/or T7 promoter sequences, and performing transcription with the nucleic acid primers UP⁰, UP¹, UP² and UP³ and an RNA polymerase (e.g., SP6 or T7 RNA polymerase) with chain-elongating and terminating unmodified nucleoside triphosphates NTP⁰ and 3'-dNTP⁰.
  • the DNA sequence is being determined by Sanger RNA sequencing.
  • the first DNA Sanger sequencing reaction (DNA fragment 1, clone 1) is the standard mixture employing unmodified nucleic acid primer UP⁰, dNTP⁰ and, in each of the four reactions, one of the four ddNTP⁰.
  • the second (DNA fragment 2, clone 2) and the third (DNA fragment 3, clone 3) have the following contents: UP⁰, dNTP⁰, ddNTP¹ and UP⁰, dNTP⁰, ddNTP², respectively.
  • an amplification of the mass increment in mass-modifying the extended DNA fragments can be achieved by either using an equally mass-modified deoxynucleoside triphosphate (i.e., dNTP¹, dNTP²) for chain elongation alone or in conjunction with the homologous equally mass-modified dideoxynucleoside triphosphate.
  • the contents of the reaction mixtures can be as follows: either UP⁰/dNTP⁰/ddNTP⁰, UP⁰/dNTP¹/ddNTP⁰ and UP⁰/dNTP²/ddNTP⁰ or UP⁰/dNTP⁰/ddNTP⁰, UP⁰/dNTP¹/ddNTP¹ and UP⁰/dNTP²/ddNTP².
  • DNA sequencing can be performed by Sanger RNA sequencing employing unmodified nucleic acid primers, UP⁰, and an appropriate mixture of chain-elongating and terminating nucleoside triphosphates.
  • the mass-modification can be again either in the chain-terminating nucleoside triphosphate alone or in conjunction with mass-modified chain-elongating nucleoside triphosphates.
  • Multiplexing is achieved by pooling the three base-specifically terminated sequencing reactions (e.g., the ddTTP terminated products) and simultaneously analyzing the pooled products by mass spectrometry. Again, the first extension products of the known nucleic acid primer sequence are assigned, e.g., via a computer program.
  • Mass/sequence assignments are possible even in the worst case in which the nucleic acid primer is extended/terminated by the same nucleotide, e.g., ddT, in all three clones.
  • the following configurations thus obtained can be well differentiated by their different mass modifications: UP⁰-ddT⁰, UP⁰-ddT¹, UP⁰-ddT².
  • DNA sequencing by multiplex mass spectrometry can be achieved by cloning the DNA fragments to be sequenced in "plex-vectors" containing vector specific "tag sequences" as described (Köster et al., "Oligonucleotide Synthesis and Multiplex DNA Sequencing Using Chemiluminescent Detection," Nucleic Acids Res. Symposium Ser. No.
  • the four base-specifically terminated multiplex DNA fragment families are run by the mass spectrometer and all ddT⁰, ddA⁰, ddC⁰ and ddG⁰ terminated molecular ion peaks are respectively detected and memorized.
  • Assignment of, for example, ddT terminated DNA fragments to a specific fragment family is accomplished by another mass spectrometric analysis after hybridization of the specific tag probe (TP) to the corresponding tag sequence contained in the sequence of this specific fragment family.
  • the differentiation of the tag probes for the different multiplexed clones can be obtained just by the DNA sequence and its ability to Watson-Crick base pair to the tag sequence. It is well known in the art how to calculate stringency conditions to provide for specific hybridization of a given tag probe with a given tag sequence (see, for example, Molecular Cloning: A Laboratory Manual, 2nd ed., ed. by Sambrook, Fritsch and Maniatis (Cold Spring Harbor Laboratory Press: NY, 1989), Chapter 11). Furthermore, differentiation can be obtained by designing the tag sequence for each plex-vector to have a sufficient mass difference so as to be unique just by changing the length or base composition or by mass-modifications.
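Both properties of a tag probe mentioned above, its hybridization stringency and its mass, can be estimated from base composition. The sketch below uses two widely quoted rules of thumb that are not taken from the source: the Wallace rule for the melting temperature of short oligonucleotides, Tm = 2(A+T) + 4(G+C) °C, and the sum of average residue masses minus 61.96 Da for an unmodified oligodeoxynucleotide. The probe sequences are made up.

```python
# Approximate average residue masses (Da) of 2'-deoxynucleoside monophosphates.
DNMP = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def _validated(oligo: str) -> str:
    oligo = oligo.upper()
    if not oligo or any(base not in DNMP for base in oligo):
        raise ValueError("oligo must be a non-empty string of A, C, G, T")
    return oligo


def wallace_tm(oligo: str) -> int:
    """Rule-of-thumb melting temperature in °C for short oligonucleotides."""
    oligo = _validated(oligo)
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def oligo_mass(oligo: str) -> float:
    """Approximate average mass (Da) of an unmodified oligodeoxynucleotide."""
    oligo = _validated(oligo)
    return sum(DNMP[base] for base in oligo) - 61.96


if __name__ == "__main__":
    for probe in ("ACGTACGTACGTACGT", "AATTAATTAATTAATTAA"):  # hypothetical tag probes
        print(probe, wallace_tm(probe), "°C", round(oligo_mass(probe), 2), "Da")
```

Two probes of different length or composition therefore differ in both quantities, which is what allows them to be told apart by hybridization behaviour and by mass.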
  • the DNA sequence is unraveled again by searching for the lowest molecular weight molecular ion peak corresponding to the known UP⁰-tag sequence/tag probe molecular weight plus the first extension product, e.g., ddT⁰, then the second, the third, etc.
  • a further increase in multiplexing can be achieved by using, in addition to the tag probe/tag sequence interaction, mass-modified nucleic acid primers and/or mass-modified deoxynucleoside, dNTP⁰⁻ⁱ, and/or dideoxynucleoside triphosphates, ddNTP⁰⁻ⁱ.
  • tag sequence/tag probe multiplexing approach is not limited to Sanger DNA sequencing generating nested DNA fragments with DNA polymerases.
  • the DNA sequence can also be determined by transcribing the unknown DNA sequence from appropriate promoter-containing vectors (see above) with various RNA polymerases and mixtures of NTP⁰⁻ⁱ/3'-dNTP⁰⁻ⁱ, thus generating nested RNA fragments.
  • the mass-modifying functionality can be introduced by a two or multiple step process.
  • kits for sequencing nucleic acids by mass spectrometry which include combinations of the above-described sequencing reactants.
  • the kit comprises reactants for multiplex mass spectrometric sequencing of several different species of nucleic acid.
  • the kit can include a solid support having a linking functionality (L') for immobilization of the base-specifically terminated products; at least one nucleic acid primer having a linking group (L) for reversibly and temporarily linking the primer and solid support through, for example, a photocleavable bond; a set of chain-elongating nucleotides (e.
  • dATP, dCTP, dGTP and dTTP, or ATP, CTP, GTP and UTP
  • chain-terminating nucleotides such as 2',3'-dideoxynucleotides for DNA synthesis or 3'-deoxynucleotides for RNA synthesis
  • an appropriate polymerase for synthesizing complementary nucleotides.
  • Primers and/or terminating nucleotides can be mass-modified so that the base-specifically terminated fragments generated from one of the species of nucleic acids to be sequenced can be distinguished by mass spectrometry from all of the others.
  • a set of tag probes can be included in the kit.
  • the kit can also include appropriate buffers as well as instructions for performing multiplex mass spectrometry to concurrently sequence multiple species of nucleic acids.
  • a nucleic acid sequencing kit can comprise a solid support as described above, a primer for initiating synthesis of complementary nucleic acid fragments, a set of chain-elongating nucleotides and an appropriate polymerase.
  • the mass-modified chain-terminating nucleotides are selected so that the addition of one of the chain terminators to a growing complementary nucleic acid can be distinguished by mass spectrometry.
  • the invention features a process for directly amplifying and base specifically terminating a nucleic acid molecule.
  • a combined amplification and termination reaction is performed on a nucleic acid template using: (i) a complete set of chain-elongating nucleotides; (ii) at least one chain-terminating nucleotide; (iii) a first DNA polymerase, which has a relatively low affinity towards the chain-terminating nucleotide; and (iv) a second DNA polymerase, which has a relatively high affinity towards the chain-terminating nucleotide, so that polymerization by the enzyme with relatively low affinity for the chain-terminating nucleotide leads to amplification of the template, whereas the enzyme with relatively high affinity for the chain-terminating nucleotide terminates the polymerization and yields sequencing products.
  • the combined amplification and sequencing can be based on any amplification procedure that employs an enzyme with polynucleotide synthetic ability (e.g. polymerase).
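As a rough illustration of what "base specifically terminated" products look like, the following minimal sketch lists the fragments that would end at each occurrence of one chain-terminating base along a newly synthesized strand. The function name and the example sequence are hypothetical, and the sketch ignores the amplification side of the reaction entirely.

```python
def terminated_fragments(extended_strand: str, terminator_base: str) -> list[str]:
    """Return every prefix of the synthesized strand (5'->3') that ends at terminator_base.

    Each prefix stands for one member of the sequencing ladder produced when
    the matching chain-terminating nucleotide is incorporated at that position.
    """
    return [
        extended_strand[: i + 1]
        for i, base in enumerate(extended_strand)
        if base == terminator_base
    ]


# Hypothetical strand; an "A" terminator yields one fragment per A.
ladder = terminated_fragments("GATTACA", "A")
lengths = [len(fragment) for fragment in ladder]
print(ladder)   # ['GA', 'GATTA', 'GATTACA']
print(lengths)  # [2, 5, 7]
```

Reading the fragment lengths for all four terminators in order of size recovers the sequence, which is the principle the separation and detection steps rely on.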
  • One preferred process, based on the polymerase chain reaction (PCR), comprises the following three thermal steps: 1) denaturing a double stranded (ds) DNA molecule at an appropriate temperature and for an appropriate period of time to obtain the two single stranded (ss) DNA molecules (the template: sense and antisense strand); 2) contacting the template with at least one primer that hybridizes to at least one ss DNA template at an appropriate temperature and for an appropriate period of time to obtain a primer containing ss DNA template; 3) contacting the primer containing template at an appropriate temperature and for an appropriate period of time with: (i) a complete set of chain elongating nucleotides, (ii) at least one chain terminating nucleotide, (iii) a first DNA polymerase, which has a relatively low affinity towards the chain terminat
  • Steps 1)-3) can be sequentially performed for an appropriate number of times (cycles) to obtain the desired amount of amplified sequencing ladders.
  • the quantity of the base specifically terminated fragment desired dictates how many cycles are performed. Although an increased number of cycles results in an increased level of amplification, it may also detract from the sensitivity of a subsequent detection. It is therefore generally undesirable to perform more than about 50 cycles, and is more preferable to perform less than about 40 cycles (e.g. about 20-30 cycles).
  • the first denaturation step is performed at a temperature in the range of about 85°C to about 100°C (most preferably about 92°C to about 96°C) for about 20 seconds (s) to about 2 minutes (most preferably about 30 s to 1 minute).
  • the second hybridization step is preferably performed at a temperature in the range of about 40°C to about 80°C (most preferably about 45°C to about 72°C) for about 20 s to about 2 minutes (most preferably about 30 s to 1 minute).
  • the third, primer extension step is preferably performed at about 65°C to about 80°C (most preferably about 70°C to about 74°C) for about 30 s to about 3 minutes (most preferably about 1 to about 2 minutes).
  • each of the single stranded sense and antisense templates generated from the denaturing step can be contacted with appropriate primers in step 2), so that amplified and chain terminated nucleic acid molecules generated in step 3), are complementary to both strands.
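The temperature, time and cycle ranges given above can be collected into a small lookup table and used to sanity-check a proposed cycling profile. The sketch below is only illustrative: it treats the "about" limits of the broad ranges as hard bounds, which is an assumption, and the names `StepLimits`, `LIMITS` and `check_profile` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepLimits:
    min_c: float  # lowest temperature in degrees C
    max_c: float  # highest temperature in degrees C
    min_s: float  # shortest duration in seconds
    max_s: float  # longest duration in seconds


# Broad ranges from the text (the "about" qualifiers are treated as hard limits here).
LIMITS = {
    "denaturation": StepLimits(85, 100, 20, 120),
    "hybridization": StepLimits(40, 80, 20, 120),
    "extension": StepLimits(65, 80, 30, 180),
}
MAX_CYCLES = 50  # more than about 50 cycles is described as generally undesirable


def check_profile(profile: dict[str, tuple[float, float]], cycles: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the profile fits."""
    problems = []
    for step, lim in LIMITS.items():
        temp_c, seconds = profile[step]
        if not lim.min_c <= temp_c <= lim.max_c:
            problems.append(f"{step}: {temp_c} C outside {lim.min_c}-{lim.max_c} C")
        if not lim.min_s <= seconds <= lim.max_s:
            problems.append(f"{step}: {seconds} s outside {lim.min_s}-{lim.max_s} s")
    if cycles > MAX_CYCLES:
        problems.append(f"{cycles} cycles exceeds about {MAX_CYCLES}")
    return problems


# Hypothetical profile chosen from within the "most preferably" ranges.
example = {"denaturation": (94, 45), "hybridization": (55, 45), "extension": (72, 90)}
issues = check_profile(example, cycles=25)
print(issues)  # []
```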
  • strand displacement amplification (SDA)
  • this process involves the following three steps, which altogether comprise a cycle: 1) denaturing a double stranded (ds) DNA molecule containing the sequence to be amplified at an appropriate temperature and for an appropriate period of time to obtain the two single stranded (ss) DNA molecules (the template: sense and antisense strand); 2) contacting the template with at least one primer (P), that contains a recognition/cleavage site for a restriction endonuclease (RE) and that hybridizes to at least one ss DNA template at an appropriate temperature and for an appropriate period of time to obtain a primer containing ss DNA template; 3) contacting the primer containing template at an appropriate temperature and for an appropriate period of time with: (i) a complete set of chain elongating nucleotides; (ii) at least one chain terminating nucleotide, (iii) a first DNA polymerase, which has a relatively low affinity towards the chain terminating nucleotide; (iv) a second DNA poly
  • Steps 1)-3) can be sequentially performed for an appropriate number of times (cycles) to obtain the desired amount of amplified sequencing ladders.
  • the quantity of the base specifically terminated fragment desired dictates how many cycles are performed. Preferably, less than 50 cycles, more preferably less than about 40 cycles and most preferably about 20 to 30 cycles are performed.
  • the amplified sequencing ladders obtained as described above can be separated and detected and/or quantitated using well established methods, such as polyacrylamide gel electrophoresis (PAGE), or capillary zone electrophoresis (CZE) (Jorgenson et al., J. Chromatography 352, 337 (1986); Gesteland et al., Nucleic Acids Res. 18, 1415-1419 (1990)); or direct blotting electrophoresis (DBE) (Beck and Pohl, EMBO J, vol. 3: pp. 2905-2909 (1984)) in conjunction with, for example, colorimetry, fluorimetry, chemiluminescence and radioactivity.
  • Dye-terminator chemistry can be employed in the combined amplification and sequencing reaction to enable the simultaneous generation of forward and reverse sequence ladders, which can be separated based on the streptavidin-biotin system when one biotinylated primer is provided.
  • Step A represents the exponential amplification of a target sequence by the polymerase with a low affinity for ddNTPs.
  • One of the sequence specific oligonucleotide primers is biotinylated.
  • Step B represents the generation of a sequence ladder either from the original template or the simultaneously generated amplification product carried out by the polymerase with a high affinity for ddNTPs.
  • In Step C, biotinylated forward sequencing products and reverse products hybridized to the forward template are immobilized.
  • In Step D, the immobilized strands are washed and separated by denaturation with ammonium hydroxide at room temperature.
  • the non-biotinylated reverse sequencing products are removed from the beads with ammonium hydroxide supernatant during this procedure.
  • the biotinylated forward sequencing products remain immobilized to the beads and are re-solubilized with ammonium hydroxide at 60°C. After ethanol precipitation, both sequencing species can be resuspended in loading dye and run on an automated sequencer, for example.
  • the sequencing ladders can be directly detected without first being separated using several mass spectrometer formats.
  • Amenable formats for use in the invention include ionization techniques such as matrix-assisted laser desorption (MALDI), continuous or pulsed electrospray (ESI) and related methods (e.g. Ionspray or Thermospray), and massive cluster impact (MCI); these ion sources can be matched with a detection format, such as linear or reflection time-of-flight (TOF), single or multiple quadrupole, single or multiple magnetic sector, Fourier Transform ion cyclotron resonance (FTICR), ion trap, or combinations of these to give a hybrid detector (e.g. ion trap-TOF).
  • numerous matrix/wavelength combinations (MALDI) or solvent combinations (ESI) can be employed.
  • the above-described process can be performed using virtually any nucleic acid molecule as the source of the DNA template.
  • the nucleic acid molecule can be: a) single stranded or double stranded; b) linear or covalently closed circular in supercoiled or relaxed form; or c) RNA if combined with reverse transcription to generate a cDNA.
  • reverse transcription can be performed using a suitable reverse transcriptase (e.g. Moloney murine leukemia virus reverse transcriptase) using standard techniques (e.g. Kawasaki (1990) in PCR Protocols: A Guide to Methods and Applications, Innis et al., eds., Academic Press, Berkeley, CA pp. 21-27).
  • Sources of nucleic acid templates can include: a) plasmids (naturally occurring or recombinant); b) RNA or DNA viruses and bacteriophages (naturally occurring or recombinant); c) chromosomal or episomal replicating DNA (e.g. from tissue, a blood sample, or a biopsy); d) a nucleic acid fragment (e.g. derived by exonuclease, unspecific endonuclease or restriction endonuclease digestion or by physical disruption (e.g. sonication or nebulization)); and e) RNA or RNA transcripts like mRNAs.
  • the nucleic acid to be amplified and sequenced can be obtained from virtually any biological sample.
  • biological sample refers to any material obtained from any living source (e.g. human, animal, plant, bacteria, fungi, protist, virus).
  • appropriate biological samples for use in the instant invention include: solid materials (e.g. tissue, cell pellets, biopsies) and biological fluids (e.g. urine, blood, saliva, amniotic fluid, mouth wash, spinal fluid).
  • the nucleic acid to be amplified and sequenced can be provided by unpurified whole cells, bacteria or virus.
  • the nucleic acid can first be purified from a sample using standard techniques, such as: a) cesium chloride gradient centrifugation; b) alkaline lysis with or without RNAse treatment; c) ion exchange chromatography; d) phenol/chloroform extraction; e) isolation by hybridization to bound oligonucleotides; f) gel electrophoresis and elution; g) alcohol precipitation; and h) combinations of the above.
  • “chain-elongating nucleotides” and “chain-terminating nucleotides” are used in accordance with their art recognized meaning.
  • chain-elongating nucleotides include 2'-deoxyribonucleotides (e.g. dATP, dCTP, dGTP and dTTP) and chain-terminating nucleotides include 2',3'-dideoxyribonucleotides (e.g. ddATP, ddCTP, ddGTP, ddTTP).
  • chain-elongating nucleotides include ribonucleotides (e.g., ATP, CTP, GTP and UTP) and chain-terminating nucleotides include 3'-deoxyribonucleotides (e.g. 3'dA, 3'dC, 3'dG and 3'dU).
  • a complete set of chain elongating nucleotides refers to dATP, dCTP, dGTP and dTTP.
  • nucleotide is also well known in the art.
  • nucleotides include nucleoside mono-, di-, and triphosphates.
  • Nucleotides also include modified nucleotides, such as phosphorothioate nucleotides and deazapurine nucleotides.
  • a complete set of chain-elongating nucleotides refers to four different nucleotides that can hybridize to each of the four different bases comprising the DNA template.
  • when the amplified sequencing ladders are to be detected by mass spectrometric analysis, it may be useful to "condition" nucleic acid molecules, for example to decrease the laser energy required for volatilization and/or to minimize fragmentation. Conditioning is preferably performed while the sequencing ladders are immobilized.
  • An example of conditioning is modification of the phosphodiester backbone of the nucleic acid molecule (e.g. cation exchange), which can be useful for eliminating peak broadening due to a heterogeneity in the cations bound per nucleotide unit.
  • nucleic acid molecule which contains an α-thio-nucleoside-triphosphate during polymerization with an alkylating agent such as alkyliodide, iodoacetamide, β-iodoethanol, or 2,3-epoxy-1-propanol
  • Further conditioning involves incorporating nucleotides which reduce sensitivity for depurination (fragmentation during MS), e.g. a purine analog such as N7- or N9-deazapurine nucleotides, and partial RNA containing oligodeoxynucleotide to be able to remove the unmodified primer from the amplified and modified sequencing ladders by RNAse or alkaline treatment.
  • the N7 deazapurine nucleotides reduce the formation of secondary structure resulting in band compression from which no sequencing information can be generated.
  • Critical to the novel process of the invention is the use of appropriate amounts of two different polymerase enzymes, each having a different affinity for the particular chain terminating nucleotide, so that polymerization by the enzyme with relatively low affinity for the chain terminating nucleotide leads to amplification whereas the enzyme with relatively high affinity for the chain terminating nucleotide terminates the polymerization and yields sequencing products.
  • Preferably about 0.5 to about 3 units of polymerase is used in the combined amplification and chain termination reaction. Most preferably about 1 to 2 units is used.
  • thermostable polymerases such as Taq DNA polymerase (Boehringer Mannheim), AmpliTaq FS DNA polymerase (Perkin-Elmer), Deep Vent (exo-), Vent, Vent (exo-) and Deep Vent DNA polymerases (New England Biolabs), Thermo Sequenase (Amersham) or exo(-) Pyrococcus furiosus (Pfu) DNA polymerase (Stratagene, Heidelberg, Germany), as well as AmpliTaq, UlTma, 9°Nm, Tth, Hot Tub, and Pyrococcus furiosus polymerases. In addition, preferably the polymerase does not have 5'-3' exonuclease activity.
  • the process of the invention can be carried out using AmpliTaq FS DNA polymerase (Perkin-Elmer), which has a relatively high affinity and Taq DNA polymerase, which has a relatively low affinity for chain terminating nucleotides.
  • Other appropriate polymerase pairs for use in the instant invention can be determined by one of skill in the art. (See e.g. S. Tabor and C.C. Richardson (1995) Proc. Nat. Acad. Sci. (USA), vol. 92: pp. 6339-6343.) In addition to polymerases, which have a relatively high and a relatively low affinity to the chain terminating nucleotide, a third polymerase, which has proofreading capacity (e.g. Pyrococcus woesei (Pwo) DNA polymerase) may also be added to the a
  • Oligonucleotide primers for use in the invention can be designed based on knowledge of the 5' and/or 3' regions of the nucleotide sequence to be amplified and sequenced, e.g., insert flanking regions of cloning and sequencing vectors (such as M13, pUC, phagemid, cosmid).
  • at least one primer used in the chain extension and termination reaction can be linked to a solid support to facilitate purification of amplified product from primers and other reactants, thereby increasing yield or to separate the Sanger ladders from the sense and antisense template strand where simultaneous amplification-sequencing of both a sense and antisense strand of the template DNA has been performed.
  • examples of solid supports include beads (silica gel, controlled pore glass, magnetic beads, Sephadex/Sepharose beads, cellulose beads, etc.), capillaries, flat supports such as glass fiber filters, glass surfaces, metal surfaces (steel, gold, silver, aluminum, and copper), plastic materials or membranes (polyethylene, polypropylene, polyamide, polyvinylidenedifluoride) or beads in pits of flat surfaces such as wafers (e.g. silicon wafers), with or without filter plates.
  • Immobilization can be accomplished, for example, based on hybridization between a capture nucleic acid sequence, which has already been immobilized to the support and a complementary nucleic acid sequence, which is also contained within the nucleic acid molecule containing the nucleic acid sequence to be detected. So that hybridization between the complementary nucleic acid molecules is not hindered by the support, the capture nucleic acid can include a spacer region of at least about five nucleotides in length between the solid support and the capture nucleic acid sequence. The duplex formed will be cleaved under the influence of the laser pulse and desorption can be initiated.
  • the solid support-bound base sequence can be presented through natural oligoribo- or oligodeoxyribonucleotide as well as analogs (e.g. thio-modified phosphodiester or phosphotriester backbone) or employing oligonucleotide mimetics such as PNA analogs (see e.g. Nielsen et al., Science, 254, 1497 (1991)) which render the base sequence less susceptible to enzymatic degradation and hence increase overall stability of the solid support-bound capture base sequence.
  • a target detection site can be directly linked to a solid support via a reversible or irreversible bond between an appropriate functionality (L') on the target nucleic acid molecule and an appropriate functionality (L) on the capture molecule.
  • a reversible linkage can be such that it is cleaved under the conditions of mass spectrometry (i.e., a photocleavable bond such as a trityl ether bond or a charge transfer complex or a labile bond being formed between relatively stable organic radicals).
  • the linkage can be formed with L' being a quaternary ammonium group, in which case, preferably, the surface of the solid support carries negative charges which repel the negatively charged nucleic acid backbone and thus facilitate the desorption required for analysis by a mass spectrometer.
  • Desorption can occur either by the heat created by the laser pulse and/or, depending on L', by specific absorption of laser energy which is in resonance with the L' chromophore.
  • the L-L' chemistry can be of a type of disulfide bond (chemically cleavable, for example, by mercaptoethanol or dithioerythritol), a biotin/streptavidin system, a heterobifunctional derivative of a trityl ether group (Köster et al., "A Versatile Acid-Labile Linker for Modification of Synthetic Biomolecules," Tetrahedron Letters 31, 7095 (1990)) which can be cleaved under mildly acidic conditions as well as under conditions of mass spectrometry, a levulinyl group cleavable under almost neutral conditions with a hydrazinium/acetate buffer, an arginine-arginine or lysine-lysine bond cleavable by an endopeptidase enzyme like trypsin or a pyrophosphate bond cleavable by a pyrophosphatase or a ribonucle
  • the functionalities L and L' can also form a charge transfer complex and thereby form the temporary L-L' linkage. Since in many cases the "charge-transfer band" can be determined by UV/vis spectrometry (see e.g. Organic Charge Transfer Complexes by R. Foster, Academic Press, 1969), the laser energy can be tuned to the corresponding energy of the charge-transfer wavelength and, thus, a specific desorption off the solid support can be initiated. Those skilled in the art will recognize that several combinations can serve this purpose and that the donor functionality can be either on the solid support or coupled to the nucleic acid molecule to be detected or vice versa.
  • a reversible L-L' linkage can be generated by homolytically forming relatively stable radicals. Under the influence of the laser pulse, desorption (as discussed above) as well as ionization will take place at the radical position.
  • a corresponding laser wavelength can be selected (see e.g. Reactive Molecules by C. Wentrup, John Wiley & Sons, 1984).
  • An anchoring function L' can also be incorporated into a target capturing sequence by using appropriate primers during an amplification procedure, such as PCR, LCR or transcription amplification.
  • it may be useful to simultaneously amplify and chain terminate more than one (mutated) locus on a particular captured nucleic acid fragment (on one spot of an array) or it may be useful to perform parallel processing by using oligonucleotide or oligonucleotide mimetic arrays on various solid supports.
  • Multiplexing can be achieved either by the sequence itself (composition or length) or by the introduction of mass-modifying functionalities into the primer oligonucleotide. Such multiplexing is particularly useful in conjunction with mass spectrometric DNA sequencing or mobility modified gel based fluorescence sequencing.
1.4.2.5 Mass or Mobility Modification
  • the mass or mobility modification can be introduced by using oligo/polyethylene glycol derivatives.
  • the oligo/polyethylene glycols can also be monoalkylated by a lower alkyl such as methyl, ethyl, propyl, isopropyl, t-butyl and the like.
  • Other chemistries can be used in the mass-modified compounds, as for example, those described recently in Oligonucleotides and Analogues- A Practical Approach, F. Eckstein, editor IRL Press, Oxford, 1991.
  • various mass or mobility modifying functionalities can be selected and attached via appropriate linking chemistries.
  • a simple modification can be achieved by using different alkyl, aryl or aralkyl moieties such as methyl, ethyl, propyl, isopropyl, t- butyl, hexyl, phenyl, substituted phenyl or benzyl.
  • Yet another modification can be obtained by attaching homo- or heteropeptides to the nucleic acid molecule (e.g., primer) or nucleoside triphosphates.
  • Simple oligoamides also can be used. Numerous other possibilities, in addition to those mentioned above, can be performed by one skilled in the art.
  • Different mass or mobility modified primers allow for multiplex sequencing via simultaneous detection of primer-modified Sanger sequencing ladders.
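To make the multiplexing idea concrete, the sketch below assigns each primer a different number of ethylene glycol repeat units and reports the resulting mass offset of its ladder. It is a simplified illustration only: the repeat-unit mass is the average mass of C2H4O, linker and end-group masses are ignored, and the primer names and the function names are hypothetical.

```python
# Average mass of one -CH2CH2O- repeat unit (C2H4O) in daltons.
ETHYLENE_GLYCOL_UNIT_DA = 44.053


def peg_mass_shift(n_units: int) -> float:
    """Mass added by an oligo(ethylene glycol) chain of n_units repeat units.

    Assumption: only the repeat units are counted; the linker and terminal
    group of a real mass-modifying moiety would add their own fixed mass.
    """
    return n_units * ETHYLENE_GLYCOL_UNIT_DA


def assign_mass_tags(primer_names: list[str], first_units: int = 1) -> dict[str, float]:
    """Give each primer one more repeat unit than the previous one."""
    return {
        name: peg_mass_shift(first_units + i)
        for i, name in enumerate(primer_names)
    }


# Hypothetical primers for three templates sequenced in one reaction.
tags = assign_mass_tags(["primer_1", "primer_2", "primer_3"])
for name, shift in tags.items():
    print(f"{name}: +{shift:.3f} Da")
```

Because every fragment in a ladder carries its primer, each ladder is offset as a whole by its tag mass, which is what allows the ladders to be told apart in a single spectrum, provided the offsets do not coincide with nucleotide mass differences.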
  • Mass or mobility modifications can be incorporated during the amplification process through nucleoside triphosphates or modified primers.
1.4.2.6 Kits for Amplified Base Specifically Terminated Fragments
  • kits for directly generating from a nucleic acid template, amplified base specifically terminated fragments include combinations of the above-described reactants.
  • the kit can comprise: (i) a set of chain-elongating nucleotides; (ii) a set of chain-terminating nucleotides; (iii) a first DNA polymerase, which has a relatively low affinity towards the chain terminating nucleotide; and (iv) a second DNA polymerase, which has a relatively high affinity towards the chain terminating nucleotide.
  • the kit can also include appropriate solid supports for capture/purification and buffers as well as instructions for use.
  • For use with certain detection means, such as polyacrylamide gel electrophoresis (PAGE), detectable labels must be used in either the primer (typically at the 5'-end) or in one of the chain extending nucleotides, or chain terminating nucleotides.
  • labeling with radioisotopes such as 32P, 33P, or 35S is still the most frequently used technique. After PAGE, the gels are exposed to X-ray films and silver grain exposure is analyzed.
  • Oligonucleotide arrays can be used in a wide variety of applications, including hybridization studies.
  • the array can be exposed to a receptor (R) of interest.
  • the receptor can be labelled with an appropriate label (*), such as fluorescein.
  • the locations on the substrate where the receptor has bound are determined and, through knowledge of the sequence of the oligonucleotide probe at that location one can then determine, if the receptor is an oligonucleotide, the sequence of the receptor.
  • Sequencing by hybridization is most efficiently practiced by attaching many probes to a surface to form an array in which the identity of the probe at each site is known. A labeled target DNA or RNA is then hybridized to the array, and the hybridization pattern is examined to determine the identity of all complementary probes in the array. Contrary to the teachings of the prior art, which teaches that mismatched probe/target complexes are not of interest, the present invention provides an analytical method in which the hybridization signal of mismatched probe/target complexes identifies or confirms the identity of the perfectly matched probe/target complexes on the array.
  • an array of all tetranucleotides was produced in sixteen cycles, which required only 4 hours to complete. Because combinatorial strategies are used, the number of different compounds on the array increases exponentially during synthesis, while the number of chemical coupling cycles increases only linearly. For example, expanding the synthesis to the complete set of 4^8 (65,536) octanucleotides adds only 4 hours (or less) to the synthesis due to the 16 additional cycles required. Furthermore, combinatorial synthesis strategies can be implemented to generate arrays of any desired probe composition.
  • any subset of the dodecamers can be constructed in 48 or fewer chemical coupling steps.
  • the number of compounds in an array is limited only by the density of synthesis sites and the overall array size.
  • the present invention has been practiced with arrays with probes synthesized in square sites 25 microns on a side. At this resolution, the entire set of 65,536 octanucleotides can be placed in an array measuring only 0.64 cm on a side (about 0.41 cm²).
  • the set of 1,048,576 decanucleotides requires only an array 2.56 cm on a side at this individual probe site size.
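The scaling described above (probe count growing as 4^k while coupling cycles grow as 4k) can be checked with a few lines of arithmetic. The sketch below assumes a square array of square synthesis sites, as in the 25 micron example; under that assumption the figures 0.64 and 2.56 correspond to the side length in centimeters of arrays holding 4^8 and 4^10 probes. The function names are hypothetical.

```python
def probe_count(k: int) -> int:
    """Number of distinct k-mers over the four bases."""
    return 4 ** k


def coupling_cycles(k: int) -> int:
    """Chemical coupling cycles for a full combinatorial synthesis: four per position."""
    return 4 * k


def array_side_cm(k: int, site_um: float = 25.0) -> float:
    """Side length of a square array holding all k-mers, one per square site.

    4**k sites arranged as a square grid gives 2**k sites along each side.
    """
    sites_per_side = 2 ** k
    return sites_per_side * site_um / 10_000  # 10,000 microns per centimeter


for k in (4, 8, 10, 12):
    side = array_side_cm(k)
    print(f"k={k}: {probe_count(k):,} probes, {coupling_cycles(k)} cycles, "
          f"{side:.2f} cm per side, {side * side:.2f} cm^2")
```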
  • although oligonucleotide arrays can be used for primary sequencing applications, many diagnostic methods involve the analysis of only a few nucleotide positions in a target nucleic acid sequence. Because single base changes cause multiple changes in the hybridization pattern of the target on a probe array, the oligonucleotide arrays and methods of the present invention enable one to check the accuracy of previously elucidated DNA sequences, or to scan for changes or mutations in certain specific sequences within a target nucleic acid. The latter is important, for example, for genetic disease, quality control, and forensic analysis.
  • a single base change in a target nucleic acid can be detected by the loss of eight perfect hybrids, and the generation of eight new perfect hybrids.
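The count of eight lost and eight gained perfect hybrids follows from the fact that, on an octanucleotide array, eight overlapping 8-mer windows cover any interior position of the target. A minimal sketch, using target subsequences as stand-ins for their complementary probes (the mapping is one to one) and a hypothetical 15-base target:

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct length-k windows of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hybrid_changes(target: str, position: int, new_base: str, k: int = 8):
    """Return (lost, gained) sets of perfectly matched k-mers after one base change."""
    mutant = target[:position] + new_base + target[position + 1:]
    before, after = kmers(target, k), kmers(mutant, k)
    return before - after, after - before


# Hypothetical target; position 7 (0-based) is covered by all eight 8-mer windows.
lost, gained = hybrid_changes("ACGGTCATGCAATCG", position=7, new_base="A")
print(len(lost), len(gained))  # 8 8
```

Positions closer than eight bases to either end of a target are covered by fewer windows, and repeated sequence can reduce the count, so eight is the value for an interior position in non-repetitive sequence.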
  • the single base change can also be detected through altered mismatch probe/target complex formation on the array. Perhaps even more surprisingly, such single base changes in a complex nucleic acid dramatically alter the overall hybridization pattern of the target to the array. According to the present invention such changes in the overall hybridization pattern are used to actually simplify the analysis.
  • Arrays can also be constructed to contain genetic markers for the rapid identification of a wide variety of pathogenic organisms, and to study the sequence specificity of RNA/RNA, RNA/DNA, protein/RNA or protein/DNA interactions.
  • Suitably protected RNA monomers can be employed for RNA synthesis, and a wide variety of synthetic and non-naturally occurring nucleic acid analogues can be used, depending upon the motivations of the practitioner. See, e.g., PCT patent Publication Nos. 91/19813, 92/05285, and 92/14843, incorporated herein by reference.
  • the oligonucleotide arrays can be used to deduce thermodynamic and kinetic rules governing the formation and stability of oligonucleotide complexes.
  • the support bound octanucleotide probes discussed above were hybridized to a target of 5'-GCGTAGGC-fluorescein in the hybridization chamber by incubation for 15 minutes at 15°C.
  • the array surface was then interrogated with an epifluorescence microscope (488 nm argon ion excitation).
  • the fluorescence intensity pattern matches the 800 × 1280 µm stripe used to direct the synthesis of the probe. Furthermore, the signal intensities are high (four times over the background of the glass substrate), demonstrating specific binding of the target to the probe.
  • the probe S-3'-CGCATCCG was synthesized in stripes 1, 3 and 5.
  • the probe S-3'-CGCTTCCG was synthesized in stripes 2, 4 and 6.
  • the results of hybridizing a 5'-GCGTAGGC-fluorescein target to the substrate at 15°C are depicted herein.
  • although the probes differ by only one internal base, the target hybridizes specifically to its complementary sequence (~500 counts above background in stripes 1, 3 and 5) with little or no detectable signal in positions 2, 4 and 6 (~10 counts).
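The relationship between the target and the two striped probes can be verified directly: writing the base-by-base complement of the 5'-GCGTAGGC target in the 3' to 5' direction gives the first probe, and the second probe differs from it at a single internal position. A minimal sketch, with hypothetical function names:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def probe_3_to_5(target_5_to_3: str) -> str:
    """Perfectly matched probe, written 3'->5' so it aligns base by base with the target."""
    return "".join(COMPLEMENT[base] for base in target_5_to_3)


def mismatch_positions(probe_a: str, probe_b: str) -> list[int]:
    """1-based positions (counted from the 3' end as written) where two probes differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(probe_a, probe_b)) if a != b]


matched = probe_3_to_5("GCGTAGGC")
print(matched)                                  # CGCATCCG
print(mismatch_positions(matched, "CGCTTCCG"))  # [4]
print(500 / 10)                                 # roughly 50-fold signal difference
```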
  • the process continues through round 2 to form sixteen dinucleotides.
  • the masks of round 3 further subdivide the synthesis regions so that each coupling cycle generates 16 trimers.
  • the subdivision of the substrate is continued through round 4 to form the tetranucleotides.
  • the synthesis of this probe matrix can be compactly represented in polynomial notation as (A+C+G+T)^4. Expansion of this polynomial yields the 256 tetranucleotides.
  • S-3'-CGCAGCCG (554 counts), S-3'-CGCCGACG (317 counts), S-3'-CGCCGTCG (272 counts), S-3'-CGACGCCG (242 counts), S-3'-CGTCGCCG (203 counts), S-3'-CGCCCCCG (180 counts), S-3'-CGCTGCCG (163 counts), S-3'-CGCCACCG (125 counts), and S-3'-CGCCTCCG (78 counts).
  • the arrays discussed herein can be utilized in the present method to determine the nucleic acid sequence of an oligonucleotide of length n using an array of probes of shorter length k.
  • the target has a sequence 5'-XXYXY-3', where X and Y are complementary nucleic acids such as A and T or C and G.
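Expanding the polynomial amounts to taking the Cartesian product of the four bases with itself once per synthesis round. A short sketch using the standard library (the function name `expand` is hypothetical):

```python
from itertools import product


def expand(bases: str = "ACGT", rounds: int = 4) -> list[str]:
    """All sequences obtained by choosing one base per round, in lexicographic order."""
    return ["".join(choice) for choice in product(bases, repeat=rounds)]


for rounds in (1, 2, 3, 4):
    print(rounds, len(expand(rounds=rounds)))  # 4, 16, 64, 256

tetramers = expand()
print(tetramers[0], tetramers[-1])  # AAAA TTTT
```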
  • the example is simplified by using only two bases and very short sequences, but the technique can easily be extended to larger nucleic acids with, for example, all 4 RNA or DNA bases.
  • the sequence of the target is, generally, not known ab initio.
  • an array of all possible X and Y 4-mers is synthesized and then used to determine the sequence of a 5-mer target.
  • the core probe is exactly complementary to a sequence in the target using the mismatch analysis method of the present invention.
  • the core probe is identified using one or both of the following criteria:
  • the core probe exhibits stronger binding affinity to the target than other probes, typically the strongest binding affinity of any probe in the array (that has not been identified as a core probe in a previous cycle of analysis).
  • Probes that are mismatched with the target exhibit a characteristic pattern, discussed in greater detail below, in which probes that mismatch at the 3'- and 5'-end of the probe bind more strongly to the target than probes that mismatch at interior positions.
  • selection criteria #1 identifies a core 4-mer probe with the strongest binding affinity to the target that has the sequence 3'-YYXY.
  • the probe 3'-YYXY (corresponding to the 5'-XXYX position of the target) is, therefore, chosen as the "core" probe.
  • Selection criteria #2 is utilized as a "check" to ensure the core probe is exactly complementary to the target nucleic acid.
  • the second selection criteria evaluates hybridization data (such as the fluorescence intensity of a labeled target hybridized to an array of probes on a substrate, although other techniques are well known to those of skill in the art) of probes that have single base mismatches as compared to the core probe.
  • the core probe has been selected as S-3'-YYXY.
  • the single base mismatched probes of this core probe are: S-3'-XYXY, S-3'-YXXY, S-3'-YYYY, and S-3'-YYXX.
  • the binding affinity characteristics of these single base mismatches are utilized to ensure that a "correct" core has been selected, or to select the core probe from among a set of probes exhibiting similar binding affinities.
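In the two-letter example, the core probe and its single base mismatches can be enumerated mechanically. The sketch below derives the probe complementary to the first four bases of the 5'-XXYXY target and lists the four probes that differ from it at exactly one position; the function names are hypothetical.

```python
PAIR = {"X": "Y", "Y": "X"}  # X and Y are complementary in this simplified example


def complementary_probe(target_5_to_3: str) -> str:
    """Probe written 3'->5', aligned base by base with the target segment."""
    return "".join(PAIR[base] for base in target_5_to_3)


def single_base_mismatches(core: str, alphabet: str = "XY") -> list[str]:
    """Probes differing from the core at exactly one position, ordered by position."""
    variants = []
    for i, base in enumerate(core):
        for alternative in alphabet:
            if alternative != base:
                variants.append(core[:i] + alternative + core[i + 1:])
    return variants


core = complementary_probe("XXYX")
print(core)                          # YYXY
print(single_base_mismatches(core))  # ['XYXY', 'YXXY', 'YYYY', 'YYXX']
```

With a four-letter alphabet such as "ACGT" the same function returns three mismatch probes per position instead of one.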
  • binding affinity values typically fluorescence intensity of labeled target hybridized to probe, although many other factors relating to affinity may be utilized
  • the binding affinity values are all normalized to the binding affinity of S-3'-YYXY to the target, which is plotted as a value of 1. Because only two nucleotides are involved in this example, the value plotted for a probe mismatched at position 1 (the nucleotide at the 3'-end of the probe) is the normalized binding affinity of S-3'-XYXY.
  • the value plotted for mismatch at position 2 is the normalized affinity of S-3'-YXXY.
  • the value plotted for mismatch at position 3 is the normalized affinity of S-3'-YYYY.
  • the value plotted for mismatch position 4 is the normalized affinity of S-3'-YYXX.
  • affinity may be measured in a number of ways including, for example, the number of photon counts from fluorescence markers on the target.
  • the affinity of all three mismatches is lower than the core in this illustration. Moreover, the affinity plot shows that a mismatch at the 3'-end of the probe has less impact than a mismatch at the 5'-end of the probe in this particular case, although this may not always be the case. Further, mismatches at the end of the probe result in less disturbance than mismatches at the center of the probe.
  • identification of a core is all that is required such as in, for example, forensic or genetic studies, and the like. In sequencing studies, this process is then repeated for left and/or right extensions of the core probe. In one example, only right extensions of the core probe are possible.
  • the possible 4-mer extension probes of the core probe are 3'-YXYY and 3'-YXYX. Again, the same selection criteria are utilized. Between 3'-YXYY and 3'-YXYX, it would normally be found that 3'-YXYX would have the strongest binding affinity, and this probe is selected as the correct probe extension. This selection may be confirmed by again plotting the normalized binding affinity of probes with single base mismatches as compared to the core probe.
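  • The mismatch check used as selection criterion #2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two-letter alphabet X/Y follows the example in the text, and the intensity values are invented for demonstration rather than taken from any measurement.

```python
def single_base_mismatches(core, alphabet="XY"):
    """Return every probe that differs from `core` at exactly one position."""
    return [
        core[:i] + base + core[i + 1:]
        for i in range(len(core))
        for base in alphabet
        if base != core[i]
    ]


def check_core(core, intensity, alphabet="XY"):
    """Normalize mismatch intensities to the core and apply the mismatch check."""
    reference = intensity[core]
    normalized = {
        probe: intensity[probe] / reference
        for probe in single_base_mismatches(core, alphabet)
    }
    # The candidate core passes when every single-base mismatch binds more weakly.
    return all(value < 1.0 for value in normalized.values()), normalized


# Hypothetical intensities (arbitrary units); probes are written 3'->5' as in the text.
intensity = {"YYXY": 1000.0, "XYXY": 620.0, "YXXY": 210.0, "YYYY": 180.0, "YYXX": 480.0}
ok, normalized = check_core("YYXY", intensity)
print(ok, normalized)
```

  • With these invented values the end mismatches (positions 1 and 4) retain more signal than the interior mismatches (positions 2 and 3), which is the pattern the text describes.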
  • a method for sequencing genomes that is comprised of the steps:
  • the clones may be comprised of large-sized clones that have genomic inserts greater than 250 kb (e.g., YACs), medium-sized clones that have genomic inserts greater than 50 kb, but less than 250 kb (e.g., PACs, BACs, P1s, or YACs), or small-sized clones that have genomic inserts less than 50 kb (e.g., cosmids, plasmids, phage, phagemids, or cDNAs).
  • the clone library has at least two-fold redundancy relative to the genome.
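  • As a rough illustration of the size classes and the redundancy requirement, the following Python sketch classifies a clone by insert size and computes library redundancy as total cloned DNA divided by genome size. The function names and the example numbers are assumptions; the text does not say how inserts of exactly 50 kb or 250 kb are classed, so the boundary handling here is arbitrary.

```python
def clone_size_class(insert_kb):
    """Classify a clone by genomic insert size (kb) using the thresholds in the text."""
    if insert_kb > 250:
        return "large"   # e.g., YACs
    if insert_kb > 50:
        return "medium"  # e.g., PACs, BACs, P1s, or YACs
    return "small"       # e.g., cosmids, plasmids, phage, phagemids, or cDNAs


def library_redundancy(num_clones, mean_insert_kb, genome_kb):
    """Fold redundancy: total cloned DNA divided by genome size."""
    return num_clones * mean_insert_kb / genome_kb


# Hypothetical library: 60,000 clones of 100 kb against a 3,000,000 kb genome.
fold = library_redundancy(60_000, 100, 3_000_000)
print(clone_size_class(100), fold, fold >= 2)
```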
  • the technology for constructing these clones is well described (F. M. Ausubel, R. Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl, ed., Current Protocols in Molecular Biology. New York, N.Y.: John Wiley and Sons, 1995; N. J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D.
  • Chromosome-specific cosmid clones are available from Los Alamos National Laboratories (Los Alamos, N.Mex.), genome-wide PAC clones from Pieter de Jong (Roswell Park, Buffalo, N. Y.), and the Genethon YAC libraries from the national genome center GESTECs, including the Whitehead Institute (Cambridge, Mass.).
  • cDNA libraries (ATCC, Rockville, Md.)
  • P1 libraries (DuPont/Merck Pharmaceuticals, Glenolden, Pa.)
  • BAC libraries (Research Genetics, Huntsville, Ala.)
  • cDNAs and other genome-wide resources (BIOS Labs, New Haven, Conn.)
  • DNA from the clones is prepared for DNA hybridization experiments.
  • DNA derived from bacterial clones (cosmids, PACs, etc.)
  • two straightforward protocols are: (a) growing up colonies for each clone, and then lysing the bacterial cells to expose the cloned insert DNA, or (b) specifically extracting the DNA material from the clone using a DNA prep such as an ion exchange column (Qiagen, Chatsworth, Calif.).
  • a species-specific DNA prep (e.g., Alu-PCR or IRE-bubble PCR)
  • the preferred long-range multiplexed probe is the radiation hybrid (RH) (D. R. Cox, M. Burmeister, E. R. Price, S. Kim, and R. M. Myers, "Radiation hybrid mapping: a somatic cell genetic method for constructing high-resolution maps of mammalian chromosomes," Science, vol. 250, pp. 245-250, 1990; S. J. Goss and H. Harris, "New method for mapping genes in human chromosomes," Nature, vol. 255, pp. 680-684, 1975; S. J. Goss and H. Harris, "Gene transfer by means of cell fusion: statistical mapping of the human X-chromosome by analysis of radiation-induced gene segregation," J. Cell.
  • Chromosome-specific RH libraries have been constructed for other human chromosomes (M. R. James, C. W. Richard III, J.-J. Schott, C. Yousry, K. Clark, J. Bell, J. Hazan, C. Dubay, A. Vignal, M. Agrapart, T. Imai, Y. Nakamura, M. Polymeropoulos, J. Weissenbach, D. R. Cox, and G. M. Lathrop, "A radiation hybrid map of 506 STS markers spanning human chromosome 11," Nature Genetics, vol. 8, no. 1, pp. 70-76, 1994; S. H.
  • Whole-genome RHs (WG-RHs) for humans and other mammalian genomes have also been developed (M. A. Walter, D. J. Spillett, P. Thomas, J. Weissenbach, and P. N. Goodfellow, "A method for constructing radiation hybrid maps of whole genomes," Nature Genet., vol. 7, no. 1, pp. 22-28, 1994), incorporated by reference, including the high-energy Stanford set (David Cox, Stanford, Calif.) and the low-energy Genethon set; the DNAs from both WG-RH sets are available (Research Genetics, Huntsville, Ala.).
  • One alternative embodiment is the use of rare cutter restriction enzymes (e.g., NotI partial digests) to develop large DNA sequences from genomes. These fragments can be purified using pulsed-field gel electrophoresis (D. C. Schwartz and C. R. Cantor, "Separation of yeast chromosome-sized DNAs by pulsed field gradient gel electrophoresis," Cell, vol. 37, pp. 67-75, 1984), incorporated by reference, and then selectively pooled.
  • a second alternative embodiment is the use of a second clone library that has a larger average insert size than the first clone library in step 1.
  • Subsets of these larger insert clones can be pooled together to form a long-range probe library (relative to the first clone library).
  • a third alternative embodiment which is particularly useful in animal models is the use of genetically inbred strains. With an F1 backcross between strains A and B, the meiotic events produce an interleaving of large chromosomal fragments of strains A and B. A subtractive hybridization can selectively remove the DNA from strain B, leaving behind just the large chromosomal regions of strain A for each backcross individual. This procedure constructs a long-range probe library (relative to the strain A clone library). The subtractive hybridization can be performed by first digesting the backcross individual genome with restriction enzymes, and then using whole genome DNA from strain B bound to solid support to selectively remove the strain B DNA.
  • the long-range probe DNA often resides in a complex background genome.
  • the background is murine genome
  • the background is the yeast genome. Therefore, the DNA preparations for these long-range probe embodiments preferably use a species-specific DNA extraction and amplification. The particular assay often depends on the clone library used.
  • inter-Alu hybridization is the preferred approach in step 5.
  • Alu-PCR preparation of the long-range probes (M. T. Ross and V. P. J. Stanton, "Screening large-insert libraries by hybridization," in Current Protocols in Human Genetics, vol. 1, N. J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D. Smith, ed. New York: John Wiley and Sons, 1995, pp.
  • IRE-bubble PCR (D. J. Munroe, M. Haas, E. Bric, T. Whitton, H. Aburatani, K. Hunter, D. Ward, and D. E. Housman, "IRE-bubble PCR: a rapid method for efficient and representative amplification of human genomic DNA sequences from complex sources," Genomics, vol. 19, no. 3, pp. 506-14, 1994), incorporated by reference.
  • IRE-bubble PCR is the preferred embodiment. This situation applies to many clone libraries, including cosmids, PACs, BACs, and P1s.
  • the species-specific DNA is then amplified and labeled for use as a hybridization probe.
  • this amplification and labeling is performed using a labeled dNTP with the random primer method (A. P. Feinberg and B. Vogelstein, "A technique for radiolabeling DNA restriction endonuclease fragments to high specific activity," Analyt. Biochem., vol. 132, pp. 6-13, 1983; N.J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D. Smith, ed., Current Protocols in Human Genetics. New York: John
  • 32P-dNTP is incorporated into a random primer PCR amplification, possibly using a kit such as the DECprime II DNA labeling kit (Ambion, Austin, Tex.). Other isotopes such as 35S or 33P can be used.
  • nonisotopic labeling is performed (L. J. Kricka, ed., Nonisotopic Probing, Blotting, and Sequencing, Second Edition. San Diego, Calif.: Academic Press, 1995), incorporated by reference.
  • 1.4.4.1.4 Comparing DNA from the clone library with DNA from the long-range probe library.
  • the labeled long-range probe DNA is hybridized against the gridded clone library (A. P. Monaco, V. M. S. Lam, G. Zehetner, G. G. Lennon, C. Douglas, D. Nizetic, P. N. Goodfellow, and H. Lehrach, "Mapping irradiation hybrids to cosmid and yeast artificial chromosome libraries by direct hybridization of Alu-PCR products," Nucleic Acids Res., vol. 19, no. 12, pp. 3315-3318, 1991), incorporated by reference.
  • the roles of the long-range probe library and the clone library are reversed, with the long-range probe immobilized on the membrane and the label on the clone.
  • the hybridization comparison is done by preannealing the probe with 25 ng of Cot-1 DNA (Gibco-BRL, Grand Island, N.Y.) for 2 hours at 37°C. before adding to the prehybridization mix.
  • the nylon filters containing the spotted clone DNA are then prehybridized overnight per manufacturer's instructions (Amersham, Arlington Heights, Ill.), except for the addition of sheared, denatured human placental DNA at a final concentration of 50 ng/ml. Filters are hybridized overnight at 68°C, washed three times with a final wash of 0.1× SSPE/0.1% SDS at 72°C, before exposing to autoradiographic film for 1 to 8 days. The exposed film image is then electronically scanned into a computer with memory. A phosphorimager (Molecular Dynamics, Sunnyvale, Calif.) or other electronic device can be used for imaging without the use of film.
  • each of the clone positions on the autoradiographs of the gridded filters is scored on a numerical scale, such as 1-5, with 1 negative, 2 equivocal, 3 weakly positive, 4 positive, and 5 strongly positive.
  • the maximum of the two scores is used, since there is a very high false-negative rate in the hybridization data.
  • This data entry can be facilitated by use of an interactive computer program that presents the electronic image of the filter on a computer display, or by automated computer interpretation of the scanned image.
  • 1.4.4.2 Producing a clone library characterized by long-range probes.
  • the hybridization experiments construct a table of scores that compare the DNA from clones against DNA from long-range probes for detectable sequence similarity, and thus presumed genomic colocalization.
  • the scores are rescaled so that the new scaling is approximately linear (C. C. Clogg and E. S. Shihadeh, Statistical Models for Ordinal Variables. Thousand Oaks, Calif.: Sage Press, 1994), incorporated by reference. That is, a unit increase in the scaling indicates a unit increase in the confidence one holds that the clone actually hybridized with the long-range probe.
  • An equivocal event is scored as a 0, since it was equally likely to be negative or positive.
  • a negative event is scored as -1, since there is high confidence that no observable hybridization has occurred; both positive and strongly positive events are scored as 1, since there is certainty that a hybridization event has occurred.
  • a weakly positive event can be scored at 0.67 when a single typing is available, since there is considerably more confidence that it is positive than negative, and is considered equivocal when duplicate typings were available.
  • the data is scored in a manner determined by the laboratory investigator and data analyst. This rescaled clone vs. probe comparison table A is stored in the memory of a computational device.
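  • The rescaling rules above can be written down as a small lookup, sketched here in Python. The names are hypothetical and the raw typings are invented. How the two duplicate-typing rules combine (take the maximum, then treat a weakly positive maximum as equivocal) is one reading of the text, not a confirmed procedure.

```python
# Rescaled confidence for a single typing of each raw 1-5 score, as given in the text.
SINGLE_TYPING = {1: -1.0, 2: 0.0, 3: 0.67, 4: 1.0, 5: 1.0}


def rescale(raw_scores):
    """Map one or two raw 1-5 typings of a clone/probe pair to a confidence value."""
    if len(raw_scores) == 1:
        return SINGLE_TYPING[raw_scores[0]]
    # Duplicate typings: use the maximum, since false negatives are common.
    best = max(raw_scores)
    # Assumption: a weakly positive maximum is treated as equivocal for duplicates.
    return 0.0 if best == 3 else SINGLE_TYPING[best]


# Hypothetical raw typings keyed by (clone, long-range probe).
raw = {
    ("clone1", "RH1"): [5],
    ("clone1", "RH2"): [1, 3],
    ("clone2", "RH1"): [1, 1],
    ("clone2", "RH2"): [3],
}
table_a = {pair: rescale(scores) for pair, scores in raw.items()}
print(table_a)
```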
  • This table A might suffice for ordering the clones using conventional RH mapping methods.
  • the high-throughput hybridization experiments incur a large noise cost. Therefore, some correction data is required to accurately map the clones.
  • This correction stage is performed in the following steps.
  • 1.4.4.2.1 Obtaining a bin probe library suitable for positioning the DNA sequences of long-range probes relative to the genome.
  • the bin probe library is comprised of sequence-tagged sites (STSs).
  • many of the STSs are preferably made polymorphic.
  • the genetic or physical markers to be used for each STS are obtained as PCR primer sequence pairs and PCR reaction conditions from available Internet databases (Genbank, Bethesda, Md.; GDB, Baltimore, Md.; EMBL, Cambridge, UK; Genethon, Evry, France; Stanford Genome Center, Stanford, Calif.; Whitehead Institute Genome Center, Cambridge, MA; G. Gyapay, J. Morissette, A. Vignal, C. Dib, C. Fizames, P. Millasseau, S. Marc, G. Bernardi, M.
  • the locations of the long-range probe fragments are localized on the genome by fluorescence in situ hybridization (FISH) studies.
  • the nuclear DNA of the genome serves as the bin probe.
  • the binning is effected by comparison with previously positioned DNA probes, including mapped clone libraries, ESTs, or PCR primers.
  • PCR amplifications are carried out between the STSs in the bin probe library and the RH (or other) DNAs in the long-range probe library. Subsequent detection for presence or absence of PCR products (+/- scores) is carried out either by gel electrophoresis or by internal oligonucleotide hybridizations.
  • the orders of the STSs relative to the genome are then determined using computational or statistical methods (M. Boehnke, "Radiation hybrid mapping by minimization of the number of obligate chromosome breaks," Genetic Analysis Workshop 7: Issues in Gene Mapping and the Detection of Major Genes. Cytogenet Cell Genet, vol. 59, pp. 96-98, 1992; M. Boehnke, K. Lange, and D. R. Cox, "Statistical methods for multipoint radiation hybrid mapping," Am. J. Hum. Genet., vol. 49, pp. 1174-1188, 1991; A. Chakravarti and J. E.
  • DNA from the long-range probes (e.g., species-specific PCR products)
  • the fragment positions on the genome of the probes are then visualized using fluorescent microscopic imaging. Linear fractional length measurements on the metaphase spreads of chromosomes are then performed to determine the bin positions of the fragments.
  • DNA from the previously positioned bin probes is hybridized to DNA from the long-range probes.
  • the procedures produce a data table which compares the DNA content of the long-range probes to bins on the genome.
  • this is a table B of long-range probes (the rows of B) vs. ordered STSs (the columns of B).
  • the pairwise distance information between the ordered STSs is also recorded.
  • the table can be arranged similarly.
  • step 10 produces a table which bins each clone relative to the genome.
  • this is a table C of clones (the rows of C) vs. ordered bins (the columns of C). Each entry in the table describes the confidence that the clone is located in the bin.
  • this result C is a binning of clones, not a contig.
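  • The text here does not spell out how table C is computed from tables A and B. One plausible sketch, assumed purely for illustration, is to combine them as a matrix product so that each clone accumulates evidence for a bin from every long-range probe that both hybridizes to the clone and contains that bin. The function names and all values below are hypothetical.

```python
def bin_clones(table_a, table_b):
    """Multiply A (clones x probes) by B (probes x bins) to get C (clones x bins)."""
    num_bins = len(table_b[0])
    return [
        [sum(a * row_b[k] for a, row_b in zip(row_a, table_b)) for k in range(num_bins)]
        for row_a in table_a
    ]


def best_bin(row_c):
    """Index of the bin with the highest accumulated confidence for one clone."""
    return max(range(len(row_c)), key=lambda k: row_c[k])


# Hypothetical data: 2 clones x 3 long-range probes, and 3 probes x 3 ordered bins.
table_a = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.67]]
table_b = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
table_c = bin_clones(table_a, table_b)
print(table_c, [best_bin(row) for row in table_c])
```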
  • a short-range probing is preferably performed. This probing and contig formation is performed in the following steps.
  • 1.4.4.3.1 Obtaining a short-range probe library relative to the clone library.
  • oligonucleotide sequences are generally designed to preferentially detect sequences that are related to the genes in the genome, rather than to repetitive elements in the genome or to the cloning vector. This selective bias can be achieved either by experimental probings, or by examination of the sequences to be compared.
  • these oligonucleotides are preferably ordered from a DNA synthesis service (Research Genetics, Huntsville, Ala.). Alternatively, they can be synthesized on a DNA synthesizer (Applied Biosystems, Foster City, Calif.).
  • Alternative hybridization embodiments include using clones (or their PCR products) to probe clone libraries, using pools of clones as hybridization probes, and using Southern blotting of digested clones with repetitive element hybridization probes.
  • Enzymatic methods include gel electrophoresis of restriction endonuclease digests of clones, PCR-based STS comparisons, and hybrid methods such as Alu fingerprinting. Other short-range probes can be formed by selective or random retention of fragments produced by genome cutting.
  • For experimental efficiency, many of these short-range probes work in a multiplexed way, and probe one or more genome regions simultaneously. These probes include oligonucleotides, pooled clones, and repetitive-element fingerprint probes.
  • DNA from the clone library is spotted onto nylon membranes. This DNA is comprised of lysed colonies, DNA preps, or species-specific PCR products. The membranes are then prepared for hybridization. Each oligonucleotide short-range probe is then labeled, preferably with 32P using a kinase. The labeled probe is then hybridized to the membranes, followed by rinsing, stringent washing, and autoradiography. The filters may be stripped for subsequent reuse. The autoradiograph spots are then scored on a binary or more continuous (e.g., 0-255) scale.
  • the comparison experiments of the previous step construct a table D of scores that compare the DNA from clones against DNA from short-range probes. These provide measures of genomic colocalization and distance.
  • contigs can be formed from the short-range characterization data of the clones.
  • each clone's score signature relative to the oligonucleotides is compared against other clones' score signatures. Pairs of clones having similar score signatures are inferred to be close, and their distances can be estimated.
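  • A minimal Python sketch of the signature comparison follows. It assumes binary scores and uses the fraction of differing oligonucleotide scores as a stand-in distance; the text does not name a specific distance measure, and the clone names and signatures are invented.

```python
from itertools import combinations


def signature_distance(sig_a, sig_b):
    """Fraction of oligonucleotide probes on which two clones' scores differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical binary rows of table D: one score per oligonucleotide probe.
table_d = {
    "cloneA": [1, 1, 0, 0, 1, 0],
    "cloneB": [1, 1, 0, 1, 1, 0],
    "cloneC": [0, 0, 1, 1, 0, 1],
}
distances = {
    (a, b): signature_distance(table_d[a], table_d[b])
    for a, b in combinations(sorted(table_d), 2)
}
closest_pair = min(distances, key=distances.get)
print(distances, closest_pair)
```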
  • the preferred ordering method is simulated annealing (W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge University Press, 1988), incorporated by reference. Effective contiging algorithms have been described (A. J. Cuticchia, J.
  • Warburton, I. S. Edelman, and A. Efstratiadis "Assembly of ordered contigs of cosmids selected with YACs of human chromosome 13," Genomics, vol. 21, pp. 525-537, 1994; R. Mott, A. Grigoriev, E. Maier, J. Hoheisel, and H. Lehrach, "Algorithms and software tools for ordering clone libraries: application to the mapping of the genome of Schizosaccharomyces pombe," Nucleic Acids Research, vol. 21, no. 8, pp. 1965-1974, 1993), incorporated by reference.
  • a (not necessarily unique) subset of clones that cover the genome can be identified. This identification is done by starting from a leftmost clone A and moving rightward: a neighbor B which overlaps A is selected, and the process then continues iteratively from B. A constraint can be placed on this process to find tiling paths having small or minimal length, where length is defined as the sum of the insert sizes of the component clones.
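  • The left-to-right walk can be sketched in Python as a greedy selection over mapped clone intervals. The coordinates and names are hypothetical, and the greedy rule (always take the overlapping neighbor that reaches farthest right) is an assumed heuristic: it keeps the path short in number of clones but is not guaranteed to minimize the summed insert size mentioned in the text.

```python
def tiling_path(clones):
    """clones: dict name -> (start, end) on the map. Return a left-to-right tiling path."""
    # Start from the leftmost clone; break ties by the farthest right end.
    current = min(clones, key=lambda n: (clones[n][0], -clones[n][1]))
    path = [current]
    while True:
        cur_end = clones[current][1]
        # Neighbors must overlap the current clone and extend beyond its right end.
        candidates = [n for n, (s, e) in clones.items() if s <= cur_end and e > cur_end]
        if not candidates:
            return path
        current = max(candidates, key=lambda n: clones[n][1])
        path.append(current)


# Hypothetical mapped clones with positions in kb.
clones = {
    "c1": (0, 40), "c2": (10, 60), "c3": (35, 90),
    "c4": (55, 120), "c5": (85, 150), "c6": (100, 140),
}
path = tiling_path(clones)
path_length = sum(clones[n][1] - clones[n][0] for n in path)
print(path, path_length)
```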
  • each mapped clone is selected in turn from a minimum tiling path.
  • This clone is then subcloned into M13 sequencing vectors.
  • M13 subclone nested deletions are constructed for use in DNA sequencing.
  • a DNA sequencing template is prepared. This template is then sequenced by the dideoxy method, preferably using an automated DNA sequencer, such as an A. L. F. (Pharmacia Biotech, Piscataway, N.J.) or an ABI/373 or ABI/377 (Applied Biosystems, Foster City, Calif.), and 100-500 bp of sequence determined.
  • a "walking" phase takes additional reads from selected subclones by use of custom primers.
  • Complete protocols for these and related sequencing steps have been described (F. M. Ausubel, R. Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl, ed., Current Protocols in Molecular Biology. New York, N.Y.: John Wiley and Sons, 1995; N. J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D. Smith, ed., Current Protocols in Human Genetics. New York: John Wiley and Sons, 1995).
  • sequences of the nested deletion clones are assembled into the complete sequence of the subclone by matching overlaps.
  • the subclone sequences are then assembled into the sequence of the mapped clone.
  • the sequences of the mapped clones are assembled into the complete sequence of the genome by matching overlaps.
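  • Assembly by matching overlaps can be illustrated with a toy Python sketch. It assumes error-free reads that are already ordered left to right and uses exact suffix/prefix matching; real assembly programs such as those cited in the text handle errors, orientation, and ordering, none of which is modeled here. The function names and reads are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_ordered_reads(reads, min_overlap=3):
    """Merge reads that are already ordered left to right by their overlaps."""
    assembled = reads[0]
    for read in reads[1:]:
        size = overlap_length(assembled, read, min_overlap)
        if size == 0:
            raise ValueError("no overlap found between consecutive reads")
        # Append only the part of the read that extends past the overlap.
        assembled += read[size:]
    return assembled


reads = ["ATGGCGTAC", "GTACCTTAG", "TTAGGCA"]
assembled = merge_ordered_reads(reads)
print(assembled)
```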
  • Computer programs are available for these tasks (Rodger Staden programs, Cambridge, UK; DNAStar, Madison, Wis.). Following sequence assembly, current analysis practice includes similarity and homology searches relative to sequence databases (Genbank, Bethesda, Md.; EMBL, Cambridge, UK; Phil Green's GENEFINDER, Seattle, Wash.) to identify genes and repetitive elements, infer function, and determine the sequence's relation to other parts of the genome and cell.
  • 1.4.4.4.6 Application of Strategies
  • the current best mode for sequencing is gel electrophoresis on polyacrylamide gels, possibly using fluorescence detection.
  • Newer technologies for DNA size separation are being developed that are applicable to DNA sequencing, including ultrathin gel slabs (A. J. Kostichka, M. L. Marchbanks, R. L. Brumley Jr., H. Drossman, and L. M. Smith, "High speed automated DNA sequencing in ultrathin slab gels," Bio/Technology, vol. 10, pp. 78-81, 1992), incorporated by reference, capillary arrays (R. A. Mathies and X. C. Huang, "Capillary array electrophoresis: an approach to high-speed, high-throughput DNA sequencing," Nature, vol.
  • the process begins with a fragment of DNA, such as a genomic fragment, which is inserted into an appropriate host vector capable of accommodating it.
  • a BAC vector can accommodate approximately 140 kb of DNA;
  • a cosmid vector can accommodate approximately 40 kb.
  • a composition comprised of these insert-containing vectors is randomly sheared using standard methods, such as sonication, to obtain fragments suitable for transposon-based sequencing, i.e., about 2-5 kb, preferably 3-4 kb, on average.
  • the resulting subfragments are ligated into cloning vectors to create a first library of subclones representing the original fragment. Because the subclones in this library will be used as target plasmids for transposon-mediated sequencing, the size of the cloning vector should be minimized; preferably it should contain only a selectable marker, an origin of replication, and an insertion site.
  • a suitable host plasmid is pOT2; the subfragments obtained by shearing the original composition are end-repaired, ligated to suitable restriction site containing adapters, and inserted into the host vector. Suitable adapters for the pOT2 vector contain BstXI sites.
  • the resulting cloning vectors with their inserts are then transfected into bacteria, typically E. coli, for clonal growth.
  • This first library should contain a 15-20-fold representation of the original fragment of DNA. For example, if the original fragment is approximately 40 kb, and the subclones contain inserts of approximately 4 kb, 200 such clones would be required for a 20-fold representation of the original fragment.
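  • The arithmetic behind the example is simply fold coverage times fragment size divided by insert size. A short Python check, with a hypothetical function name:

```python
import math


def clones_for_representation(fragment_kb, insert_kb, fold):
    """Number of subclones needed for a given fold representation of a fragment."""
    return math.ceil(fold * fragment_kb / insert_kb)


# The example from the text: 40 kb fragment, 4 kb inserts, 20-fold representation.
print(clones_for_representation(40, 4, 20))
# The lower end of the stated 15-20-fold range.
print(clones_for_representation(40, 4, 15))
```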
  • this first library will contain subclones which do not contain DNA derived from the original fragment to be sequenced.
  • a preliminary hybridization screen is conducted.
  • the required number of subclones is prepared for hybridization screening, for example, by plating in 96-well plates and transferring to filters.
  • the filters are then probed with the original fragment insert to weed out any colonies which do not contain DNA which represents portions of the original fragment. This checks the quality of the library and eliminates subclones that contain only host cloning vector for the original fragment or contaminating bacterial DNA.
  • the criteria used to determine the number of subclones used to establish the database in the method described above are that low sequencing redundancy must be maintained and a complete path must be available within the set of subclones chosen to provide complete coverage of the original fragment. In addition, the number must be chosen so that there is a high probability of finding the next subclone when searching with the newly sequenced end sequence.
  • subclone coverage (i.e., the redundancy based on the complete sequence contained in the number of subclones chosen) is important.
  • a subclone coverage factor of 7x-8x provides a 99.9% probability that each nucleotide in the fragment will actually reside in the library. This requires only about 100 subclones averaging 3 kb in size for a 40 kb fragment.
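  • The 99.9% figure is consistent with a simple random-sampling model in which the chance that a given nucleotide is missed at coverage c is about e^(-c). That Poisson-style model is an assumption made here for illustration; the text does not state which model was used. A Python sketch with hypothetical function names:

```python
import math


def coverage_factor(num_subclones, insert_kb, fragment_kb):
    """Subclone coverage: total subcloned sequence divided by fragment length."""
    return num_subclones * insert_kb / fragment_kb


def probability_covered(coverage):
    """Probability that a given nucleotide lies in at least one subclone (Poisson model)."""
    return 1.0 - math.exp(-coverage)


c = coverage_factor(100, 3, 40)  # the example in the text
print(c, probability_covered(c))
print(probability_covered(7), probability_covered(8))
```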
  • Sequence information from the host vector for the original fragment is used as the first query and reveals which subclones in the library are hybrid vector/fragment insert subclones. These will identify the two ends of the original fragment.
  • One subclone representing each end, preferably that containing the least amount of vector sequence, is selected for further sequencing.
  • the insert of the identified subclone will be sequenced from the opposite end from that previously sequenced, i.e., opposite the end containing the vector sequence.
  • the new sequence information (which is now derived from the fragment) is used as the next query. This identifies additional subclones which contain additional nucleotide sequence farther in from the end of the original fragment.
  • the next identified subclone is then also sequenced from the opposite end of the insert from that used to place it in the database and the new sequence information used as the next query. The process is continued sequentially until a subclone path through the fragment is obtained.
  • the subclone path will represent the collection of subclones which completely define the fragment from which they originated, and their correct relative positions are known.
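  • The iterative end-sequence query can be sketched as follows in Python. This toy version makes several assumptions that are not in the text: all inserts share one orientation, reads are error-free, the database search is replaced by exact substring matching, and when several subclones match, the one extending farthest past the query is chosen. The fragment, subclone names, and query length are invented.

```python
def walk_subclone_path(subclones, start, query_len=4):
    """Follow end-sequence queries from `start` until no subclone extends the path."""
    path = [start]
    current = start
    while True:
        query = subclones[current][-query_len:]  # newly sequenced far end
        best, best_tail = None, 0
        for name, seq in subclones.items():
            if name in path:
                continue
            pos = seq.find(query)
            if pos == -1:
                continue
            tail = len(seq) - (pos + query_len)  # bases beyond the query match
            if tail > best_tail:
                best, best_tail = name, tail
        if best is None:
            return path
        path.append(best)
        current = best


# Hypothetical 24-base fragment and four subclones cut from it.
fragment = "AACCGGTTACGTAGCTAGGATCCA"
subclones = {
    "s1": fragment[0:10],
    "s2": fragment[6:16],
    "s3": fragment[4:12],
    "s4": fragment[12:24],
}
subclone_path = walk_subclone_path(subclones, "s1")
print(subclone_path)
```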
  • Once the subclone path is determined, it remains only to complete the sequencing of the subclones involved in the path. According to the method of the invention, this is accomplished using the transposon-mediated method of Strathmann, incorporated by reference hereinabove. Use of this method to complete the sequence information for the fragment has been designated "minimal assembled path" (MAP) sequencing. The name is apt because the information provided by the subclone path can be used to determine the minimal sequencing path through the identified subclones. For example, if two subclones overlap over 1 kb, transposon insertions can be selected so that the overlap region is sequenced only once.
  • primers are used in combination with capturable chain terminators to produce primer extension products capable of being captured on a solid phase, where the primer extension products may be labeled, e.g., by employing labeled primers to generate the primer extension products.
  • the primer extension products are isolated through capture on a solid phase. The isolated primer extension products are then released from the solid phase, size separated and detected to yield sequencing data from which the nucleic acid sequence is determined.
  • enzymatic sequencing methods, which are also referred to as Sanger dideoxy or chain termination methods
  • differently sized oligonucleotide fragments representing termination at each of the bases of the template DNA are enzymatically produced and then size separated yielding sequencing data from which the sequence of the nucleic acid is determined.
  • the results of such size separations are shown herein.
  • the first step in such methods is to produce a family of differently sized oligonucleotides for each of the different bases in the nucleic acid to be sequenced, e.g. for a strand of DNA comprising all four bases (A, G, C, and T) four families of differently sized oligonucleotides are produced, one for each base.
  • each base in the sequenced nucleic acid (i.e., template nucleic acid)
  • an oligonucleotide primer, a polymerase
  • nucleotides and a dideoxynucleotide corresponding to one of the bases in the template nucleic acid.
  • Each of the families of oligonucleotides are then size separated, e.g. by electrophoresis, and detected to obtain sequencing data, e.g. a separation pattern or electropherogram, from which the nucleic acid sequence is determined.
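  • The logic of reading a sequence from four size-separated families can be illustrated with a toy Python model. It assumes ideal chemistry (one terminated product for every position, perfect size resolution) and measures fragment length in bases added past the primer; the function names and the example sequence are hypothetical.

```python
def termination_families(synthesized):
    """For each base, list the lengths of extension products terminating in that base."""
    families = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        families[base].append(length)
    return families


def read_sequence(families):
    """Recover the sequence by ordering all fragments by size across the four families."""
    ladder = sorted(
        (length, base) for base, lengths in families.items() for length in lengths
    )
    return "".join(base for _, base in ladder)


families = termination_families("GATTACA")
print(families)
print(read_sequence(families))
```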
  • primer extension products consisting of the families of different sized oligonucleotide fragments (hereinafter referred to as primer extension products) comprising a capture moiety at the 3' terminus.
  • the primer sequences employed to generate the primer extension products will be sufficiently long to hybridize the nucleic acid comprising the target or template nucleic acid under chain extension conditions, where the length of the primer will generally range from 6 to 40, usually 15 to 30 nucleotides in length.
  • the primer will generally be a synthetic oligonucleotide, analogue or mimetic thereof, e.g. a peptide nucleic acid.
  • Although primers may hybridize directly to the 3' terminus of the target nucleic acid where a sufficient portion of this terminus of the target nucleic acid is known, conveniently a universal primer may be employed which anneals to a known vector sequence flanking the target sequence.
  • Universal primers which are known in the art and commercially available include pUC/M13, λgt10, λgt11 and the like.
  • the primers employed in the subject invention will comprise a detectable label.
  • labels are known in the art and suitable for use in the subject invention, including radioisotopic, chemiluminescent and fluorescent labels.
  • fluorescent labels are preferred.
  • Fluorescently labeled primers employed in the subject methods will generally comprise at least one fluorescent moiety stably attached to one of the bases of the oligonucleotide.
  • the primers employed in the subject invention may be labeled with a variety of different fluorescent moieties, where the fluorescer or fluorophore should have a high molar absorbance, where the molar absorbance will generally be at least 10⁴ cm⁻¹ M⁻¹, usually at least 10⁴ cm⁻¹ M⁻¹ and preferably at least 10⁵ cm⁻¹ M⁻¹, and a high fluorescence quantum yield, where the fluorescence quantum yield will generally be at least about 0.1, usually at least about 0.2 and preferably at least about 0.5.
  • the wavelength of light absorbed by the fluorescer will generally range from about 300 to 900 nm, usually from about 400 to 800 nm, where the absorbance maximum will typically occur at a wavelength ranging from about 500 to 800 nm.
  • Specific fluorescers of interest for use in singly labeled primers include: fluorescein, rhodamine, BODIPY, cyanine dyes and the like, and are further described in Smith et al., Nature (1986) 321:674-679, the disclosure of which is herein incorporated by reference.
  • energy transfer labeled fluorescent primers in which the primer comprises both a donor and acceptor fluorescer component in energy transfer relationship.
  • Energy transfer labeled primers are described in PCT/US95/01205 and PCT/US96/13134, as well as in Ju et al., Nature Medicine (1996) 2:246-249, the disclosures of which are herein incorporated by reference.
  • instead of using labeled primers, labeled deoxynucleotides are employed, such as fluorescently labeled dUTP, which are incorporated into the primer extension product, resulting in a labeled primer extension product.
  • the dideoxynucleotides employed as capturable chain terminators in the subject methods will comprise a functionality capable of binding to a functionality present on a solid phase.
  • the bond arising from reaction of the two functionalities should be sufficiently strong so as to be stable under washing conditions and yet be readily disruptable by specific chemical or physical means.
  • the chain terminator dideoxynucleotide will comprise a member of a specific binding pair which is capable of specifically binding to the other member of the specific binding pair present on the solid phase.
  • Specific binding pairs of interest include ligands and receptors, such as antibodies and antigens, biotin and strept/avidin, sulfide and gold (Cheng & Brajter-Toth, Anal.Chem.
  • the nucleic acids which are capable of being sequenced by the subject methods are generally deoxyribonucleic acids that have been cloned in an appropriate vector, where a variety of vectors are known in the art and commercially available, and include M13mp18, pGEM, pSport and the like.
  • the first step in the subject method is to prepare a reaction mixture for each of the four different bases of the sequence to be sequenced or target DNA.
  • Each of the reaction mixtures comprises an enzymatically generated family of primer extension products, usually labeled primer extension products, terminating in the same base.
  • template DNA, a DNA polymerase, primer (which may be labeled), the four different deoxynucleotides, and capturable dideoxynucleotides are combined in a primer extension reaction mixture.
  • the components are reacted under conditions sufficient to produce primer extension products which are differently sized due to the random incorporation of the capturable dideoxynucleotide and subsequent chain termination.
  • the above listed reagents will be combined into a reaction mixture, where the dideoxynucleotide is ddATP modified to comprise a capturable moiety, e.g. biotinylated ddATP such as biotin-11-ddATP.
  • the remaining "G," "C," and "T" families of differently sized primer extension products will be generated in an analogous manner using the appropriate dideoxynucleotide.
  • the labeled primers may be the same or different.
  • the labeled primer employed will be different for production of each of the four families of primer extension products, where the labels will be capable of being excited at substantially the same wavelength and yet will provide a distinguishable signal.
  • the use of labels with distinguishable signals affords the opportunity of separating the differently sized primer extension products when such products are together in the same separation medium. This results in superior sequencing data and therefore more accurate sequence determination.
  • the label used in production of "G," "C," and "T" families will be excitable at the same wavelength as that used in the "A" family, but will emit at 555 nm, 580 nm, and 605 nm respectively. Accordingly, the primer extension labels are designed so that all four of the labels absorb at substantially the same wavelength but emit at different wavelengths, where the wavelengths of the emitted light differ in detectable and differentiatable amounts, e.g. differ by at least 15 nm.
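  • The 15 nm separation criterion can be checked mechanically for any proposed label set. The sketch below is illustrative only: the "G", "C" and "T" emission maxima are those stated above, while the "A" value of 525 nm and the helper names are assumptions made for the example.

```python
from itertools import combinations

# Emission maxima in nm. G, C and T come from the text above;
# the "A" value is an ASSUMED placeholder for illustration only.
EMISSION_NM = {"A": 525, "G": 555, "C": 580, "T": 605}
MIN_SEPARATION_NM = 15  # "differ by at least 15 nm"


def min_pairwise_separation(emission):
    """Smallest difference (nm) between any two emission maxima."""
    return min(abs(a - b) for a, b in combinations(emission.values(), 2))


def labels_distinguishable(emission, min_sep=MIN_SEPARATION_NM):
    """True if every pair of labels differs by at least min_sep nm."""
    return min_pairwise_separation(emission) >= min_sep
```

  • With the values above the closest pair is "G"/"C" at 25 nm apart, so the set passes; moving the assumed "A" label to 560 nm would fail the check.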
  • the next step in the subject method is isolation of the primer extension products.
  • the primer extension products are isolated by first capturing the primer extension products on a solid phase through the capture moiety at the 3' terminus of the primer extension product and then separating the solid phase from the remaining components of the reaction mixture.
  • Capture of the primer extension products occurs by contacting the reaction mixture comprising the family of primer extension products with a solid phase.
  • the solid phase has a member of a specific binding pair on its surface.
  • the other member of the specific binding pair is bonded to the primer extension products, as described above. Contact will occur under conditions sufficient to provide for stable binding of the specific binding pair members.
  • Specific solid phases of interest include polystyrene pegs, sheets, beads, magnetic beads, gold surfaces and the like. The surfaces of such solid phases have been modified to comprise the specific binding pair member, e.g. for biotinylated primer extension products, streptavidin coated magnetic beads may be employed as the solid phase.
  • the solid phase is then separated from the remaining components of the reaction mixture, such as template DNA, excess primer, excess deoxy- and dideoxynucleotides, polymerase, salts, extension products which do not have the capture moiety, and the like. Separation can be accomplished using any convenient methodology.
  • the methodology will typically comprise washing the solid phase, where further steps can include centrifugation, and the like.
  • the particular method employed to separate the solid phase is not critical to the subject invention, as long as the method employed does not disrupt the bond linking the primer extension reaction product to the solid phase.
  • the primer extension products are then released from the solid phase.
  • the products may be released using any convenient means, including both chemical and physical means, depending on the nature of the bond between the specific binding pair members.
  • the bond may be disrupted by contacting the solid phase with a chemical disruption agent, such as formamide, and the like, which disrupts the biotin-streptavidin bond and thereby releases the primer extension product from the solid phase.
  • the released primer extension products are then separated from the solid phase using any convenient means, including elution, centrifugation and the like.
  • the next step in the subject method is to size separate the primer extension products. Size separation of the primer extension products will generally be accomplished through electrophoresis, in which the primer extension products are moved through a separation medium under the influence of an electric field applied to the medium, as is known in the art. Alternatively, for sequencing with Mass Spectrometry (MS) where unlabeled primer extension products are detected, the sequencing fragments are separated in the time-of-flight chamber and detected by the mass of the fragments. See Roskey et al., Proc. Natl. Acad. Sci. USA (1996) 93:4724-4729. The subject methodology is especially important for obtaining accurate sequencing data with MS, because the subject methodology offers a means to load only the primer extension products terminated with the capturable chain terminators, eliminating all other masses and thereby producing accurate results.
  • the size separated primer extension products are then detected, where detection of the size separated products yields sequencing data from which the sequence of the target or template DNA is determined. For example, where the families of fragments are separated in a traditional slab gel in four separate lanes, one corresponding to each base of the target DNA, sequencing data in the form of a separation pattern is obtained. From the separation pattern, the target DNA sequence is then determined, e.g. by reading up the gel. Alternatively, where automated detectors are employed and all of the reaction products are separated in the same electrophoretic medium, the sequencing data may take the form of an electropherogram, as is known in the art, from which the DNA sequence is determined.
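  • "Reading up the gel" amounts to merging the four lanes and ordering the fragments by size. The following is a minimal sketch under idealized assumptions (every fragment length is observed exactly once, lengths are counted in nucleotides added to the primer); the function name and data layout are hypothetical and not part of the described method.

```python
def read_sequence(lanes):
    """Reconstruct the extended strand from four lanes of fragment lengths.

    lanes maps each base ("A", "C", "G", "T") to the lengths of the
    primer extension products found in that lane. The shortest fragment
    gives the first base after the primer, the next shortest the second
    base, and so on.
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for n in lengths:
            if n in base_at_length:
                raise ValueError(f"length {n} appears in more than one lane")
            base_at_length[n] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


example_lanes = {"A": [1, 5], "C": [2], "G": [3, 6], "T": [4]}
extended_strand = read_sequence(example_lanes)  # "ACGTAG"
```

  • The result is the sequence of the newly synthesized strand; the template sequence is its reverse complement.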
  • the nature of the labeled primers will, in part, determine whether the families of labeled primer extension products may be separated in the same electrophoretic medium, e.g. in a single lane of slab gel or in the same capillary, or in different electrophoretic media, e.g. in different lanes of a slab gel or in different capillaries.
  • the same labeled primer generating the same detectable signal is employed to generate the primer extension products in each of the different families
  • the families of primer extension products will be electrophoretically separated in different electrophoretic media, so that the families of primer extension products corresponding to each base in the nucleic acid can be distinguished.
  • the families of products may be grouped together and electrophoretically separated in the same electrophoretic medium.
  • the families of primer extension products may be combined or pooled together at any convenient point following the primer extension product generation step.
  • the primer extension products can be pooled either prior to contact with the solid phase, while bound to the solid phase or after separation from the solid phase but prior to electrophoretic separation.
  • kits for practicing the subject sequencing methods are also provided.
  • such kits will comprise capturable chain terminators, e.g. biotinylated ddATP, ddTTP, ddCTP and ddGTP.
  • the kits will further comprise a means for generating labeled primer extension products, such as labeled deoxynucleotides, or preferably labeled primers, where the labeled primers are preferably Energy Transfer labeled primers which absorb at the same wavelength and provide distinguishable fluorescent signals.
  • kits may further comprise one or more additional reagents useful in enzymatic sequencing, such as vector, polymerase, deoxynucleotides, buffers, and the like.
  • the kits may further comprise a plurality of containers, wherein each container may comprise one or more of the necessary reagents, such as labeled primer, unlabeled primer or degenerate primer, dNTPs, dNTPs containing a fraction of fluorescent dNTPs, capturable ddNTP, polymerase and the like.
  • the kits may also further comprise solid phase comprising a moiety capable of binding with the capturable ddNTP, such as streptavidin coated magnetic beads and the like.
  • the DNA fragments are preferably prepared according to either the enzymatic or chemical degradation sequencing techniques previously described, but the fragments are not tagged with radioactive tracers. These standard procedures produce, from each section of DNA to be sequenced, four separate collections of DNA fragments, each set containing fragments terminating at only one of the four bases. These four samples, suitably identified, are provided as a few microliters of liquid solution.
1.4.7.1 Sample Preparation and Introduction
  • Suitable matrices for this purpose include cinnamic acid derivatives such as (4-hydroxy, 3-methoxy) cinnamic acid (ferulic acid), (3,4-dihydroxy) cinnamic acid (caffeic acid) and (3,5-dimethoxy, 4-hydroxy) cinnamic acid (sinapinic acid). These materials may be dissolved in a suitable solvent such as a 3:2 mixture of 0.1% aqueous trifluoroacetic acid and acetonitrile at concentrations which are near saturation at room temperature.
  • One technique for introducing samples into the vacuum of the mass spectrometer is to deposit each sample and matrix as a liquid solution at specific spots on a disk or other media having a planar surface.
  • To prepare a sample for deposit approximately 1 microliter of the sample solution is mixed with 5-10 microliters of the matrix solution. An aliquot of this mixed solution for each DNA sample is placed on the disk at a specific location or spot, and the volatile solvents are removed by room temperature evaporation. When the solution containing the samples and thousand-fold or more excess of matrix is dried on the disk, the result should be a solid solution of samples each in the matrix at a specific site on the disk.
  • Each molecule of the sample should be fully encased in matrix molecules and isolated from other sample molecules. Aggregation of sample molecules should not occur.
  • the matrix need not be volatile, but it must be rapidly vaporized following absorption of photons. This can occur as the result of photochemical conversion to more volatile substances.
  • the matrix must transfer ionization to the sample.
  • To form protonated positive ions, the proton affinity of the matrix must be less than that of the basic sites on the molecule, and to form deprotonated negative ions, the gas phase acidity of the matrix must be less than that of acidic sites on the sample molecule.
  • the matrix does not absorb laser photons to avoid radiation damage and fragmentation of the sample. Therefore, matrices which have absorption bands at longer wavelengths are preferred, such as at 355 nm, since DNA fragment molecules do not absorb at the longer wavelengths.
  • Depicted herein is a suitable automated DNA sample preparation and loading technique.
  • a commercially available autosampler is used to add matrix solution from container to the separated DNA samples.
  • a large number of DNA fragment samples for example 120 samples, may be loaded into a sample tray.
  • the matrix solution may be added automatically to each sample using procedures available on such an autosampler, and the samples may then be spotted sequentially as sample spots on an appropriate surface, such as the planar surface of the disk rotated by stepper motor.
  • Sample spot identification is entered into the data storage and computing system which controls both the autosampler and the mass spectrometer. The location of each spot relative to a reference mark is thus recorded in the computer.
  • Sample preparation and loading onto the solid surface is done off-line from the mass spectrometer, and multiple stations may be employed for each mass spectrometer if the time required for sample preparation is longer than the measurement time.
  • the disk may be inserted into the ion source of a mass spectrometer through the vacuum lock. Any gas introduced in this procedure must be removed prior to measuring the mass spectrum.
  • Loading and pump down of the spectrometer typically requires two to three minutes, and the total time for measurement of each sample to obtain a spectrum is typically one minute or less.
  • 50 or more complete DNA spectra may be determined per hour according to the present invention. Even if the samples were manually loaded, less than one hour would be required to obtain sequence data on a particular segment of DNA, which might be from 400 to 600 bases in length. Even this latter technique is much faster than the conventional DNA sequencing techniques, and compares favorably with the newer automated sequencers using fluorescence labeling.
  • the technique of the present invention does not, however, require the full-time attention of a dedicated, trained operator to prepare and load the samples, and preferably is automated to produce 50 or more spectra per hour.
  • Under the control of the computer, the disk may be rotated by another stepper motor relative to the reference mark to sequentially bring any selected sample to the position for measurement. If the disk contains 120 samples, operator intervention is only required approximately once every two hours to insert a new sample disk, and less than five minutes of each two hour period is required for loading and pumpdown. With this approach, a single operator can service several spectrometers.
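  • The throughput figures quoted above follow from simple arithmetic, sketched below. The defaults (120 samples per disk, one minute per spectrum, three minutes for loading and pump-down) are taken from or bracketed by the values in the text; the function names are illustrative only.

```python
def disk_cycle_minutes(samples_per_disk=120, minutes_per_sample=1.0,
                       load_minutes=3.0):
    """Total time to load one disk and measure every spot on it."""
    return samples_per_disk * minutes_per_sample + load_minutes


def spectra_per_hour(samples_per_disk=120, minutes_per_sample=1.0,
                     load_minutes=3.0):
    """Average spectra per hour, including loading and pump-down overhead."""
    cycle = disk_cycle_minutes(samples_per_disk, minutes_per_sample,
                               load_minutes)
    return 60.0 * samples_per_disk / cycle
```

  • With these defaults one disk takes 123 minutes, i.e. roughly the two hours between operator interventions, and the average rate stays above the stated 50 spectra per hour.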
  • the particular disk geometry shown for the automated system is chosen for illustrative purposes only. Other geometries, employing for example linear translation of the planar surface, could also be used.
1.4.7.2 The Mass Spectrometer
  • the present invention preferably utilizes a laser desorption time of flight (TOF) mass spectrometer.
  • the disk has a planar face containing a plurality of sample spots, each being approximately equal to the laser beam diameter.
  • the disk is maintained at a voltage V1 and may be manually inserted and removed from the spectrometer. Ions are formed by sequentially radiating each spot on the disk with a laser beam from source.
  • the ions extracted from the face of the disk are attracted and pass through the grid covered holes in the metal plates.
  • the plates are at voltages V2 and V3.
  • V3 is at ground, and V1 and V2 are varied to set the accelerating electrical potential, which typically is in the range of 15,000-50,000 volts.
  • a suitable voltage V1 - V2 is 5000 volts and a suitable range of voltages V2 - V3 is 10,000 to 45,000 volts.
  • the low mass ions are almost entirely prevented from reaching the detector by the deflection plates.
  • the ions travel as a beam between the deflection plates which suitably are spaced 1 cm. apart and are 3-10 cm long.
  • the first plate is at ground and a second plate receives square wave pulses, for example, at 700 volts with a pulse width in the order of 1 microsecond after the laser strikes the tip.
  • Such pulses suppress the unwanted low mass ions, for example, those under 1,000 Daltons, by deflecting them, so that the low weight ions do not reach the detector, while the higher weight ions pass between the plates after the pulse is off, so they are not deflected, and are detected by detector.
  • An ion detector is positioned at the end of the spectrometer tube and has its front face maintained at voltage Vd.
  • the gain of the ion detector is set by V which typically is in the range of -1500 to -2500 volts.
  • the detector is a chevron-type tandem microchannel plate array with a front plate at about -2000 volts.
  • the spectrometer tube is straight and provides a linear flight path, for example, 1/2 to 4 meters in length, and preferably about two meters in length.
  • the ions are accelerated in two stages and the total acceleration is in the range of about 15,000-50,000 volts, positive or negative.
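  • For orientation, the flight time implied by these figures can be estimated from energy conservation, zeV = mv²/2, so that the drift time over a path of length L is L·sqrt(m / (2zeV)). The sketch below is a simplification that assumes the ion has its full energy over the entire path and ignores the time spent in the two acceleration stages; the function name is illustrative.

```python
import math

DALTON_KG = 1.66053906660e-27          # atomic mass constant, kg
ELEMENTARY_CHARGE_C = 1.602176634e-19  # elementary charge, C


def drift_time_s(mass_da, accel_volts, path_m, charge=1):
    """Field-free drift time for an ion of the given mass (Da) and charge
    after acceleration through accel_volts, over a path of path_m metres."""
    energy_j = charge * ELEMENTARY_CHARGE_C * accel_volts
    velocity = math.sqrt(2.0 * energy_j / (mass_da * DALTON_KG))
    return path_m / velocity


# A singly charged 10,000 Da ion, 30,000 V total acceleration, 2 m tube.
t_example = drift_time_s(10_000, 30_000, 2.0)
```

  • For these example values the drift time is on the order of 80 microseconds, and quadrupling the mass doubles it.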
  • the spectrometer is held under high vacuum, typically 10 µPa, which may be obtained, for example, after 2 minutes of introduction of the samples.
  • the face of the disk is struck with a laser beam to form the ions.
  • the laser beam is from a solid laser.
  • a suitable laser is an HY-400 Nd-YAG laser (available from Lumonics Inc., Kanata (Ottawa), Ontario, Canada), with a 2nd, 3rd and 4th harmonic generation/selection option.
  • the laser is tuned and operated to produce maximum temporal and energy stability.
  • the laser is operated with an output pulse width of 10 ns and an energy of 15 mJ of UV per pulse.
  • the amplifier rod is removed from the laser.
  • the output of the laser is attenuated with a 935-5 variable attenuator (available from Newport Corp., Fountain Valley, Calif.), and focused onto the sample on the face, using a 12-in. focal length fused-silica lens.
  • the incident angle of the laser beam, with respect to the normal of the disk's sample surface, is 70°.
  • the spot illuminated on the disk is not circular, but a stripe of approximate dimensions 100 x 300 µm or larger.
  • the start time for the data system i.e., the time the laser actually fired
  • the laser is operated in the Q switched mode, internally triggering at 5 Hz, using the Pockels cell Q-switch to divide that frequency to a 2.5 Hz output.
  • the data system for recording the mass spectra produced is a combination of a TR8828D transient recorder and a 6010 CAMAC crate controller (both manufactured by Lecroy, Chestnut Ridge, N. Y.).
  • the transient recorder has a selectable time resolution of 5-20 ns.
  • Spectra may be accumulated for up to 256 laser shots in 131,000 channels, with the capability of running at up to 3 Hz, or with fewer channels up to 10 Hz.
  • the data is read from the CAMAC crate using a Proteus IBM AT compatible computer.
  • the spectra (shot-to-shot) may be readily observed on a 2465A 350 MHz oscilloscope (available from Tektronix, Inc., Beaverton, Oreg.).
  • a suitable autosampler for mixing the matrix solution and each of the separated DNA samples and for depositing the mixture on a solid planar surface is the Model 738 Autosampler (available from Alcott Co., Norcross, Ga.).
  • This linear TOF system may be switched from positive to negative ions easily, and both modes may be used to look at a single sample.
  • the sample preparation was optimized for the production of homogeneous samples in order to produce similar signals from each DNA sample spot.
1.4.7.3 Data Analysis and Determination of Sequence
  • the raw data obtained from the laser desorption mass spectrometer consists of ion current as a function of time after the laser pulse strikes the target containing the sample and matrix. This time delay corresponds to the "time-of-flight" required for an ion to travel from the point of formation in the ion source to the detector, and is proportional to the square root of the mass-to-charge ratio of the ion. By reference to results obtained for materials whose molecular weights are known, this time scale can be converted to mass with a precision of 0.01% or better.
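  • The conversion from time to mass by reference to known standards can be sketched as a two-point calibration. At a fixed accelerating potential the flight time in a linear TOF instrument scales with the square root of m/z, so the sketch assumes the model t = t0 + k·sqrt(m) for singly charged ions; the model form, the function names and the use of exactly two standards are assumptions for illustration, not the procedure of the source.

```python
import math


def calibrate(standard_1, standard_2):
    """Fit t = t0 + k * sqrt(m) through two (mass, time) standards.

    Returns (t0, k). Masses in Da; times in any consistent unit.
    """
    (m1, t1), (m2, t2) = standard_1, standard_2
    k = (t2 - t1) / (math.sqrt(m2) - math.sqrt(m1))
    t0 = t1 - k * math.sqrt(m1)
    return t0, k


def mass_from_time(t, t0, k):
    """Invert the calibration: convert a flight time to a mass in Da."""
    return ((t - t0) / k) ** 2
```

  • In practice more than two standards and a least-squares fit would normally be used; two points are the minimum needed to fix the offset t0 and the scale k.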
  • the data obtained from the mass spectra contains significantly more useful information than the corresponding traces from electrophoresis. Not only can the mass order of the peaks be determined with good accuracy and precision, but also the absolute mass differences between adjacent peaks, both in individual spectra and between spectra, can be determined with high accuracy and precision.
  • This information may be used to detect and correct sequence errors which might otherwise go undetected. For example, a common source of error which often occurs in conventional sequencing results from variations in the amounts of the individual fragments present in a mixture due to variations in the cleavage chemistry. Because of this variation it is possible for a small peak to go undetected using conventional sequencing techniques.
  • such errors can be immediately detected by noting that the mass differences between detected peaks do not match the apparent sequence.
  • the error can be quickly corrected by calculating the apparent mass of the missing base from the observed mass differences across the gap.
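  • The gap-correction idea can be sketched as follows. The sketch assumes a merged, mass-ordered peak list in which each peak carries the base of the spectrum it came from, and in which consecutive fragments differ by one nucleotide residue. The residue masses are standard average values for nucleotide residues within a DNA chain and are not taken from the source; the tolerance and all names are hypothetical.

```python
# Average residue masses (Da) of nucleotides within a DNA chain.
# Standard textbook values, NOT taken from the source document.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def match_base(mass, tol):
    """Return the base whose residue mass is within tol of mass, else None."""
    best = min(RESIDUE_MASS, key=lambda b: abs(RESIDUE_MASS[b] - mass))
    return best if abs(RESIDUE_MASS[best] - mass) <= tol else None


def read_with_gap_repair(peaks, tol=2.0):
    """Read a sequence from (mass, base) peaks sorted by mass.

    If the step to the next peak is larger than one residue, the excess
    is matched to a single residue mass and reported in lower case as an
    inferred (undetected) base.
    """
    calls = [peaks[0][1]]
    for (m_prev, _), (m_next, base) in zip(peaks, peaks[1:]):
        extra = (m_next - m_prev) - RESIDUE_MASS[base]
        if abs(extra) <= tol:
            calls.append(base)
            continue
        missing = match_base(extra, tol)
        if missing is None:
            raise ValueError(f"unexplained mass step before {m_next}")
        calls.append(missing.lower())
        calls.append(base)
    return "".join(calls)
```

  • For a ladder G, A, T, C in which the T peak was not detected, the step from A to C exceeds one C residue by about 304.2 Da, which identifies the missing base as T. Only single-base gaps are handled in this sketch.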
  • the present invention provides sequence data not only much faster than conventional techniques, but also data which is more accurate and reliable. This correction technique will reduce the number of extra runs which are required to establish the validity of the result.
  • the present invention enables the amplification of a DNA stretch using the PCR procedure with the knowledge of only one primer.
  • the present invention describes a procedure by which a very long DNA of the order of millions of nucleotides can be sequenced contiguously, without the need for fragmenting and sub-cloning the DNA.
  • the general PCR technique is used, but the knowledge of only one primer is sufficient, and the knowledge of the other primer is derived from the statistics of the distributions of oligonucleotide sequences of specified lengths.
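  • The statistical argument can be made concrete. If the four bases are assumed equally frequent and independent (an idealization), a fixed sequence of k nucleotides occurs on average once every 4^k positions, so a binding site for the partly fixed primer is expected within a few hundred to a few thousand bases of the known primer. The sketch below illustrates this and locates the nearest downstream site on a given strand; all names are hypothetical, and the random portion of the second primer is ignored.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def expected_spacing(k):
    """Mean distance between occurrences of a fixed k-mer, assuming
    equal and independent base frequencies."""
    return 4 ** k


def amplicon_length(top_strand, first_primer, fixed_part):
    """Length from the start of the known first primer to the end of the
    nearest downstream site where the fixed part of the second primer
    can anneal (its reverse complement on the top strand).
    Returns None if either primer has no site."""
    start = top_strand.find(first_primer)
    if start < 0:
        return None
    site = reverse_complement(fixed_part)
    hit = top_strand.find(site, start + len(first_primer))
    if hit < 0:
        return None
    return hit + len(site) - start
```

  • Under these assumptions a 4, 5 or 6 nucleotide fixed sequence gives mean spacings of 256, 1024 and 4096 bases respectively, which is one way to see why the length of the fixed portion controls the typical size of the amplified stretch.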
  • a method comprising:
    a) synthesizing a partly fixed primer, with 4, 5, 6 nucleotide, or longer sequence characters fixed within it. The fixed sequence can be any sequence, with some preferred sequences such as those containing many G-C pairs, which increase binding affinity. The fixed position within the primer can be anywhere, with some preferred positions;
    b) taking a very long genomic DNA, either uncloned or a cloned large insert such as the YAC or cosmid in which a short sequence of about 20 characters somewhere within the DNA is known;
    c) synthesizing a primer from the sequence known from the DNA in step b;
    d) radiolabeling the primer in step c;
    e) annealing the primers (from step a, and step d or step g as appropriate) to the DNA in step b, and amplifying the DNA between the attached primers;
    f) performing DNA sequencing of the amplified DNA by the chemical degradation method of Maxam and Gilbert, or carrying out DNA sequencing by the Sanger method, or by modified PCR-sequencing method;
    g) after obtaining the DNA sequence from step f, selecting an appropriate first primer towards the 3' end of the sequence, synthesizing it, and radiolabeling it;
    h) repeating the steps e through g with
  • the partly fixed primer used to perform DNA amplification and sequencing are, of course, not limited to those described under the examples. Further modification in the method may be made by varying the length, content and position of the fixed sequence and the length of the random sequence. Additional obvious modifications include using different DNA polymerases and altering the reaction conditions of DNA amplification and DNA sequencing. Furthermore, the basic technique can be used for sequencing RNA using appropriate enzymes.
  • the first primer can also be prepared as follows. Two or three shorter oligonucleotides that would comprise the complete primer could be ligated, by joining end-to-end after annealing to the template DNA, as described under another patent (Helmut Blocker, U.S. Pat. No. 5,114,839, 435/6, 5/1992) or as described in the publication (L. E. Kotler, et al., Proceedings of the National Academy of Science, USA, 90:4241-4245 (1993)). Alternatively, it can be synthesized using the single-stranded DNA binding protein, the subject of another invention (J. Kieleczawa, et al., Science, 258:1787-1791 (1992)).
  • the first primer need not be synthesized at every PCR reaction while contiguously sequencing a long DNA, and can be directly constructed from an oligonucleotide bank.
  • the second primer also can be chosen from a set of only a few pre-prepared primers. This enables the direct automation of sequencing the whole long DNA by incorporating the primer elements into the series of sequential PCR reactions.
1.4.8.2 Advantages of Method
  • An advantage of the present invention is that from a known sequence in a very long DNA, sequencing can be performed in both directions on the DNA.
  • Two first primers can be prepared, one on each strand, running in the opposite directions, and the sequence can be extended on both directions until the two very ends of the long DNA are reached by the present invention, using a small set of pre-prepared partly fixed second primers.
  • One of the major advantages of the present invention is that it is highly amenable to various kinds of automation. Instead of radiolabeling the first known primer, it can be fluorescently labeled, and with this the DNA sequencing can be performed in an automated procedure on machines such as that marketed by the Applied Biosystems ("373 DNA Sequencer: Automated sequencing, sizing, and quantitation", a pamphlet from the Applied Biosystems, A Division of Perkin-Elmer Corporation (1994)). In the present invention there is no need to newly synthesize any primers to sequence a very long DNA.
  • an oligonucleotide bank for the synthesis of the first primer and a large supply of the template genomic DNA (or any long DNA)
  • the sequencing of the whole long DNA can be automated using robots almost without any human intervention, except for changing the sequencing gels.
  • the following processes can be computer controlled: 1) the selection of the appropriate sequence for constructing the first primer close to the 3' end of the newly worked out sequence, 2) determining whether the sequence obtained is too short and selection of a different partly fixed second primer, 3) assembling the contiguous DNA sequences from the various lanes and various gels and appending to a database, and other such processes.
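  • The first three computer-controlled steps can be sketched as below. This is an illustration only: the 20-nucleotide primer length follows the "about 20 characters" mentioned earlier, while the minimum read length, the function names and the exact-overlap assembly rule are assumptions.

```python
def next_first_primer(read, primer_len=20, min_read=100):
    """Pick the next first primer from the 3' end of a new read.

    Returns None when the read is too short, signalling that a
    different partly fixed second primer should be tried.
    """
    if len(read) < max(min_read, primer_len):
        return None
    return read[-primer_len:]


def extend_contig(contig, read, primer_len=20):
    """Append a new read to the contig.

    The read is assumed to begin with the previous first primer, i.e.
    with the last primer_len bases of the contig.
    """
    if len(contig) < primer_len:
        raise ValueError("contig shorter than the primer length")
    if not read.startswith(contig[-primer_len:]):
        raise ValueError("read does not start at the previous first primer")
    return contig + read[primer_len:]
```

  • Repeating these two calls after each amplification and sequencing round walks along the long DNA; a real system would also need to tolerate sequencing errors in the overlap rather than requiring an exact match.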
  • the present invention enables the construction of a fully automated contiguous DNA sequencing system. Any such automations are obvious modifications to the present invention.
  • the present invention is not limited to only unknown genomic DNA, and can be used to sequence any DNA under any situations.
  • DNAs or RNAs of many different origins e.g. viral, cDNA, mRNA
  • purposes such as disease diagnosis and treatment, DNA testing, and forensic applications.
  • any kit or process used for research, diagnostic, forensic, treatment, production or other purposes that uses the present invention is covered under these claims.
  • the various sequences of the partly fixed second primers that can be used in the present invention are covered under this patent.
  • any kit or process that uses this method and/or the DNA strands with the sequences that would comprise the partly fixed second primers will also be covered under this.
  • the present invention will cover the amplification of the DNA strands that are bounded between the known primer and the partly fixed second primer (either from claim 1 or from claim 2).
  • the DNA amplification can also be performed for long DNA strands using the long PCR amplification protocols.
  • a DNA sample is prepared by shearing or digestion at a first sequence with a first restriction enzyme producing a 3' overhang terminus, to some appropriate, known size distribution, and labeled with a digoxigenin bearing nucleotide by the action of terminal deoxynucleotidyl transferase.
  • said DNA sample is then subjected to random internal cleavage, for example by shearing so as to produce a population of molecules with an average length half that produced in the previous sizing step, or digestion with a second restriction enzyme recognizing a distinct, second recognition sequence.
  • Sample molecules of said sample are then bound at some convenient surface density to a transparent surface modified with a monolayer or a sub-monolayer density of anti-digoxigenin antibody.
  • Said sample molecules which will thus be bound to said transparent surface by the 3' termini of one strand, are then subjected to treatment by a 3' to 5' exonuclease, which will only act at the 3' terminus which does not bear the digoxigenin moiety due to the hindrance of this latter 3' terminus by its interaction with the surface, preferably not to completion of digestion of susceptible strands.
  • With each of the four nucleotides derivatized to effect communication of said nucleotides with a biotin moiety via a chemically cleavable linker such as those described by S.W. Ruby et al. (ref. 34), polymerization directed by the template provided by each involved DNA sample template molecule is effected with an appropriate DNA polymerase lacking a 3' to 5' exonuclease activity, such as Sequenase 2.0 (ref. 35), with only one nucleotide type present during each polymerization step sub-cycle, at sufficiently low concentration to effect equilibrium controlled stepping.
  • Polymerization reagents are then washed away, and may favorably be recycled after quantitation and readjustment of respective labeled nucleotide content.
  • biotin bearing molecules may be labeled with microscopic streptavidin coated beads. Unbound beads are then washed away. Bead labeled molecules may then be observed by a video microscope, and the position of said bead labeled molecules within a sample may be recorded by image analysis of digital images thus obtained, in a manner similar to that used by Finzi and Gelles.
  • Dithiothreitol or other reagents capable of cleaving said linker holding said biotin in communication with said nucleotide incorporated during the previous polymerization sub-cycle are then used to treat sample molecules to cleave said linkers and thus release said biotin labeling moieties and the beads which have bound to them.
  • a wash step is then performed to remove said beads. The extent of bead removal may be checked with another video microscopy detection step if needed; and further cleavage treatment may be performed if decoupling was not adequate.
  • the same subcycle (comprising polymerization, bead association, video microscopic examination, bead and label cleavage and removal by washing, and optionally a bead removal confirmation video microscopic examination step) is then repeated in succession for each of the three remaining nucleotide types, to complete a full base sequencing cycle (which as noted may yield information about more than one base location for some template molecules according to the sequence composition and the order of sub-cycles, and no information for other sample template molecules). Multiple said base sequence cycles are repeated until enough data have been accumulated relative to the total complexity of the initial DNA sample. Recorded data are then used to reconstruct sequence information for a segment of each sample template molecule, and segment sequence data are then aligned by appropriate computational algorithms.
  • This embodiment uses only existing and generally available materials and devices and relies on relatively simple manipulations that are known to be highly reproducible from their general use in the relevant fields; owing to the novel process of the present invention, however, it may yield genome sequence information far more rapidly and inexpensively than highly complex robotic instruments running sequencing methods that utilize electrophoretic separation.
  • the transparent substrate providing the surface for immobilization may be that of a spooled film, which may be advanced at an appropriate rate before the objective of said video microscope of the present embodiment.
  • said film may be circular, and continuously advanced through multiple video microscope apparata and wells effecting polymerization sub-cycles, all in appropriate order such that benefit of full pipelining of each step may be enjoyed.
  • Sequence determination may additionally be effected by the random immobilization, at some appropriate density, of appropriately prepared and primed sample molecules on the surface of a transparent film, and stepwise polymerization, with some appropriate polymerase, of all four nucleotides, all of which are protected at the 3'-hydroxyl with a photolabile (and hence photoremovable) protecting group in communication with labeling moieties which distinctly correspond to each nucleoside base type of the respective nucleotide.
  • Label incorporation is detected, for example by the scanned beam light microscopic methods of the present invention, or with highly sensitive CCDs, and assigned to the spatial region occupied by a particular molecule. Said film is translated appropriately such that the full complexity of the sample may be examined after each polymerization cycle.
  • Data are recorded electronically and according to the molecule for which they are obtained. Illumination of the sample with an appropriate frequency and intensity of light to effect 3'-hydroxy deprotection and hence also labeling moiety removal is performed, and a wash step is performed to remove freed label. Such polymerization, detection and deprotection cycles are repeated until the sample is sufficiently well characterized.
  • Methods of the present invention may be combined with the immobilization of highly diverse libraries of binding specificities with either encoding labels or phenogenocouples, which may therefore be characterized dynamically and related to any detected binding of particles of interest from a sample.
  • Clinical samples are interacted with said libraries. All retained material is then interacted with some general label such as a polynucleotide binding dye (e.g. ethidium bromide, DAPI) or some chromophorigenic or photoemissive or labeled competitive inhibitor analog reagent detecting some metabolically fundamental reaction such as ATP hydrolysis, or the presence of enzymes catalyzing said metabolically fundamental reaction.
  • various implementations may distribute binding specificities of known composition in a spatially controlled manner, and thus rely on spatial information to encode specificity type and hence, if known, composition of each specificity type.
  • said libraries may comprise known mimetics or small molecules of known binding specificity.
  • The profile of any sample type from an individual organism according to such an assay may be monitored over time, and a profile is preferably obtained for a state of presumed health for comparison to samples correlated to states of disease, deficiency or degeneration or other states of ill health (i.e. longitudinal tracking of individuals stratified by sample type). Samples of similar type may also be compared across populations and subpopulations, and the profile of these samples also correlated with the state of health of the respective individuals (cross-sectional comparison).
  • Samples characterized as above may be further characterized according to the immunocharacterization method below.
  • Banks comprising all of the specificities of a library may be maintained as monoclones, and upon detection of a pathogen in association with one or more binding specificity contained in some library, and the identification and/or characterization of said one or more binding specificity, an alignment of the respective said monoclone, from one of said banks, may be provided to the organism.
  • Such analysis and provision of one or more monoclones may be automated and controlled by algorithms.
  • enzymes contained within some sample may be analyzed according to their binding probability, binding duration or dissociation rate and conformational or phosphorylation or other status.
  • assays may favorably be performed by the methods of the present invention, with immobilized libraries which may include competitive inhibitors, and with pre- or post-binding labeling of sample enzymes by encoded label antibodies, to permit classification of sample enzyme type on a molecule by molecule basis, which classification data may be combined with the data obtained in this assay.
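
The single-nucleotide sub-cycle sequencing described earlier in this section (one nucleotide type present per polymerization sub-cycle, bead detection after each sub-cycle, then linker cleavage) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the method as claimed: the sub-cycle order "ACGT", the function names, and the idealization that each molecule extends by at most one base per sub-cycle and never fails to extend when the matching nucleotide is offered are all hypothetical simplifications introduced here. It shows why one full cycle may report more than one base for some molecules, depending on sequence composition and sub-cycle order.

```python
# Idealized model of sub-cycle stepping for one immobilized template molecule.
# Assumptions (not from the source): fixed sub-cycle order, at most one
# incorporation per sub-cycle, and no missed incorporations.

SUBCYCLE_ORDER = "ACGT"  # hypothetical order in which nucleotide types are offered


def run_subcycles(strand_to_synthesize, n_cycles, order=SUBCYCLE_ORDER):
    """Simulate n_cycles full cycles; return (cycle, nucleotide, detected) events.

    `strand_to_synthesize` is the sequence the polymerase would build on the
    template, written 5'->3'. `detected` stands for a bead being observed at
    the molecule's position after that sub-cycle.
    """
    events = []
    pos = 0  # index of the next base to be incorporated
    for cycle in range(n_cycles):
        for nt in order:
            detected = pos < len(strand_to_synthesize) and strand_to_synthesize[pos] == nt
            if detected:
                pos += 1  # one labeled nucleotide added; label is cleaved before the next sub-cycle
            events.append((cycle, nt, detected))
    return events


def reconstruct(events):
    """Rebuild the synthesized sequence from the recorded detection events."""
    return "".join(nt for _, nt, detected in events if detected)


def bases_per_cycle(events):
    """Count how many bases each full cycle reported for this molecule."""
    counts = {}
    for cycle, _, detected in events:
        counts[cycle] = counts.get(cycle, 0) + int(detected)
    return [counts[c] for c in sorted(counts)]


if __name__ == "__main__":
    events = run_subcycles("ACGTTG", n_cycles=3)
    print(reconstruct(events))
    print(bases_per_cycle(events))
```

In this idealized model the first cycle reads four bases of "ACGTTG" because they happen to follow the assumed sub-cycle order, while later cycles read one base each. A real run would also include molecules that report nothing in a given cycle, which this sketch does not model.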

EP01944583A 2000-06-14 2001-06-14 Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating Withdrawn EP1294869A2 (de)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US09/594,459 US6605449B1 (en) 1999-06-14 2000-06-14 Synthetic ligation reassembly in directed evolution
US594459 2000-06-14
US677584 2000-09-30
US09/677,584 US7033781B1 (en) 1999-09-29 2000-09-30 Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating
PCT/US2001/019367 WO2001096551A2 (en) 2000-06-14 2001-06-14 Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating

Publications (1)

Publication Number Publication Date
EP1294869A2 (de) 2003-03-26

Family

ID=56290150

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01944583A Withdrawn EP1294869A2 (de) 2000-06-14 2001-06-14 Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating

Country Status (4)

Country Link
EP (1) EP1294869A2 (de)
AU (1) AU2001266978A1 (de)
CA (1) CA2413022A1 (de)
WO (1) WO2001096551A2 (de)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6991922B2 (en) 1998-08-12 2006-01-31 Proteus S.A. Process for in vitro creation of recombinant polynucleotide sequences by oriented ligation
US6951719B1 (en) 1999-08-11 2005-10-04 Proteus S.A. Process for obtaining recombined nucleotide sequences in vitro, libraries of sequences and sequences thus obtained
CN103484486B (zh) 2003-07-02 2018-04-24 维莱尼姆公司 葡聚糖酶,编码它们的核酸以及制备和使用它们的方法
US7741089B2 (en) 2003-08-11 2010-06-22 Verenium Corporation Laccases, nucleic acids encoding them and methods for making and using them
US20080070291A1 (en) 2004-06-16 2008-03-20 David Lam Compositions and Methods for Enzymatic Decolorization of Chlorophyll
BRPI0510939A (pt) 2004-06-16 2007-07-17 Diversa Corp composições e métodos para descoloração enzimática de clorofila
US20090038023A1 (en) 2005-03-10 2009-02-05 Verenium Corporation Lyase Enzymes, Nucleic Acids Encoding Them and Methods For Making and Using Them
KR101393679B1 (ko) 2005-03-15 2014-05-21 비피 코포레이션 노쓰 아메리카 인코포레이티드 셀룰라제, 이것을 암호화하는 핵산 및 이들을 제조 및사용하는 방법
MY160772A (en) 2006-02-10 2017-03-15 Verenium Corp Cellulolytic enzymes, nucleic acids encoding them and methods for making and using them
USRE45660E1 (en) 2006-02-14 2015-09-01 Bp Corporation North America Inc. Xylanases, nucleic acids encoding them and methods for making and using them
EP2385108B1 (de) 2006-03-07 2016-11-23 BASF Enzymes LLC Adolasen, Nukleinsäuren, die diese codieren, und Verfahren zu deren Herstellung und Verwendung
WO2007103389A2 (en) 2006-03-07 2007-09-13 Cargill, Incorporated Aldolases, nucleic acids encoding them and methods for making and using them
CA2669453C (en) 2006-08-04 2018-11-13 Verenium Corporation Glucanases, nucleic acids encoding them and methods for making and using them
CN101652381B (zh) 2007-01-30 2014-04-09 维莱尼姆公司 用于处理木质纤维素的酶、编码它们的核酸及其制备和应用方法
JP5744518B2 (ja) 2007-10-03 2015-07-08 ビーピー・コーポレーション・ノース・アメリカ・インコーポレーテッド キシラナーゼ、キシラナーゼをコードする核酸並びにそれらを製造及び使用する方法
US8709772B2 (en) 2008-01-03 2014-04-29 Verenium Corporation Transferases and oxidoreductases, nucleic acids encoding them and methods for making and using them
CN111154798B (zh) * 2020-02-18 2021-07-20 杭州师范大学 马铃薯x病毒在诱导番茄种子胎萌中的应用及应用方法
CN113041190B (zh) * 2021-03-19 2023-03-17 河南董欣生物科技有限公司 抗氧化组合物、制备方法及用途
CN117467751B (zh) * 2023-12-27 2024-03-29 北京百力格生物科技有限公司 靶向目的基因fish荧光探针及其自组装放大探针系统

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000009755A2 (en) * 1998-08-12 2000-02-24 Pangene Corporation Domain specific gene evolution
WO2000009679A1 (fr) * 1998-08-12 2000-02-24 Proteus (S.A.) Procede d'obtention in vitro de sequences polynucleotidiques recombinees, banques de sequences et sequences ainsi obtenues

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6087129A (en) * 1996-01-19 2000-07-11 Betagene, Inc. Recombinant expression of proteins from secretory cell lines
US6326204B1 (en) * 1997-01-17 2001-12-04 Maxygen, Inc. Evolution of whole cells and organisms by recursive sequence recombination
JP4062366B2 (ja) * 1997-01-17 2008-03-19 マキシジェン,インコーポレイテッド 再帰的配列組換えによる全細胞および生物の進化
FR2761576B1 (fr) * 1997-04-04 2002-08-30 Inst Nat Sante Rech Med Animal transgenique non humain dans lequel l'expression du gene codant pour l'insuline est supprimee
JP2002537758A (ja) * 1998-09-29 2002-11-12 マキシジェン, インコーポレイテッド コドン変更された遺伝子のシャッフリング
AU3900700A (en) * 1999-03-17 2000-10-04 Paradigm Genetics, Inc. Methods and materials for the rapid and high volume production of a gene knock-out library in an organism
WO2001002555A1 (en) * 1999-07-06 2001-01-11 Institut Pasteur Method of making and identifying attenuated microorganisms, compositions utilizing the sequences responsible for attenuation, and preparations containing attenuated microorganisms

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
BENDER AND KLECKNER: "TN10 insertion specificity is strongly dependent upon sequences immediately adjacent to the target-site consensus sequence", PROC. NATL. ACAD. SCI, vol. 89, September 1992 (1992-09-01), pages 7996 - 8000 *
CHRISTIANS ET AL: "Directed evolution of thymidine kinase for AZT phosphorylation using DNA family shuffling", NATURE BIOTECHNOLOGY, vol. 17, March 1999 (1999-03-01), pages 259 - 264 *
GRAUSCHOPF ET AL.: "Why is DsbA Such an Oxidizing Disulfide Bond Catalyst?", CELL, vol. 83, 15 December 1995 (1995-12-15), pages 947 - 955 *
HOTZ AND SCHWER: "Mutational Analysis of the Yeast DEAH-Box Splicing Factor Prp16", GENETICS, vol. 149, June 1998 (1998-06-01), pages 807 - 815 *
KAST AND HILVERT: "Genetic selection strategies for generating and characterizing catalysts.", PURE AND APPL. CHEM., vol. 68, no. 11, 1996, pages 2017 - 2024 *
KOZMINSKI ET AL.: "Functions and Functional Domains of the GTPase Cdc42p", MOL. BIOL. CELL., vol. 11, January 2000 (2000-01-01), pages 339 - 354 *
MÖSSNER ET AL.: "Importance of Redox Potential for the in Vivo Function of the Cytoplasmic Disulfide Reductant Thioredoxin from Escherichia coli", J. BIOL. CHEM., vol. 274, no. 36, 3 September 1999 (1999-09-03), pages 25254 - 25259 *
See also references of WO0196551A3 *
TIAN ET AL.: "A mutant hunt for defects in membrane protein assembly yields mutations affecting the bacterial signal recognition particle and Sec machinery", PROC. NATL. ACAD. SCI., vol. 97, no. 9, 25 April 2000 (2000-04-25), pages 4730 - 4735 *
VAGNER ET AL: "A vector for systematic gene inactivation in Bacillus subtilis.", MICROBIOLOGY, vol. 144, 1998, pages 3097 - 3104 *
ZHANG ET AL: "Directed evolution of a fucosidase from a galactosidase by DNA shuffling and screening", PROC NATL ACAD SCI, vol. 94, April 1997 (1997-04-01), pages 4504 - 4509 *

Also Published As

Publication number Publication date
WO2001096551A3 (en) 2002-05-23
CA2413022A1 (en) 2001-12-20
AU2001266978A1 (en) 2001-12-24
WO2001096551A2 (en) 2001-12-20

Similar Documents

Publication Publication Date Title
US7033781B1 (en) Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating
US20050124010A1 (en) Whole cell engineering by mutagenizing a substantial portion of a starting genome combining mutations and optionally repeating
WO2002029032A2 (en) Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating
WO2001096551A2 (en) Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating
US6379964B1 (en) Evolution of whole cells and organisms by recursive sequence recombination
AU771511B2 (en) Evolution of whole cells and organisms by recursive sequence recombination
US7629170B2 (en) Evolution of whole cells and organisms by recursive sequence recombination
AU2005202462B2 (en) Evolution of whole cells and organisms by recursive sequence recombination
AU2004200501A1 (en) Evolution of Whole Cells and Organisms by Recursive Sequence Recombination
MXPA00012522A (es) Evolución de células y organismos enteros mediante la recombinación de secuencias recursivas

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20030113

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

17Q First examination report despatched

Effective date: 20031208

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20060110