WO2008147879A1

WO2008147879A1 - Automated method and device for dna isolation, sequence determination, and identification

Info

Publication number: WO2008147879A1
Application number: PCT/US2008/064519
Authority: WO
Inventors: Ryan Golhar
Original assignee: Ryan Golhar
Priority date: 2007-05-22
Filing date: 2008-05-22
Publication date: 2008-12-04

Abstract

A processing technique, associated method, product description, and related software are disclosed for achieving rapid identification of DNA from one or more organisms contained within an organic sample such as blood, tissue, sputum, urine, cell culture, water, leaf spot, or any other form suitable for containing DNA. A biological sample is provided in a container, and the DNA contained within the sample is isolated and purified. The purified DNA is then sequenced using a single-molecule DNA sequencing technique. The resulting DNA sequences are identified by comparing the sequences to a DNA database. The resulting database matches are then reported.

Description

AUTOMATED METHOD AND DEVICE FOR DNA ISOLATION, SEQUENCE DETERMINATION, AND IDENTIFICATION

Related Information

This application claims the benefit of U.S. Provisional Patent Application Serial No. 60/931,285, filed May 22, 2007. The contents of this application and of any patents, patent applications, and references cited throughout this specification are hereby incorporated by reference in their entireties.

Background of the Invention

The rapid identification of the nucleic acid sequences present in a complex biological sample has many practical applications. For example, the ability to rapidly identify the presence of pathogens in a biological sample, via their DNA or RNA signature, would be of enormous importance for the identification of hazardous bioagents or the diagnosis of disease in human patients.

The majority of current methods for pathogen identification require specimen culturing or detection with pathogen-specific antibodies, neither of which is possible for all types of infections. Molecular diagnostic methods involve detecting the hybridization of pathogen DNA or RNA present in the sample to known probes using DNA chips. Such methods are limited to the detection of known pathogens; thus, as pathogens mutate, the pathogenic DNA may no longer hybridize to existing probes and new probes must be developed. Alternative methods of pathogen identification include nucleic acid sequencing of DNA or RNA present in the sample. However, current sequencing methodologies for pathogen identification are based on Sanger DNA sequencing, which both requires amplification of the target nucleic acid and allows only a single nucleotide sequence to be identified from each sequencing reaction. Sanger sequencing is performed on a single known DNA fragment of interest. Thus, amplification and sequencing of the target nucleic acid implies a priori knowledge of the pathogen contained within the sample. Moreover, none of these current detection methods is capable of seamless, integrated operation.

There is therefore a need in the art for alternative methods and devices for the rapid identification of nucleic acid sequences present in biological samples.

Summary of the Invention

The present invention provides novel methods, software and devices for the rapid identification of any nucleic acid sequence or nucleic acid-containing bioagent present in a biological sample. The present invention involves: a) isolating nucleic acid from a biological sample; b) sequencing the nucleic acid within the sample using single-molecule sequencing technology; and c) analyzing the derived nucleic acid sequences by comparison to reference sequence(s), for example, in a database.

The present invention has many uses in areas that would require a rapid and integrated molecular diagnostic identification system. The present invention allows extremely rapid and accurate detection and identification of bioagents compared to existing methods. Furthermore, this rapid detection and identification is possible even when sample material is impure. Thus, the invention is useful in a wide variety of fields, including, but not limited to, medical diagnosis and pharmacogenetic analysis (including: diagnosis of infectious diseases and conditions; cancer diagnosis based on mutations and polymorphisms; drug resistance and susceptibility testing; screening for and/or diagnosis of genetic diseases and conditions), germ warfare (allowing immediate identification of the bioagent and appropriate treatment), environmental testing (e.g., detection and discrimination of pathogenic vs. non-pathogenic bacteria in soil, water or other samples), agricultural testing (e.g., detection of livestock infection, produce contamination), veterinary testing, and forensics (e.g., rapid detection of bioagents for molecular fingerprinting).

The present invention can be used to detect and classify any bioagent containing nucleic acid (e.g., DNA), including bacteria, viruses, fungi and toxins. As one example, where the bioagent is a biological threat, the information obtained is used to determine practical information needed for countermeasures, including toxin genes, pathogenicity islands and antibiotic resistance genes. In addition, the methods can be used to identify natural or deliberate engineering events including chromosome fragment swapping, molecular breeding (gene shuffling), DNA mutations (preventing DNA chip or primer hybridization) and emerging infectious diseases. Accordingly, the invention has several advantages that include, but are not limited to, the following: providing integrated methods for the rapid identification of any nucleic acid sequence or nucleic acid-containing biological organisms present in a complex biological sample directly from the sample without the need for amplification of the nucleic acid; providing software for identifying the source organism of any deduced nucleic acid sequence; and providing devices capable of performing the integrated processing of complex biological samples to determine the identity and predicted source of any nucleic acid present.

Other features and advantages of the invention will be apparent from the following detailed description, and from the claims.

Brief Description of the Figures

Figure 1. Depicts an environment suitable for practicing an embodiment of the present invention;

Figure 2. Depicts an alternative distributed environment suitable for practicing an embodiment of the present invention;

Figure 3. Depicts a flowchart of a sequence of steps that may be followed by an embodiment of the present invention to predict bioagents present in a nucleic acid sequence isolated from a biological sample and subjected to a single molecule sequencing operation.

Detailed Description of the Invention

In order to provide a clear understanding of the specification and claims, the following definitions are provided below.

Definitions

So that the invention may be more readily understood, certain terms are first defined.

The term "bioagent" refers to any organism, living or dead, or a nucleic acid derived from such an organism. Examples of bioagents include, but are not limited to, cells (including but not limited to human clinical samples, bacterial cells and other pathogens), viruses, toxin genes and bioregulating compounds. Samples may be alive or dead or in a vegetative state (for example, vegetative bacteria or spores) and may be encapsulated or bioengineered.

The term "sample" refers to any form of matter capable of containing a bioagent. Examples of samples include, but are not limited to, blood, animal tissue, sputum, urine, cell culture medium, water, leaf spot, soil, plant tissue, paleontology samples, forensic samples, food, and powders.

The terms "nucleic acid" and "single-stranded nucleic acid" refer to RNA or RNA-containing molecules as well as DNA or DNA-containing molecules. The term "RNA" refers to a polymer of ribonucleotides. The term "DNA" or "DNA molecule" or "deoxyribonucleic acid molecule" refers to a polymer of deoxyribonucleotides. DNA and RNA can be synthesized naturally (e.g., by DNA replication or transcription of DNA, respectively). RNA can be post-transcriptionally modified. DNA and RNA can also be chemically synthesized. DNA and RNA can be single-stranded (i.e., ssDNA and ssRNA, respectively), or multi-stranded (e.g., double stranded, i.e., dsDNA and dsRNA, respectively), i.e., duplexed or annealed.

The term "nucleic acid sequence" refers to the ordering of the individual nucleotides in a DNA or RNA polymer.

The term "single-molecule sequencing" refers to any method of determining the sequence of an individual nucleic acid molecule without the need for prior amplification.

The term "compare", when used with respect to nucleic acid sequences, refers to the alignment of one or more nucleic acid sequences to establish a percentage identity or similarity (identity and similarity will be used interchangeably) using, for example, a mathematical algorithm. To determine the percent identity of two nucleic acid sequences (or of two amino acid sequences), the sequences are aligned for optimal comparison purposes (e.g., gaps can be introduced in the first sequence or second sequence for optimal alignment). The nucleotides (or amino acid residues) at corresponding nucleotide (or amino acid) positions are then compared. When a position in the first sequence is occupied by the same residue as the corresponding position in the second sequence, then the molecules are identical at that position. The percent identity between the two sequences is a function of the number of identical positions shared by the sequences (i.e., % homology = # of identical positions (+ # of substitutions for bases or amino acids) / total # of positions x 100), optionally penalizing the score for the number of gaps introduced and/or length of gaps introduced. The alignment can be generated over a certain portion of the sequence (i.e., a local alignment). A non-limiting example of a local alignment algorithm utilized for the comparison of sequences is the algorithm of Karlin and Altschul (1990) Proc. Natl. Acad. Sci. USA 87:2264-68, modified as in Karlin and Altschul (1993) Proc. Natl. Acad. Sci. USA 90:5873-77. Such an algorithm is incorporated into the BLAST programs (version 2.0) of Altschul, et al. (1990) J. Mol. Biol. 215:403-10.
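As a non-limiting illustration of the percent identity calculation described above, the following Python sketch counts identical positions over an alignment that has already been computed. The function name `percent_identity` and the use of `-` as the gap character are illustrative assumptions. The sketch counts only identical positions and treats gap columns as non-identical positions; it does not implement the optional substitution credit or gap penalties mentioned above.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns of a pairwise alignment.

    Both inputs must already be aligned (equal length, gaps written as `gap`).
    Columns containing a gap count toward the total number of positions
    but are never counted as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b)
        if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)
```

For example, two four-column alignments that agree at three positions each yield 75 percent identity under this definition.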
The alignment can be optimized by introducing appropriate gaps and percentage identity determined over the length of the aligned sequence (i.e., a gapped alignment). To obtain gapped alignments for comparison purposes, Gapped BLAST can be utilized as described in Altschul et al., (1997) Nucleic Acids Res. 25(17):3389-3402. In another embodiment, the alignment is optimized by introducing appropriate gaps and percent identity is determined over the entire length of the sequences aligned (i.e., a global alignment). A preferred, non-limiting example of a mathematical algorithm utilized for the global comparison of sequences is the algorithm of Myers and Miller, CABIOS (1989). Such an algorithm is incorporated into the ALIGN program (version 2.0) which is part of the GCG sequence alignment software package. Another global alignment algorithm is that of Needleman-Wunsch, (1970) J. Mol. Biol. 48:443-453.

Various aspects of the invention are described in further detail in the following subsections.
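As a non-limiting illustration of the global alignment referred to above, the following Python sketch implements a basic Needleman-Wunsch dynamic program with a linear gap penalty. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions; the sketch is not the ALIGN program or the Myers and Miller algorithm, and it stores the full score matrix, so it is suited only to short sequences.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The two aligned strings returned by this sketch have equal length and can be passed to a percent identity calculation over the entire alignment.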

1. Overview

The present invention provides novel methods, software algorithms and devices for the rapid identification of any nucleic acid sequence or nucleic acid-containing bioagent present in a biological sample. The present invention involves: a) isolating nucleic acid from a biological sample; b) sequencing the totality of nucleic acid within the sample using single-molecule sequencing technology; and c) analyzing the derived nucleic acid sequences by comparison to a database.

In one aspect, the invention provides methods for the identification of any nucleic acid sequence or nucleic acid-containing bioagent present in a biological sample. In one embodiment, a sample suspected of containing a bioagent capable of causing a disease or disorder is obtained. In a related embodiment, a blood sample is obtained from a human patient suspected of having contracted an infectious, bioagent-induced disease. The total nucleic acid content of either sample is extracted by art-recognized means and subjected to a single-molecule sequencing reaction. The resultant nucleic acid sequence data are then searched against reference sequences in databases using a software algorithm, and the predicted source of the nucleic acid is reported.

In another aspect, the invention provides a physical medium that holds computer-executable instructions for identifying bioagents present in a biological sample. The medium holds instructions for receiving at least one result of a single molecule sequencing reaction conducted on nucleic acid in a biological sample. The medium further holds computer-executable instructions for comparing the received nucleic acid sequence obtained from the single molecule sequencing reaction to one or more reference sequences contained in a database in order to predict at least one bioagent present in the biological sample.
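As a non-limiting illustration of such instructions, the following Python sketch receives sequencing reads, compares them to a small in-memory reference collection, and reports the predicted bioagents. The comparison here is a simple count of shared k-mers (subsequences of length k), which stands in for an alignment search such as BLAST and is an assumption of this sketch. The function names, the parameters `k` and `min_shared`, and the reference names are likewise hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=8):
    """Map each k-mer to the names of the reference sequences containing it."""
    index = {}
    for name, seq in references.items():
        for km in kmers(seq.upper(), k):
            index.setdefault(km, set()).add(name)
    return index


def predict_bioagents(reads, index, k=8, min_shared=3):
    """Assign each read to its best-matching reference and tally the results.

    A read is assigned only if it shares at least `min_shared` k-mers with
    its best reference. Returns (reference name, read count) pairs, most
    frequently assigned reference first.
    """
    votes = Counter()
    for read in reads:
        hits = Counter()
        for km in kmers(read.upper(), k):
            for name in index.get(km, ()):
                hits[name] += 1
        if hits:
            name, shared = hits.most_common(1)[0]
            if shared >= min_shared:
                votes[name] += 1
    return votes.most_common()
```

Reads that match no reference are left unassigned in this sketch; a fuller implementation might report them separately as candidate novel sequences.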

In another aspect, the invention provides devices for the identification of any nucleic acid sequence or nucleic acid-containing bioagent present in a biological sample. In one embodiment, a device is contacted with a sample and said device performs all the combined functions of the invention in an integrated manner, i.e., nucleic acid extraction, single-molecule sequencing, database searching and source organism reporting.

In another aspect, the invention provides a means to acquire patient-specific, as well as general, population-based data concerning the genetic basis of diseases and disorders.

In another aspect, the invention provides a means to acquire gene expression analysis data indicative of a change in physiological status of an organism. In another aspect, the invention provides a means to acquire epidemiological data.

In another aspect, the invention provides methods for performing pharmacogenomics. In another aspect, the invention provides a means for testing livestock animals for diseases such as foot-and-mouth disease and mad cow disease.

2. Selecting a Biological Sample

The present invention provides methods and devices for the identification of nucleic acid molecules contained within a biological sample. Exemplary samples include, but are not limited to, blood, animal tissue, sputum, urine, cell culture medium, water, leaf spot, soil, plant tissue, paleontology samples, forensic samples, food or any form of matter capable of containing bioagents or nucleic acid. Several independent sources of nucleic acid may exist in the sample. In the case of human blood, human DNA and RNA will be present in white blood cells, in addition to the nucleic acid present in any infectious bioagents that may be present.

3. Bioagents

The present invention provides methods and devices for the identification of bioagents via the presence of their nucleic acids. In the context of the present invention, a "bioagent" is any organism, living or dead, or a nucleic acid derived from such an organism. Examples of bioagents include, but are not limited to, cells (including but not limited to human clinical samples, bacterial cells and other pathogens), viruses, toxin genes and bioregulating compounds. Samples may be alive or dead or in a vegetative state (for example, vegetative bacteria or spores) and may be encapsulated or bioengineered.

Bacterial biological warfare bioagents capable of being detected by the present methods include, but are not limited to, Bacillus anthracis (anthrax), Yersinia pestis (pneumonic plague), Francisella tularensis (tularemia), Brucella suis, Brucella abortus, Brucella melitensis (undulant fever), Burkholderia mallei (glanders), Burkholderia pseudomallei (melioidosis), Salmonella typhi (typhoid fever), Rickettsia typhi (endemic typhus), Rickettsia prowazekii (epidemic typhus), Coxiella burnetii (Q fever), Rhodobacter capsulatus, Chlamydia pneumoniae, Escherichia coli, Shigella dysenteriae, Shigella flexneri, Bacillus cereus, Clostridium botulinum, Pseudomonas aeruginosa, Legionella pneumophila, Borrelia burgdorferi (Lyme disease), and Vibrio cholerae.

Biological warfare fungus bioagents include, but are not limited to, Coccidioides immitis (coccidioidomycosis). Biological warfare toxin genes capable of being detected by the methods of the present invention include, but are not limited to, botulism, T-2 mycotoxins, ricin, staph enterotoxin B, shigatoxin, abrin, aflatoxin, Clostridium perfringens epsilon toxin, conotoxins, diacetoxyscirpenol, tetrodotoxin, and saxitoxin. Biological warfare viral bioagents are mostly RNA viruses (positive-strand and negative-strand), with the exception of smallpox. Every RNA virus is a family of related viruses (quasispecies). These viruses mutate rapidly and the potential for engineered strains (natural or deliberate) is very high. RNA viruses cluster into families that have conserved RNA structural domains on the viral genome (e.g., virion components, accessory proteins) and conserved housekeeping genes that encode core viral proteins including, for single strand positive strand RNA viruses, RNA-dependent RNA polymerase, double stranded RNA helicase, chymotrypsin-like and papain-like proteases and methyltransferases.

Examples of (-)-strand RNA viruses include arenaviruses (e.g., Sabia virus, Lassa fever, Machupo, Argentine hemorrhagic fever, Flexal virus), bunyaviruses (e.g., hantavirus, nairovirus, phlebovirus, Hantaan virus, Crimean-Congo hemorrhagic fever, Rift Valley fever), and mononegavirales (e.g., filovirus, paramyxovirus, Ebola virus, Marburg, equine morbillivirus).

Examples of (+)-strand RNA viruses include picornaviruses (e.g., coxsackievirus, echovirus, human coxsackievirus A, human echovirus, human enterovirus, human poliovirus, hepatitis A virus, human parechovirus, human rhinovirus), astroviruses (e.g., human astrovirus), caliciviruses (e.g., chiba virus, chitta virus, human calicivirus, Norwalk virus), nidovirales (e.g., human coronavirus, human torovirus), flaviviruses (e.g., dengue viruses, Japanese encephalitis virus, Kyasanur forest disease virus, Murray Valley encephalitis virus, Rocio virus, St. Louis encephalitis virus, West Nile virus, yellow fever virus, hepatitis C virus) and togaviruses (e.g., Chikungunya virus, Eastern equine encephalitis virus, Mayaro virus, O'nyong-nyong virus, Ross River virus, Venezuelan equine encephalitis virus, Rubella virus, hepatitis E virus).

4. Nucleic Acid Extraction

The present invention can employ at least partial purification of target nucleic acid molecules. All art-recognized methods of nucleic acid extraction and purification are contemplated. Exemplary methods include those commercialized by QIAGEN or PROMEGA. Nucleic acid purification on nanoengineered surfaces (as exemplified in U.S. patent application US20060166223) is also contemplated. Where biological samples are desiccated, the sample will, where necessary, be solubilized using appropriate art-recognized solvents to facilitate nucleic acid extraction.

5. Single Molecule Sequencing

The present invention involves nucleic acid sequencing at the single molecule level. Several art-recognized methods of single-molecule sequencing have been developed (see U.S. patent application US2006000400730 and U.S. patents 7,169,560; 6,221,592; 6,905,586; 6,524,829; 6,242,193; and 6,136,543). Single molecule sequencing is a powerful tool capable of elucidating sequence-specific information on a single nucleic acid template. The ability to conduct single template sequencing allows the identification of subtle, often rare-event, changes in nucleic acids that are important as the underlying basis for diseases such as cancer and others.

Single molecule sequencing also provides the ability to rapidly analyze a multitude of single nucleic acid templates, from a single sample, in parallel and with a high degree of precision. Using an isolated nucleic acid sequence as the substrate, individual labeled nucleotides are added sequentially by a polymerase to a growing complement strand. A label is detected as each nucleotide is added to the strand and the template sequence is determined.

In one embodiment, the invention comprises exposing a nucleic acid primer to a template sequence in the presence of a polymerase and at least one labeled nucleotide base that is capable of hybridizing with a template nucleic acid downstream of the hybridized primer. Nucleotide bases may be selected from the common Watson-Crick bases (adenine, thymine, cytosine, guanine, and uracil), or may be modifications of those bases, such as peptide nucleic acids, ribonucleotides, or nucleotides modified to incorporate a detectable label (e.g., with linkers or adapters). As each nucleotide is added to the growing complement strand, its label is detected and its position on the template is noted. Once a sufficient number of nucleotides have been incorporated, a sequence is determined. Methods of the invention facilitate rapid whole genome sequencing. Methods of the invention, however, also contemplate partial genome sequencing to obtain template or fingerprint sequences, thereby facilitating even more rapid sequence comparisons. Suitable nucleic acid templates include DNA, RNA and RNA/DNA hybrids.
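As a non-limiting illustration of how a template sequence may be determined from the detected labels, the following Python sketch converts the ordered list of incorporated bases into the template sequence. It assumes a DNA template, error-free label detection, and a complement strand synthesized in the 5' to 3' direction, so that the template read 5' to 3' is the reverse complement of the incorporated bases. The function name and these simplifications are illustrative assumptions.

```python
# Watson-Crick pairing for a DNA template (assumption: no RNA, no modified bases).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def template_from_incorporations(incorporated):
    """Infer the template (5'->3') from bases detected in order of addition.

    `incorporated` lists the labeled bases in the order the polymerase added
    them, i.e. the growing complement strand read 5'->3'.
    """
    return "".join(COMPLEMENT[base] for base in reversed(incorporated))
```

For example, if the labels detected in order are A, T, G, C, the inferred template read 5' to 3' is GCAT.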

In another embodiment, the invention comprises passing a single-stranded nucleic acid through a nano-pore. As the ssDNA travels through the nano-pore, the ssDNA passes over 4 nano-probes each containing one of the four nucleotide bases. Each time a probe hybridizes with the ssDNA, the signal is detected and the template sequence is determined.

6. Devices

In another aspect, the present invention provides devices for the identification of nucleic acid molecules and nucleic acid-containing bioagents contained within a biological sample. In one embodiment, the device contains an integrated means of nucleic acid purification, single molecule sequencing, and sequence analysis.

Embodiments where any one or more of the aforementioned functions are performed outside or remote from the device are also contemplated. In another embodiment the device is portable, preferably handheld. In another embodiment, the device may also include a microfabricated biopsy instrument as exemplified in U.S. patent application 2003/0119176A1. In another embodiment, the device connects wirelessly to a computer. In another embodiment, the device is part of a remotely controlled vehicle. In another embodiment, the device is capable of being operated by remote control. In another embodiment, the device is disposable. In another embodiment the device is biodegradable. In another embodiment, the device is designed and/or packaged for home use, hospital use, or police/military use.

7. Genomic DNA Analysis

In another aspect, the present invention allows for the acquisition of patient-specific, as well as general, population-based data concerning the genetic basis of diseases and disorders. Cancer is an example of a disease or disorder that has a strong genetic basis. Complete sequencing of large numbers of tumors using single molecule sequencing provides a catalog of somatic cell mutations (including, without limitation, deletions, additions, amplifications, rearrangements, substitutions, losses, translocations, methylation, and other alterations of genomic DNA) that are useful to diagnose, evaluate, prognose, and treat patients. A catalog of disease-related mutations and other alterations is a powerful diagnostic tool useful to rapidly categorize samples sequenced from future patients. Moreover, single molecule sequencing allows one to identify previously-unknown mutations that may be associated with cancer. Finally, single molecule sequencing on pooled samples allows rapid identification of deletions, amplifications, and other changes that are indicative of cancer, even if the specific mutational change is not known.

Analysis of genomic DNA using single molecule sequencing provides an approach that allows rapid identification of a genomic change present in a sample in low amounts. The ability to quickly and accurately perform rare-event detection is of great significance for the early diagnosis of cancer. Many cancers, if detected early, are treatable, and if detected too late may not be treatable. Cancer begins as somatic cell mutations accumulate in a very small initial population of cells. In samples typically obtained for genomic analysis, cancer or precancer cells are in very low abundance compared to healthy somatic cells. Bulk mutation detection mechanisms typically fail to detect these rare-event changes. A digital technique, such as single molecule sequencing, allows rapid sequencing through mutations in multiple single templates. This, in turn, allows the detection of the rare-event mutations underlying cancer or precancer. In one embodiment of the invention, tumor DNA is obtained and prepared using standard methods. Approximately 10 times coverage of each genomic region is sequenced. Using single molecule sequencing, the genome of the cancer tissue is rapidly sequenced. Mutations, insertions, deletions, rearrangements, and other alterations present in the tumor DNA are detected. Sequence assembly is accomplished using standard alignment techniques, such as BLAST (www.ncbi.nlm.nih.gov), incorporated by reference herein. Tumor sequences are compared to known sequences for either normal or cancer tissue or to consensus sequences in order to identify changes associated with cancer. Newly discovered genomic changes (i.e., those not previously associated with cancer) are cataloged and become known to be associated with a particular disease over time. Thus, patients are rapidly and accurately diagnosed based upon their individual genomic complement, either before or at the time of symptomatic presentation of a disease.
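As a non-limiting illustration of rare-event detection from single-molecule reads, the following Python sketch tallies the base observed at each reference position across reads and reports non-reference bases seen in a minimum number of independent reads, together with the read depth at that position. It assumes reads have already been aligned to the reference without gaps; the function name, the input format, and the `min_reads` threshold are illustrative assumptions.

```python
from collections import Counter, defaultdict


def call_rare_variants(reference, aligned_reads, min_reads=2):
    """Report non-reference bases supported by at least `min_reads` reads.

    `aligned_reads` is an iterable of (start, sequence) pairs, where `start`
    is the 0-based reference position of the first base of an ungapped read.
    Returns tuples of (position, reference base, variant base,
    supporting reads, depth), ordered by position.
    """
    counts = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                counts[pos][base] += 1
    variants = []
    for pos in sorted(counts):
        depth = sum(counts[pos].values())
        for base, n in sorted(counts[pos].items()):
            if base != reference[pos] and n >= min_reads:
                variants.append((pos, reference[pos], base, n, depth))
    return variants
```

Because each read derives from one molecule, a variant present in two of ten reads at a position is reported with its count and depth rather than being averaged into a bulk signal.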

In another embodiment of the invention, DNA is isolated from a patient's tumor or other diseased sample and is compared to normal DNA from the same patient. Whole genome sequencing of both the tumor and normal DNA may be done rapidly on a parallel basis using single molecule sequencing as described above. Alternatively, only portions of the genome are sequenced and compared. Genome portions of interest include, for example, sequences associated with a known or candidate tumor suppressor gene or oncogene, or intronic sequences containing repeats that are susceptible to amplification by defective cellular machinery. Following sequence determination, a comparison is made between tumor and normal sequence. Differences between the tumor and normal sequences are identified as tumor-related mutations. In effect, any difference between the two likely is indicative of disease because all somatic cells should have the same sequence. Detection of a variation from the normal somatic cell sequence, indicating that a population of cells containing abnormal sequences is present, results in a positive diagnosis. Alternatively, patient tumor sequence may be compared to a normal banked or consensus sequence instead of the patient's own normal DNA.
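As a non-limiting illustration of the tumor and normal comparison described above, the following Python sketch lists the positions at which a tumor sequence differs from the matched normal sequence. It assumes the two sequences have already been aligned to equal length, with `-` marking gaps; the function name and output format are illustrative assumptions.

```python
def tumor_specific_differences(normal, tumor):
    """Return (position, normal base, tumor base) for every differing column.

    Both sequences must be pre-aligned to the same length; a "-" in either
    sequence marks a deletion or insertion relative to the other.
    """
    if len(normal) != len(tumor):
        raise ValueError("sequences must be aligned to equal length")
    return [
        (pos, n_base, t_base)
        for pos, (n_base, t_base) in enumerate(zip(normal, tumor))
        if n_base != t_base
    ]
```

An empty result indicates that no difference was found over the compared region; a non-empty result lists candidate tumor-related changes for further review.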

In another embodiment broad-based disease susceptibility testing is performed using single molecule sequencing on pooled genomic samples. For example, in a large population, the number of positive samples (i.e., those with a mutation present) is relatively small. Bulk sequencing likely would not detect mutations in pooled samples. Using high-resolution single molecule sequencing, however, any positive sample is detected with digital precision. Thus, according to the invention, genomic samples from a predetermined number of patients (the number of patients does not matter for purposes of the invention) are collected, pooled and sequenced using single molecule sequencing techniques as described above. Single molecule sequencing is done through large tracts of the genome, and mutations derived from any source are detected in the pooled sample. To determine the source of a mutation or mutations, the original collection of individual patient samples is divided in half, re-pooled, and resequenced. This process continues until a unique identification of the affected patient or patients is possible. Due to the rapidity of single molecule sequencing, it is possible to perform multiple sequencing steps in a matter of minutes, hours or days. Using single molecule sequencing, pooled sequences, when compared to a consensus sequence, readily identify losses or amplifications in genomic DNA. All somatic cells will have not only the same sequence but will also be present in the same amounts. Deviations are detected using single molecule sequencing with fewer cells than in bulk sequencing because individual DNA molecules are sequenced instead of an amalgam of cells that typically provide the basis for bulk sequencing assays as, for example, in assays for loss of heterozygosity. In a related embodiment, data from a pooled experiment is useful for determining the frequency and distribution of mutations in a given population, without identifying the owners of specific mutations.
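As a non-limiting illustration of the divide, re-pool and resequence procedure described above, the following Python sketch locates the affected samples by repeated halving. The callable `pool_is_positive` is a hypothetical stand-in for pooling the given samples, sequencing the pool, and checking for the mutation; the function name is likewise an illustrative assumption.

```python
def find_positive_samples(samples, pool_is_positive):
    """Identify the samples carrying a mutation by repeatedly halving pools.

    `samples` is a list of sample identifiers. `pool_is_positive` takes a
    list of identifiers and returns True if the pooled material contains
    the mutation.
    """
    if not samples or not pool_is_positive(samples):
        return []  # nothing to find in an empty or negative pool
    if len(samples) == 1:
        return list(samples)  # a positive pool of one is the affected sample
    mid = len(samples) // 2
    return (find_positive_samples(samples[:mid], pool_is_positive)
            + find_positive_samples(samples[mid:], pool_is_positive))
```

When few samples are positive, most half-pools test negative and are discarded after a single sequencing run, so the number of runs grows far more slowly than the number of patients.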

The rapid results provided by the invention also allow sequencing to detect familial mutations. For example, if it is determined that a patient has a mutation indicative of a cancer, certain forms of which have a strong familial link (e.g., breast cancer, colon cancer), primary siblings typically are not tested unless specified criteria are met. Single molecule sequencing not only identifies the underlying mutation in the primary patient, but allows rapid, cost-effective sequencing of relatives who also might carry the mutation.

The invention is also useful to perform tumor typing. Tumor typing may involve determining a genetic profile for a particular patient's tumor in order to guide treatment or other decisions. For example, the standard treatment for patients with colon cancer is the drug 5-Fluorouracil (5FU). Although 5FU works to reduce tumors in many colon cancer patients, it actually accelerates tumor growth in a class of patients who have Hereditary Non-Polyposis Colorectal Cancer (HNPCC). HNPCC is a familial form of colon cancer with a distinct genetic profile that is ascertainable by sequencing cellular DNA. Thus, to avoid tumor acceleration in potential HNPCC patients, it is particularly important to know a colon cancer patient's genetic profile in order to determine the most effective treatment for that patient. Single molecule sequencing is useful to make that determination because it is rapid, reliable, and effectively digital, and therefore promptly indicates the presence or absence of the relevant genetic event(s). Methods of the invention make possible the rapid and accurate identification of tumor-related mutations, so that an appropriate treatment may be selected or an inappropriate treatment avoided.

8. Expression Analysis

In another aspect, the invention provides gene expression analysis data.

Alteration in expression constructs is often indicative of a change in physiological status. Changes in expression patterns reflect cellular activities as well as disease state. Expression sequence analysis provides insight into the specialized activities of cells from different organs or of different types. Thus, expression analysis reveals aspects of the immune repertoire that are not apparent on a gross level. In one embodiment of the invention, a sequence determination is made with respect to the total antibody repertoire expressed by B-cells. In another embodiment of the invention, a sequence determination is made with respect to the T-cell receptor repertoire expressed by T-cells. Single molecule sequencing offers rapid, high-throughput sequencing that reveals specific detail as to which immune cells are active, and the likely epitopes against which they function. Single molecule sequencing also provides an immune fingerprint that is used to identify an infection based upon the specifics of a patient's immune response. The immune fingerprint generated using single molecule sequencing is compared to a database of collected immune sequence data in order to identify an infection. New infections are tracked through the appearance of new sequence specificities either alone or in combination with other diagnostic techniques. Isolation of immune cells is well-known in the art, and application of the present invention to sequencing a patient's immune cell complement is contemplated by the present invention.
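As a non-limiting illustration of comparing an immune fingerprint to a database, the following Python sketch represents each fingerprint as a set of receptor sequences and scores overlap with the Jaccard index (shared sequences divided by all distinct sequences). The specification does not prescribe a particular comparison measure; the Jaccard index, the function names, and the database layout are assumptions of this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of sequences (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_infection(fingerprint, database):
    """Return (infection name, similarity) for the best database match.

    `database` maps an infection name to the set of immune receptor
    sequences collected for it. Returns None for an empty database.
    Ties are broken by infection name.
    """
    scored = [(jaccard(fingerprint, seqs), name) for name, seqs in database.items()]
    if not scored:
        return None
    score, name = max(scored)
    return name, score
```

A low best score may indicate that the patient's repertoire contains sequence specificities not yet represented in the database.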

9. Epidemiology

In another aspect, the invention also provides epidemiological data. In a preferred embodiment, an appropriate patient sample is obtained and DNA in the sample is sequenced. Optionally, the patient's genomic DNA is excluded. A catalog is compiled comprising a fingerprint of the DNA (or RNA in other preferred embodiments) present in samples obtained from a multiplicity of patients. Each patient's disease status then is correlated with specific sequence information obtained from the patient's sample. In this way, diagnostic accuracy and verifiability are improved, as a patient's disease status is confirmed by comparing the patient's DNA to sequences in the database. As mentioned above, whole genome sequencing is optional. In some circumstances, it is necessary only to sequence sufficient nucleic acid to establish a fingerprint for comparison with future samples. In one embodiment, ubiquitous epidemiology is performed. In this case, patient DNA is routinely sequenced and stored for disease identification and comparison with future samples to identify and track new disease outbreaks. For example, a patient who presents with a new DNA profile (i.e., containing a sequence that is not in the database) may be diagnosed with a new condition. Future patients presenting with the same nucleic acid profile are tracked. In this way, potential epidemic outbreaks are controlled. With respect to new diseases, no a priori assumptions are necessary. A novel sequence will immediately be identified as such, and appropriate monitoring can be put in place.
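By way of illustration only, the flagging of a sequence that is not in the database might be sketched as follows. The specification does not state how novelty is decided; the k-mer comparison, the value of `K`, the `min_shared` threshold, and the function names `build_catalog` and `is_novel` are all assumptions introduced for this sketch.

```python
from typing import Iterable, Set

K = 8  # hypothetical k-mer length; not specified in the text


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_catalog(known_sequences: Iterable[str], k: int = K) -> Set[str]:
    """Pool the k-mers of every previously catalogued sequence."""
    catalog: Set[str] = set()
    for s in known_sequences:
        catalog |= kmers(s, k)
    return catalog


def is_novel(read: str, catalog: Set[str], k: int = K,
             min_shared: float = 0.5) -> bool:
    """Flag a read as novel when too few of its k-mers occur in the catalog.

    The 0.5 threshold is an arbitrary illustrative choice.
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return False  # read shorter than k: too short to judge
    shared = len(read_kmers & catalog) / len(read_kmers)
    return shared < min_shared
```

A read flagged by `is_novel` would then be stored and used for comparison with future samples, as described above; a production system would more plausibly rely on an established alignment tool than on this simplified test.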

10. Pharmacogenomics

In another aspect, the invention provides methods and devices for performing pharmacogenomics. Differences in metabolism of therapeutics can lead to severe toxicity or therapeutic failure by altering the relation between dose and blood concentration of the pharmacologically active drug. Thus, a physician or clinician may consider applying knowledge obtained in relevant pharmacogenomics studies in determining whether to administer a therapeutic agent as well as tailoring the dosage and/or therapeutic regimen of treatment with a therapeutic agent.

Pharmacogenomics deals with clinically significant hereditary variations in the response to drugs due to altered drug disposition and abnormal action in affected persons. See, for example, Eichelbaum, M. et al. (1996) Clin. Exp. Pharmacol. Physiol. 23(10-11): 983-985 and Linder, M.W. et al. (1997) Clin. Chem. 43(2):254-266. In one embodiment, the methods of the invention provide information regarding patient genome sequence which is used to select patients or patient subpopulations for treatment with FDA-approved therapies, e.g., antibody, small molecule or peptide therapies.

11. Hardware and Software Environment

As noted above, the embodiments of the present invention programmatically analyze the results of a single molecule sequencing reaction in order to predict bioagents present in a biological sample. Figures 1-3 discuss aspects of the hardware and software environment utilized by the present invention to perform the bioagent prediction.

Figure 1 depicts an environment suitable for practicing an embodiment of the present invention. A computing device 102 holds a database 104 or other storage structure containing reference sequences 105 and an analysis facility 106. The computing device 102 may be a server, workstation, laptop, personal computer, PDA or other computing device equipped with one or more processors and able to execute the analysis facility 106 discussed herein. The analysis facility 106 is preferably implemented in software although in an alternate implementation, the logic may also be implemented in hardware. The analysis facility 106 operates on and analyzes results of single molecule sequencing reactions 122 that are received from a biological sample acquisition apparatus 120. The biological sample acquisition apparatus 120 conducts single molecule sequencing operations on nucleic acid isolated from a biological sample. In one embodiment, the biological sample acquisition apparatus 120 is a handheld device in wireless communication with the computing device 102. The analysis facility 106 programmatically compares the results of the single molecule sequencing reaction 122 to the reference sequences 105 contained in the database 104 in order to generate a listing of predicted bioagents 144 that are present in the biological sample under consideration. In one implementation, the comparison of the results of the single molecule sequencing operation 122 to the reference sequences 105 in order to predict bioagents present in a biological sample is performed programmatically without any user input. In alternate implementations, the analysis facility 106 prompts a user for parameters controlling the comparison via the user interface 142.
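As a minimal sketch of the comparison performed by the analysis facility 106, assuming exact substring matching in either strand orientation (the specification does not name a matching algorithm), the listing of predicted bioagents 144 might be produced as follows. The function name `predict_bioagents` and the `min_reads` parameter are hypothetical; `min_reads` stands in for a user-supplied parameter controlling the comparison.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def predict_bioagents(reads: Iterable[str],
                      references: Dict[str, str],
                      min_reads: int = 1) -> List[Tuple[str, int]]:
    """Count reads found in each reference sequence, in either orientation.

    Returns (bioagent name, supporting read count) pairs, most supported
    first, keeping only bioagents supported by at least min_reads reads.
    """
    refs = {name: seq.upper() for name, seq in references.items()}
    hits: Counter = Counter()
    for read in reads:
        fwd = read.upper()
        rev = reverse_complement(fwd)
        for name, ref in refs.items():
            if fwd in ref or rev in ref:
                hits[name] += 1  # one count per read per reference
    return [(name, n) for name, n in hits.most_common() if n >= min_reads]
```

Calling `predict_bioagents` with the default `min_reads` corresponds to the implementation that runs without user input; passing a different value corresponds to the alternate implementation in which the user configures the comparison. Exact matching is used here only for brevity and would not tolerate sequencing errors.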

The listing of the predicted bioagents 144 may be displayed to a user via a user interface 142 displayed on a display device 140 that is in communication with the computing device 102. It will be appreciated that the listing of predicted bioagents 144 may also be stored for later use and/or display to a user. The user interface 142 may also be utilized to enable a user to configure the parameters of the comparison operation performed by the analysis facility 106. Those skilled in the art will recognize that many other configurations are also possible within the scope of the present invention.

Figure 2 depicts an alternative distributed environment 200 suitable for practicing an embodiment of the present invention. A first computing device 202 may be used to execute an analysis facility 204. The first computing device 202 may communicate over a network 250 with a second computing device 210 holding reference sequences 212. The network 250 may be the Internet, a local area network (LAN), a wide area network (WAN), an intranet, an internet, a wireless network or some other type of network over which the first computing device 202 and the second computing device 210 can communicate. The analysis facility 204 on the first computing device 202 may communicate over the network 250 with a biological sample acquisition apparatus 230 that generates results data 232 from a single molecule sequencing reaction performed on nucleic acid isolated from a biological sample. The analysis facility 204 may store a listing of predicted bioagents that is generated by a comparison of the results of the single molecule sequencing reaction and the reference sequences 212. The storage may occur on the first computing device 202 or at a location remote from the first computing device that is accessible over the network 250. Alternatively, the listing of predicted bioagents may be displayed to a user. It should be recognized that Figure 2 depicts only a single distributed configuration and many other distributed configurations are possible within the scope of the present invention.

Figure 3 is a flowchart of a sequence of steps that may be followed by an embodiment of the present invention to predict bioagents present in a biological sample. The sequence begins by providing a biological sample (step 302). The sample may be a previously acquired sample or may be a sample that is obtained immediately in advance of the bioagent prediction process that is discussed herein being performed. Nucleic acid is then isolated from the biological sample (step 304) and a single molecule sequencing reaction is conducted on the isolated nucleic acid (step 306) as discussed above. The results of the single molecule sequencing reaction are compared to reference sequences (step 308) and a listing of predicted bioagents that are present in the biological sample is generated. The listing of predicted bioagents may then be displayed to a user or stored for later retrieval (step 310).
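The sequence of steps 302 through 310 might be orchestrated in software roughly as sketched below. This is an illustrative outline only: the isolation and sequencing steps are physical processes, so they are represented here by caller-supplied functions, and the name `run_bioagent_prediction`, its parameters, and the exact-substring comparison are assumptions not found in the specification.

```python
from typing import Any, Callable, Dict, List


def run_bioagent_prediction(
    sample: Any,                                # step 302: sample provided
    isolate: Callable[[Any], Any],              # stand-in for step 304
    sequence: Callable[[Any], List[str]],       # stand-in for step 306
    references: Dict[str, str],                 # bioagent name -> sequence
    report: Callable[[List[str]], None],        # display or store, step 310
) -> List[str]:
    """Run the steps of Figure 3 in order and return the predicted bioagents."""
    nucleic_acid = isolate(sample)              # step 304
    reads = sequence(nucleic_acid)              # step 306
    # Step 308: a bioagent is predicted when any read occurs in its reference.
    predicted = sorted(
        name for name, ref in references.items()
        if any(read in ref for read in reads)
    )
    report(predicted)                           # step 310
    return predicted
```

Passing the instrument-facing steps in as functions keeps the ordering of Figure 3 explicit while leaving each step replaceable, which matches the statement that the sequence of steps may be altered without departing from the scope of the invention.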

Embodiments of the present invention may be provided as one or more computer-readable programs embodied on or in one or more mediums. The mediums may be a floppy disk, a hard disk, a compact disc, a digital versatile disc, a flash memory card, a PROM, an MRAM, a RAM, a ROM, or a magnetic tape. In general, the computer-readable programs may be implemented in any programming language. Some examples of languages that can be used include FORTRAN, C, C++, C#, Python, Perl or Java. The software programs may be stored on or in one or more mediums as object code. Hardware acceleration may be used and all or a portion of the code may run on an FPGA, an Application Specific Instruction-set Processor (ASIP), or an Application Specific Integrated Circuit (ASIC). The code may run in a virtualized environment such as in a virtual machine. Multiple virtual machines running the code may be resident on a single processor.

Since certain changes may be made without departing from the scope of the present invention, it is intended that all matter contained in the above description or shown in the accompanying drawings be interpreted as illustrative and not in an exclusive sense. Practitioners of the art will realize that the sequence of steps and architectures depicted in the figures may be altered without departing from the scope of the present invention and that the illustrations contained herein are singular examples of a multitude of possible depictions of the present invention.

12. Other Applications of the Technology of the Invention

In another aspect, the methods and devices disclosed herein can be used to screen fetal mRNA or DNA, present in maternal blood, for disease-associated mutations as an alternative to amniocentesis.

In another aspect, the invention provides nucleic acid sequence information for making diagnostic kits, or chips. In another aspect, the nucleic acid sequence information or methodology disclosed herein can be used for forensic applications.

In another aspect, the methods and devices disclosed herein can be used for research purposes, for example genetic research on the distribution or migration of human populations.

In another aspect, the methods and devices disclosed herein can be used in paleontology, for example to identify and catalogue nucleic acid sequences contained in ancient biological samples.

In another aspect, the methods and devices disclosed herein can be used for environmental analysis to determine the bioagent profile of a particular ecosystem.

In another aspect, the methods and devices disclosed herein can be used in agriculture. In one embodiment the methods and devices of the invention are used to determine the bioagent profile of soil. In another embodiment the methods and devices of the invention are used to determine the nucleic acid sequences present in plant samples and thereby assess whether they have been infected with disease-causing bioagents or have been modified by genetic engineering.

In another aspect, the methods and devices disclosed herein are used to determine genetic fingerprinting information about a subject and thereby uniquely identify them. In another aspect, the invention provides business methods for commercializing nucleic acid sequences suitable for use in, for example, the making of devices, diagnostic chips, kits, networks, and pharmaceuticals for diagnosing and treating disease.

Exemplification

Throughout the examples, the following materials and methods were used unless otherwise stated.

Materials and Methods

In general, the practice of the present invention can employ, unless otherwise indicated, conventional techniques of chemistry, molecular biology, recombinant DNA technology, PCR technology, immunology, cell culture, and any necessary computer or electronic related technology that are within the skill of the art and are explained in the literature. See, e.g., Sambrook, Fritsch and Maniatis, Molecular Cloning: Cold Spring Harbor Laboratory Press (1989); DNA Cloning, Vols. 1 and 2, (D.N. Glover, Ed. 1985); Oligonucleotide Synthesis (M.J. Gait, Ed. 1984); PCR Handbook Current Protocols in Nucleic Acid Chemistry, Beaucage, Ed. John Wiley & Sons (1999) (Editor); Oxford Handbook of Nucleic Acid Structure, Neidle, Ed., Oxford Univ Press (1999); PCR Protocols: A Guide to Methods and Applications, Innis et al, Academic Press (1990); PCR Essential Techniques: Essential Techniques, Burke, Ed., John Wiley & Son Ltd (1996); The PCR Technique: RT-PCR, Siebert, Ed., Eaton Pub. Co. (1998).

EXAMPLE 1

A METHOD FOR DETECTING A BIOAGENT IN A PATIENT SUSPECTED OF HAVING CONTRACTED AN INFECTIOUS BIOAGENT-INDUCED DISEASE

The following example describes a novel method for determining the presence of an unknown bioagent in a patient sample. A patient sample is obtained. If necessary, several types of sample encompassing all potential areas of infection may be obtained and processed together.

Nucleic acid is extracted from the sample(s) using art-recognized means. Single-nucleotide sequencing of the nucleic acid is then performed to determine the sequences of nucleic acids present in the sample. The deduced nucleic acid sequence data is compared against databases of known nucleic acid sequences, for example, using a mathematical algorithm, and the percentage identity with known sequences is reported.
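The example does not name the mathematical algorithm used to obtain the percentage identity. As one hedged illustration, the sketch below uses a basic Needleman-Wunsch global alignment and reports identity as matching columns divided by total alignment columns; the function name `percent_identity` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for this sketch.

```python
def percent_identity(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> float:
    """Percent identity of a and b over one optimal global alignment."""
    a, b = a.upper(), b.upper()
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the corner, counting alignment columns and matches.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                matches += 1
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns
```

For instance, `percent_identity("ACGT", "ACGA")` evaluates to 75.0, since three of the four aligned positions agree. Database-scale comparison would in practice use a heuristic search tool rather than a full dynamic-programming alignment against every known sequence.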

From these sequence identities all bioagents present in the sample are deduced and known infectious bioagents identified.

EXAMPLE 2

A DEVICE FOR DETECTING A BIOAGENT IN A PATIENT SUSPECTED OF HAVING CONTRACTED AN INFECTIOUS BIOAGENT-INDUCED DISEASE

The following example describes a novel device for determining the presence of an unknown bioagent in a patient sample.

A patient sample is obtained. If necessary, several types of sample encompassing all potential areas of infection may be obtained and processed together. The sample is contacted with a device which proceeds to perform the following steps in an integrated operation:
a) isolate nucleic acid from the sample in sufficient purity to perform single-nucleotide sequencing;
b) perform single-nucleotide sequencing to determine the sequences of all nucleic acids isolated from the sample;
c) compare the deduced nucleic acid sequences against databases of known nucleic acid sequences using, for example, a mathematical algorithm to determine the percentage identity of the deduced nucleic acid with known sequences;
d) report the best sequence matches for all known (infectious) bioagents.

Equivalents

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

Claims

What is claimed is:

1. A method for identifying a bioagent(s) present in a biological sample, the method comprising the steps of: isolating nucleic acid from a biological sample; conducting a single molecule sequencing reaction on the plurality of nucleic acid in said sample; comparing nucleic acid sequences obtained in said conducting step to one or more reference sequences contained in a database to predict the bioagent(s) present in the sample.

2. The method of claim 1, wherein said biological sample is any form of matter containing nucleic acid.

3. The method of claim 1, wherein said biological sample is blood or another bodily derived fluid.

4. The method of claim 1, wherein said biological sample is obtained from tissue.

5. The method of claim 1, wherein said biological sample is obtained from soil.

6. The method of claim 1, wherein said biological sample is suspected to contain a hazardous bioagent.

7. A device capable of performing the method of claim 1.

8. The method of claim 1 wherein one or more steps are carried out programmatically.

9. A method for detecting nucleic acids indicative of a disease state in a sample, the method comprising the steps of: isolating nucleic acid from a biological sample suspected to contain a nucleic acid that would not be expected to be present in the sample if the individual from whom it was obtained were healthy; conducting a single molecule sequencing reaction on nucleic acid in said sample; and comparing nucleic acid sequences obtained in said conducting step to one or more reference sequences that represent nucleic acids that are not expected to be present in a sample obtained from a healthy individual, thereby identifying nucleic acids in said sample that are indicative of a disease state.

10. The method of claim 9, wherein said biological sample is blood or another bodily-derived fluid.

11. The method of claim 9, wherein said biological sample is obtained from tissue.

12. The method of claim 9, wherein said reference sequences represent a mutation that is indicative of cancer or precancer.

13. The method of claim 9, wherein said reference sequences represent an infectious disease agent.

14. The method of claim 9, wherein said heterogeneous sample comprises nucleic acid derived from multiple cell types.

15. The method of claim 9, wherein said mutation is a mutation or an insertion or a deletion.

16. The method of claim 9, wherein said biological sample is maternal blood.

17. The method of claim 9, wherein said reference nucleic acid is fetal DNA or RNA.

18. The method of claim 9, wherein said comparing step identifies the presence of nucleic acids derived from multiple organisms in a pooled sample.

19. A device capable of performing the method of claim 9.

20. The method of claim 9, wherein one or more steps are carried out programmatically.

21. A method for identifying a patient or patient subpopulation amenable to therapy wherein the patient or patient subpopulation is first identified as in need of such therapy according to the methods of any one of the preceding claims.

22. A physical medium holding computer-executable instructions for predicting bioagents present in a biological sample, the medium comprising: instructions for receiving at least one result of a single molecule sequencing reaction conducted on nucleic acid in a biological sample; and instructions for comparing at least one nucleic acid sequence obtained in said at least one result to one or more reference sequences contained in a database to predict at least one bioagent present in the biological sample.

23. The medium of claim 22 wherein the medium further comprises instructions for receiving parameters controlling said comparing from a user prior to performing said comparing.