WO2002029101A2

WO2002029101A2 - Methods for cloning, sequencing and expressing genes from complex microbial populations

Info

Publication number: WO2002029101A2
Application number: PCT/US2001/030343
Authority: WO
Inventors: William J. Coleman; Michael A. Tanner; Edward J. Bylina; Douglas C. Youvan
Original assignee: Kairos Scientific Inc.
Priority date: 2000-09-30
Filing date: 2001-09-28
Publication date: 2002-04-11
Also published as: WO2002029101A3; AU2001296365A1

Abstract

Methods are described for obtaining functional gene and regulatory sequences from complex populations of microorganisms, termed a 'plethogenome'. These methods are based on random sequencing of genetic material isolated from an ensemble of organisms, followed by computer analysis to assign a putative function to each sequence. These assignments can be confirmed by subsequent activity screening. The plethogenomic sequence database and corresponding clone library can be used to discover and manufacture new industrial enzymes and pharmaceuticals.

Description

TITLE OF INVENTION

Methods for Cloning, Sequencing and Expressing Genes from Complex Microbial Populations

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 60/237,547, filed September 30, 2000, the entire contents of which are hereby incorporated by reference for all purposes.

BACKGROUND OF THE INVENTION

The invention relates to the field of molecular biology and more specifically to gene discovery.

Efforts to develop improved industrial biocatalysts could be greatly enhanced by accelerating the rate at which new microbial enzymes are discovered. Microbes also offer a highly diverse source of pharmaceutical compounds (Bull et al, 2000). Although the diversity of species in the environment is known to be vast, it has been difficult in the past to access this richness without first cultivating the organisms (Hunter-Cevera, 1998). Up to now, microbial genome sequencing projects have focused on complete sequencing of single species, and almost all of these are culturable organisms. Estimates of microbial diversity on earth indicate that greater than 99% have never been cultured, and the vast majority have never been identified (Pace, 1997; Short, 1997; Torsvik et al, 1998; Tiedje & Stein, 1999; Bull et al, 2000).

Despite the obvious attractiveness of environmental microorganisms as a resource for biotechnology, tools for efficiently exploiting their genomic diversity have not been available. As recently as 1998, a review of the technology for analyzing complex microbial populations concluded that "Total DNA from complex microbial communities...contain [sic] too much information to be analysed directly by high resolution methods" (Torsvik et al, 1998). The desire on the part of researchers to develop the genomics toolbox by first investigating well-known organisms is understandable from the standpoint of practicality, and the heavy focus on the study of discrete genomes has provided an enormous amount of useful information.

Early efforts at sequencing genomic DNA from environmental samples focussed on using the polymerase chain reaction (PCR) to amplify particular genes. For example, 16S small subunit rRNA genes are often targeted by environmental microbiologists using universal primers. The mixed products are then cloned and sequenced to determine the diversity of species present. This analysis is extremely powerful for conducting a 'census' of species in a given sample (Head, 1999; Dojka et al, 2000). However, PCR sequencing alone has only limited value for analyzing genomes of entire populations of uncultured organisms because it requires knowledge about the exact sequence of the targeted genes prior to amplification. The methods of the present invention are capable of analyzing much larger segments of the various genomes because they rely on recovering random fragments of those genomes and are not targeted to specific genes.

Other methods directed toward more extensive genomic analysis of microbial populations have also been developed. These methods always include pre-screening of the genomic clone libraries. The pre-screening is designed to detect specific enzyme activities, ribosomal RNA (rRNA) sequences or other targeted sequences (e.g., by probe hybridization). The information obtained from the screening of individual clones is essential for reducing the complexity of the sequencing effort and limiting it to clones that contain desired genes.

Among the methods that can be used to access microbial genetic diversity by activity screening are those that have been previously described in U.S. Pat. No. 5,958,672, U.S. Pat. No. 6,054,267 and in publications by Handelsman et al. (1998) and Rondon et al. (1999, 2000). In these methods, large nucleic acid fragments (typically genomic DNA) are isolated from the microbial population and shotgun-cloned into expression vectors. These vectors are used to transform a heterologous host to create an expression library. Members of the expression library are screened for the presence of a particular desired activity, such as lipid hydrolysis. Cells containing the desired activity are then isolated, and their plasmid DNA is analyzed in order to determine the sequence of the gene encoding the identified activity.

Although activity screening is a powerful method for rapidly identifying genes of interest, the results are necessarily limited by the nature of the screening method that is chosen. For example, an enzyme screening assay designed to identify lipase activity within the library will probably be blind to the presence of genes encoding glycosidase activity. Moreover, if a gene contained in the library produces an enzyme that acts only on other proteins or on exotic substrates, it may be difficult to devise an effective screening assay to identify such a gene. Similarly, it can be difficult to develop an activity screening protocol that is broad enough in scope to identify other types of genes or gene-related sequences. These include:

(1) structural proteins;

(2) receptors;

(3) transport proteins;

(4) binding proteins;

(5) multi-subunit protein complexes derived from different operons;

(6) incomplete gene fragments;

(7) genes that are toxic to the heterologous host when fully expressed;

(8) gene products that do not fold or assemble properly in the heterologous host;

(9) gene products that interact unfavorably with gene products from the heterologous host genome;

(10) gene products that are not post-translationally modified by the heterologous host, or that are improperly modified;

(11) gene products that are also expressed by the heterologous host (creating high background interference); and

(12) regulatory elements.

A slightly different approach is to shotgun-clone DNA from a microbial population to create a library and then screen the clones for members that contain sequences from a desired taxonomic group. Typically, amplification and sequencing of a 16S rRNA gene that may be present on some cloned fragments is used to determine the phylogeny of the specific organism from which the clone originated. The presence of a 16S sequence in a clone library can be determined by hybridization with a labeled probe. This pre-screening by PCR or hybridization is used to identify a small number of clones that are likely to be the most promising candidates for sequencing. This method for cloning and sequencing microbial DNA from the environment is described by Stein et al. (1996), Hughes et al. (1997), Vergin et al. (1998), Millikan et al. (1999), Beja et al. (2000) and in U.S. Pat. Nos. 6,030,779 and 6,054,267. In this method, genomic DNA is isolated from a population of environmental microorganisms by, for example, detergent/lysozyme treatment and embedded in agarose plugs. The DNA is partially digested with a restriction endonuclease and cloned into a fosmid (Kim et al., 1992) or other suitable vector, generating a library of clones containing inserts of approximately 40 kbp (kilobase pairs) each. Fosmid clones are chosen for sequencing by screening the fosmid DNA clone library with multiplexed PCR (using, for example, archaeon-biased ssu rRNA-targeted primers) or by filter hybridization using a labeled rDNA probe for a specific taxon (e.g., the domain Archaea). Desired fosmid clones are then subcloned and partially sequenced.

As in the activity-screening method described above, the hybridization/PCR-screening method evaluates the desirability of individual members of the library before they are sequenced. In this case, however, the criterion used for evaluation is the similarity of the member's DNA sequence to a previously known or isolated sequence. This approach makes it easier to isolate clones from a targeted taxonomic group or to isolate a homolog to a known gene without performing large amounts of sequencing. However, the limitation of this approach is that it is highly biased toward genes and organisms that have been previously characterized. Genomic fragments from completely novel organisms are therefore less likely to be found.

Methods are needed, however, for analyzing complex ensembles of genomes, including genomes from organisms that have not been isolated and that are difficult or impossible to culture using existing methods, for the purpose of very large-scale gene discovery. The present invention provides for these and other advantages. According to the methods provided by the invention, the discovery process involving these complex ensembles can now start directly with cloning and sequencing, bypassing an initial activity or hybridization screen. The methods of the present invention make it cost-effective to perform direct cloning and sequencing on such material.

The present invention provides methods for obtaining large amounts of functional genomic sequence from a complex population of microorganisms, particularly those which are uncultivated. We have coined a new term for this approach, 'plethogenomics' (from the Greek plethos, meaning 'multitude' or 'population'), to distinguish it from other genomic methods. Rather than focusing on a single genome at a time, which requires isolating and culturing individual organisms, plethogenomic sequencing exploits the fact that some environmental samples contain an extremely rich diversity of gene sequences, including those encoding synthetic and catabolic enzymes. Thus, sequencing a heterogeneous population of genomes as if they were a single super-genome yields a wealth of exotic genes and, by expression cloning, their encoded proteins. Potentially useful open reading frames or other sequences can be identified by automated sequence comparison. Clones containing these genes are then expressed and screened for biological activity. This method has the advantage that large clusters of genes can be accessed, as well as sequences of multi-subunit enzymes.

The methods described in the present invention exploit recent advances in sequencing throughput, microbial genomics and data analysis, as well as continuing rapid reductions in the cost of sequencing, to overcome the limitations and biases that are inherent to both the activity-screening and hybridization/PCR-screening approaches. Plethogenomic sequencing relies on direct analysis of the entire population of genomes present in the sample. This means that multiple genomes can be analyzed simultaneously. Open reading frames (ORFs) and other fragments are picked from the library and functionally identified only after they have been sequenced and analyzed by software algorithms. Desired sequences are then cloned and expressed, and their function is characterized in one or more host organisms. The advantage of this random sequencing method is that it provides a significantly less biased means to find novel sequences. These new gene sequences in turn yield proteins and enzymes with properties that are substantially different from those that have been previously discovered.

SUMMARY OF THE INVENTION

The present invention provides methods for identifying and isolating useful DNA sequences from complex populations of microorganisms by plethogenomic sequencing. The invention provides a method for determining functional genomic sequences from a population of organisms. For DNA processing, which is preferred for bacteria and archaea, this involves the steps of: (a) extracting DNA fragments from a complex population of organisms; (b) cloning the DNA fragments into a vector, and without prior characterization of the DNA fragments by activity screening of an expression product or by hybridization screening with an oligonucleotide probe; (c) sequencing a plurality of the cloned DNA fragments; and (d) analyzing the sequences to identify functional regions. For messenger RNA (mRNA) processing, which is preferred for eucarya, this involves the steps of: (a) extracting mRNA fragments from a complex population of organisms; (b) synthesizing cDNA from the mRNA fragments; (c) cloning the cDNAs into a vector, and without prior characterization of the DNA fragments by activity screening of an expression product or by hybridization screening with an oligonucleotide probe; (d) sequencing a plurality of cloned cDNA fragments; and (e) analyzing the sequences to determine their function.

A 'complex population' is a collection of numerous organisms, which can be classified into unique species. Each unique species contributes no more than about 10%, and preferably no more than about 1%, to the total number of species in the population. The extracted DNA and mRNA fragments that are processed for plethogenomic sequencing may be considered to be 'random' because no effort is made to extract specific sequences, such as could be done by probe hybridization or PCR, from the population of organisms. Utilization of the method begins with selection of an appropriate source of microorganisms, typically from an environmental sample, to provide the genetic material for cloning. Sources which do not possess the desired genetic composition can be enhanced by selective enrichment, fractionation or sorting. Before further processing is attempted, the source of genetic material can be analyzed to determine its species entropy, which is a measure of the taxonomic diversity and evenness. Samples with sufficiently high entropy are favored, since this characteristic tends to maximize the chance of finding novel sequences.

Once an appropriate source is selected, nucleic acids (genomic DNA, mRNA, or plasmids) are extracted from the sample and purified. Random genomic DNA or synthesized cDNA fragments are shotgun-cloned into vectors, such as artificial chromosomes, to create a stable library. The members of the library (containing inserts as large as several hundred kbp) can be sequenced directly, or they can be divided further into smaller fragments to create a subclone library that is more amenable to high throughput sequencing. The individual sequences are electronically assembled into larger contiguous segments (contigs) based on sequence overlap.
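As a minimal sketch of what assembly 'based on sequence overlap' can mean, the following Python toy merges reads greedily by their longest exact suffix/prefix overlap. The function names, the `min_overlap` threshold and the example reads are hypothetical and are not taken from the present description; practical assemblers additionally handle sequencing errors, reverse complements and repeats, none of which is attempted here.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No remaining pair overlaps by at least min_overlap bases.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical reads: three overlap into one contig, one stays unassembled.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTT"]
print(sorted(greedy_assemble(reads)))
```

With these illustrative reads, the three overlapping fragments collapse into a single contig while the unrelated fragment is left as a singleton.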

After sequencing errors and ambiguities are corrected, the finished sequence is analyzed by a number of software algorithms to identify important structural and functional features, such as open reading frames, genes, promoters and the like. Putative functional and structural assignments can be made by comparing the unknown sequences with those from various databases and by analyzing groups of similar sequences obtained from the plethogenome. These assignments can be confirmed by expressing the genes or sequences in a heterologous host organism and screening the sequences or their gene products for biological activity. High-throughput structural analysis can also be performed. The sequences, along with their corresponding structure/function information, are used to construct an annotated plethogenomic database. Information from the database is used to retrieve desired clones from an archival library for subsequent cloning, expression, or genetic engineering. These manipulations may include development of new vectors, expression of useful enzymes, mutagenesis and directed evolution of proteins, engineering of metabolic pathways, and construction of artificial genomes.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a sequence analysis of clones from black mud organisms. Sequences were analyzed by BLASTX for percent similarity to known sequences in the 'nr' database and then grouped by functional category.

Fig. 2 is a plot of the distribution of sequenced black mud clones based on percent similarity.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention describes methods for plethogenomic sequencing, defined herein as comprising random sequencing and analysis of genomes derived from a heterogeneous population of organisms for the purpose of finding functional sequences. The genes derived from the resulting sequences are useful for the production of new industrial enzymes, antibiotics and other pharmaceuticals.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods and material are described below.

All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In the case of conflict, the present application, including definitions, will control. In addition, the materials, methods, and examples described herein are illustrative only and are not intended to be limiting.

Other features and advantages of the invention will be apparent from the following detailed description, the drawings, and from the claims.

1. Sources of Genetic Diversity

Highly diverse populations of microorganisms can be obtained from a number of sources. An example of a diverse environmental source is a sulfide-rich marine sediment (black mud) community commonly found in estuarine environments (Dojka et al, 2000). Another example is a biofilm, such as a microbial mat (Paerl & Pinckney, 1996). Bioremediation sites have also been found to contain a great variety of uncultivated species. Diverse communities may also be found as symbionts or commensals inside or in association with larger organisms (Moran & Baumann, 2000; Moran & Wernegreen, 2000; Hentschel et al, 2000). Examples of symbiotic communities include those found in the hindgut of termites, the rumen of herbivores, the oral cavity of animals, earthworm casts, the interior of hydrothermal vent tubeworms, in mycorrhizal associations with plants and on the surfaces of marine and terrestrial plants. Microbial communities that contain large numbers of unculturable species are desirable, since these are likely to have the most novel gene sequences. Estimates of global microbial diversity suggest that as many as 99.9% or more of microbial species have not been identified (Pace, 1997; Short, 1997). However, before creating a plethogenomic library using a sample from a given microbial population, it is often desirable to measure the genetic entropy of the sample in order to confirm that it possesses a minimum level of diversity. Otherwise, random sequencing of the genomic fragments will probably not yield the maximal amount of novel sequence. Algorithms and methods for determining the entropy value are described in a later section.

2. Selective Enrichment

In an alternative embodiment, it may be advantageous to selectively enrich or perturb the initial source population by physical or chemical treatments (Holt & Krieg, 1994). The goal is to improve the relative abundance of species that are likely to possess a desired gene or metabolic pathway. For example, a microbial sample can be subjected to a period of elevated temperature to select for thermophilic species or reduced temperature to select for psychrophiles. Methods for culturing extremophiles are reviewed in Rinker et al (1999). Other physical treatments include the addition of particles of various sizes and compositions for adherent species, agitation of liquid or semi-liquid samples, high pressure treatment, exposure to light, exposure to high-energy radiation, and the like. The population can also be exposed to an exogenous carbon source such as cellulose to enrich for species that possess cellulolytic enzymes. Other chemical treatments include changes in mineral nutrients or cofactors, atmospheric composition, redox potential, osmolarity and pH. Likewise, one may add chemicals such as quorum-sensing agents, antibiotics or non-aqueous solvents. Selection can also be done crudely by creating conditions wherein some of the cells will lyse (e.g., by changing the osmotic strength or adding detergents). Intact cells can be removed by centrifugation. It may also be desirable in some cases to derive the genomic material from a population that has been partially cultured in a laboratory microcosm (e.g., a Winogradsky column) to deliberately 'shape' its genetic profile through complex interactions among community members. Microcosm enrichment and other selective methods are described in Holt & Krieg (1994), Wagner-Dobler et al. (1998), Hunter-Cevera & Belt (1999), and Tiedje & Stein (1999).

3. Fractionating and Sorting a Microbial Population Prior to DNA Harvesting

Physical separation of the desired organisms by dispersion and fractionation is another optional method that can be employed to adjust the species composition of the genomic DNA that is harvested from a microbial population in a liquid medium or suspension (Holt & Krieg, 1994). A useful method for obtaining genomic source material is through bioprospecting (Watts et al, 1999). The simplest method for dispersing and fractionating soil bacteria is to homogenize the sample in a Waring blender and then to pellet the soil particles, plant material, protists and fungi by low-speed centrifugation. A second, high-speed centrifugation is then used to pellet the bacteria. Another type of fractionation comprises filtering the cells through porous membranes or columns of coarse particles (which can also be used to separate adherent from non-adherent species). Filtration/capture can also be accomplished by affinity or size-exclusion chromatography. Affinity capture can be coupled with magnetic bead technology to aid in removing particular cell types from the medium (Porter & Pickup, 1999). Density gradient centrifugation using, for example, Percoll and sucrose can be used to separate microbial cells from other material (Guerrero et al, 1985; Kubitschek, 1987; Putzer et al, 1991). Species that prefer to form biofilms can also be separated out by allowing them to adhere to a solid surface (Fletcher, 1999), such as a modified Robbins device or flow cell, which can be removed from the medium for harvesting (Hall-Stoodley et al, 1999). Methods for separating whole bacterial cells using isoelectric focusing (Jaspers & Overmann, 1997) and dielectrophoresis (Pimbley et al, 1999) have also been described.

Cells can also be sorted individually using fluorescence activated cell sorting (FACS; Porter, 1999a). In addition to separating classes of cells based on their light scattering properties, a FACS instrument can sort microbial cells that have been selectively labeled with one or more fluorescent dyes. Thus, cells can be sorted based on fluorescence emission wavelength and intensity by choosing the appropriate laser excitation wavelength and emission windows. Fluorescent labeling may comprise direct labeling with reactive dyes (e.g., N-hydroxysuccinimide esters to label amino groups on the cells or maleimide to label thiol groups). Indirect labeling may comprise modifying molecules on the external surfaces of susceptible cells with a chemical or enzymatic reagent and then labeling the derivatized molecules with a reactive fluorescent dye. Thus, for example, cells having exposed galactosyl residues on their surface can be treated with galactose oxidase, which introduces an aldehyde group. This group can then be reacted with fluorescein hydrazide (available from Molecular Probes, Eugene, OR), for example, to label the cell surface. After washing out the unbound dye, the cells can be sorted based on their fluorescence emission at about 520 nm. Other enzyme/dye combinations can be used to create multiplexed labels. A second indirect labeling method uses fluorescently-labeled antibodies to identify particular strains. Such immunofluorescence staining is well known in the art (Lam & Mutharia, 1994). A third indirect labeling method uses dyes that are differentially absorbed by various cells. Such dyes may include live/dead viability stains or dyes that sense transmembrane redox gradients (Porter, 1999b). These dyes and stains are available from Molecular Probes (Eugene, OR). A fourth indirect labeling method employs fluorogenic enzyme substrates (e.g., the ELF97-palmitate substrate from Molecular Probes) to differentially label cells based on their enzyme activity. Flow-based assays of microbial cells using fluorogenic substrates have been described previously (Wittrup & Bailey, 1988; Chung et al., 1995).

4. Entropy Analysis of Species Diversity

Given that the goal of plethogenomic sequencing is to maximize the number of novel genes obtained per sequencing run, it is very important to maximize both the species diversity and the evenness of the distribution of species in the microbial source population. This ensures that the rate of discovery of new sequences is maximized while the cost is minimized. It is therefore preferable in the present method to minimize duplication of genomes in the source material. This should not be confused with redundant sequencing of clones at the later stage of subclone library analysis, where redundancy may be important for resolving sequence ambiguities. It is also important to note that it is sometimes desirable to maximize the diversity within a taxonomic group (e.g., the domain Archaea), as described above, rather than to harvest all of the microorganisms present in the source material.

An effective method to calculate an index that describes both the diversity and evenness of the population uses mathematics from statistical mechanics to calculate an entropy value. This approach has also been used in population genetics and ecology to study complex groups of macroscopic organisms. For microorganisms, the species composition of a population can be conveniently measured by ribosomal RNA (16S/18S) gene sequence analysis. This type of population analysis comprises the following steps:

(1) isolate genomic DNA from the population (see below);

(2) PCR amplify part or all of a gene for a ribosomal RNA subunit (e.g., the 16S/23S rRNA gene for archaea and bacteria or the 18S/28S for eucarya) with oligonucleotide primers;

(3) clone the amplified fragments into a plasmid vector;

(4) transform E. coli cells with the library of cloned fragments;

(5) plate the cells on nutrient agar and allow them to form single colonies;

(6) pick a representative number of isolated colonies;

(7) culture each clone;

(8) isolate plasmid DNA from each culture;

(9) determine the sequence of the amplified and cloned fragments using appropriate sequencing primers;

(10) compare the cloned sequences to a database of known rRNA gene sequences; and

(11) determine the number of different species present and their phylogenetic distribution.

Oligonucleotide primers can be designed for specific amplification of bacterial or archaeal 16S genes and some groups of eucaryal 18S. For example, the partially degenerate universal primer 1492-R (5'-GGYTACCTTGTTACGACTT) (SEQ ID NO: 1) can be used in combination with the bacteria-specific 27F primer (5'-AGAGTTTGATCMTGGCTCAG) (SEQ ID NO: 2) and the partially degenerate archaea-specific primer 21F (5'-TTCCGGTTGATCCTGCCGGA) (SEQ ID NO: 3) to estimate the presence or absence of bacteria and archaea, respectively. In addition, because the eucaryal 18S rRNA gene is larger than the bacterial or archaeal 16S rRNA genes, PCR amplification of the 18S gene yields a DNA fragment that migrates more slowly than the 16S product during agarose gel electrophoresis. Thus, by quantitating the relative amounts of each band in the gel, it is possible to rapidly obtain a crude estimate of the relative abundance of bacteria/archaea and eucarya in the microbial source population prior to cloning. However, it is also necessary to take into consideration the fact that archaea (which typically have two or fewer copies of the small subunit RNA gene per cell) and bacteria (which have an average of about 3.8 copies per cell) are very different from eucarya (which may have as many as 300-400 copies per cell). Thus, methods have been devised to normalize population estimates for genome size and small subunit rDNA copy number (Farrelly et al, 1995; Fogel et al., 1999).
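The effect of copy number on such estimates can be illustrated with a short Python sketch. It is only a simplified illustration, not the normalization procedure of the cited references: it divides a per-domain amplicon signal by an assumed rDNA copy number and renormalizes. The dictionary `SSU_COPIES_PER_CELL`, the function name and the input signal values are hypothetical; the copy numbers are the figures quoted above, with the midpoint of 300-400 assumed for eucarya, and genome size and PCR bias are ignored.

```python
# Small subunit rDNA copies per cell, taken from the figures quoted in the text.
# The eucaryal value is the midpoint of the 300-400 range (an assumption).
SSU_COPIES_PER_CELL = {"archaea": 2.0, "bacteria": 3.8, "eucarya": 350.0}


def relative_cell_abundance(amplicon_signal):
    """Convert per-domain rDNA signal into estimated fractions of cells."""
    per_cell = {
        domain: signal / SSU_COPIES_PER_CELL[domain]
        for domain, signal in amplicon_signal.items()
    }
    total = sum(per_cell.values())
    return {domain: value / total for domain, value in per_cell.items()}


# Hypothetical band intensities (arbitrary units) from a gel.
signal = {"archaea": 10.0, "bacteria": 38.0, "eucarya": 350.0}
for domain, fraction in relative_cell_abundance(signal).items():
    print(f"{domain}: {fraction:.4f}")
```

In this made-up example the eucaryal band dominates the raw signal, yet after correction for copy number eucarya account for only a small fraction of the estimated cells.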

Detailed methods for performing this RNA-based population analysis can be found in a review by Head (1999). An advantage of sequencing ribosomal RNA genes is that the information can be used to perform detailed phylogenetic analyses of the population. These analyses are useful if one wants to be certain of including particular taxonomic groups in the plethogenomic sequencing (based on, for example, prior knowledge of the metabolism of previously characterized species). Alternatively, there are also lower-resolution methods for determining the number of unique species in the sample, such as denaturing gradient gel electrophoresis (DGGE) and restriction fragment length polymorphism (RFLP) analysis (Bruce & Hughes, 1999).

It is important to analyze the species composition of the sample prior to cloning and sequencing because the average genome sizes of metazoans (100-3,000 Mbp), protozoans (30-200 Mbp) and yeasts (20 Mbp) are orders of magnitude larger than those of bacteria (3.6 Mbp) and archaea (2.5 Mbp). Thus, even if genomic DNA from eucarya is not desired, the presence of even a small number of eucaryal cells in the source material can nevertheless skew the distribution of total genomic DNA isolated from the cells. The net effect of having too many eucarya would be to dilute the number of coding genes in a mixed population of microbes (e.g., one containing members of all three domains). Ideally, a population of bacteria and/or archaea with less than 1% eucarya is preferred for constructing a genomic DNA library. One method to avoid 'contamination' by undesirable eucaryal DNA is to use source material that is dominated by bacteria and/or archaea. Bacterial species are typically extremely abundant in sulfide-rich 'black mud' marine sediment, as noted below, and yet this material contains relatively few higher organisms (less than about 1%). Thus, selecting a source of microbial material that already contains a desirable species distribution can greatly simplify downstream processing. Methods for quantifying the species diversity are described below.

After the data have been obtained regarding the species composition of the sample, one can use this information to calculate an entropy value for the species diversity. This analysis may comprise the use of Shannon's Entropy, group probability, the Kullback-Leibler distance, and the Good equation. Shannon's Entropy (H) can be defined as:

$$H = -\sum_{i=1}^{n} P_i \log_2 P_i \qquad \text{(Equation 1)}$$

where $P_i$ is the fractional probability of finding the $i$th member within the total population, and $n$ denotes the number of unique members of the population. Note that the term $P_i \log_2 P_i$ is taken to be 0 when $P_i = 0$. This equation is similar in nature to the Boltzmann-Planck equation used in statistical mechanics:

$$S = k \ln W \qquad \text{(Equation 2)}$$

where $W$ is the number of microstates per macrostate. For Eq. 1, maximum entropy is achieved as each of the probabilities for the 1 through $n$ members approaches $1/n$ (i.e., the members are equiprobable) and as $n$ approaches infinity. The logarithm can be taken to any base, but it is informative to use base 2 because the dimensionless quantity $H$ can be read as having units of 'bits'. For example, if there are two equiprobable members in a population (designated species 0 and species 1), then $P_1 = P_2 = 1/2 = 0.5$ and $H = 1$ bit. This is analogous to the information content of a computer bit that may take on the value of either 1 or 0 with equal probability, and hence carries one bit of information. As an example, one may consider the entropy of two different populations, both having 1,000 total members and 101 unique members. In the first population, one member (or species) accounts for 50% of the total population ($P_1 = 500/1{,}000 = 0.5$) and the remaining 100 unique members are equiprobable with each other ($P_{2,3,\ldots,101} = 5/1{,}000 = 0.005$). According to Eq. 1, $H = 4.32$ bits for this population. In the second population, all 101 unique members are approximately equiprobable, i.e., $P_{1,2,3,\ldots,101} \approx 10/1{,}000$. According to Eq. 1, $H = 6.66$ bits for this population. Thus, although the two populations have the same number of unique members, the second population has higher entropy because no single member is more or less numerous (on average) than any other.
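The two worked populations can be checked numerically. The following Python sketch evaluates Eq. 1 in base 2 from per-species counts; the function name and the count lists are illustrative choices. Because 101 species cannot share exactly 1,000 individuals evenly, the even population is modeled here as 101 species of 10 individuals each, which is an assumption.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits, computed from per-species counts (Eq. 1)."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count > 0:  # a species with P_i = 0 contributes nothing to the sum
            p = count / total
            h -= p * math.log2(p)
    return h


# Population 1: one dominant species (500) plus 100 species of 5 individuals.
skewed = [500] + [5] * 100
# Population 2: 101 species of equal abundance (assumed 10 individuals each).
even = [10] * 101

print(round(shannon_entropy(skewed), 2))  # 4.32
print(round(shannon_entropy(even), 2))    # 6.66
```

For the even population the result equals log2(101), the largest value attainable with 101 unique members.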

For a population of equiprobable members (i.e., $P_i = 1/n$), the sum from 1 to $n$ for all $i$ members is unity, and so Eq. 1 reduces to:

$H_E = -\log_2 P_i = \log_2 n$   Equation 3

where the subscript E reminds us that this entropy equation is true only for maximum entropy populations with equiprobable members. $H_E$ becomes larger as the number of members in the population increases. Thus, a large value for the Shannon Entropy is consistent with the intuitive notion of a large population that is not overwhelmed by a small number of different members. Additional information regarding this type of calculation is available online at http://mathworld.wolfram.com/Entropy.html .

Other measures of entropy are known which can also be used to quantify populations for complexity. One such measure is to determine maximum Group Probability ($P_G$):

$P_G = \prod_{i=1}^{n} P_i$   Equation 4

Equation 4 yields a maximum entropy value when members are restricted to a particular set and they are all equiprobable. If members from the specified set are missing from the population then $P_G$ is reduced. Also, while Shannon Entropy is not defined when the individual probabilities of the members do not sum to unity, $P_G$ remains defined (Arkin, 1992). In general, Shannon Entropy is preferable over $P_G$ when there is no predetermined or specified set from which to select.
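As an illustration of this behavior, the short Python sketch below treats the Group Probability as the product of the member probabilities over a fixed set. That reading of Eq. 4 is an assumption, and the function name and the three example distributions are hypothetical.

```python
import math

def group_probability(probs):
    """Product of the member probabilities over a specified set
    (the reading of Eq. 4 assumed in this sketch)."""
    return math.prod(probs)

# Three hypothetical distributions over the same four-member set.
equiprobable = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
one_missing = [0.50, 0.25, 0.25, 0.0]  # one member of the set is absent

for name, dist in [("equiprobable", equiprobable),
                   ("skewed", skewed),
                   ("one missing", one_missing)]:
    print(f"{name}: P_G = {group_probability(dist):.6f}")
```

Under this reading the equiprobable set gives the largest value, and a missing member drives the product to zero.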

In addition to having quantitative measures for the entropy of populations or distributions, it is also possible to compare the difference in entropy between two populations or discrete distributions using the 'relative entropy', also called the Kullback-Leibler distance, d:

$d = \sum_{k} p_k \log_2 (p_k / q_k)$   Equation 5

where $p_k$ and $q_k$ are the probability functions of two discrete distributions. Additional information is available online at http://mathworld.wolfram.com/RelativeEntropy.html .
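A minimal Python sketch of the relative entropy follows. It assumes the standard base-2 definition, $d = \sum_k p_k \log_2 (p_k / q_k)$, and applies it to the two 101-member populations discussed earlier. The function name and the handling of zero probabilities are choices made for this illustration.

```python
import math

def kl_distance(p, q):
    """Relative entropy d = sum_k p_k * log2(p_k / q_k), in bits."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of members")
    d = 0.0
    for pk, qk in zip(p, q):
        if pk == 0:
            continue          # a zero-probability term contributes nothing
        if qk == 0:
            return math.inf   # p has mass where q has none
        d += pk * math.log2(pk / qk)
    return d

# Skewed population (one member at 50%) versus an equiprobable population.
skewed = [0.5] + [0.005] * 100
even = [1 / 101] * 101

print(f"d(skewed || even) = {kl_distance(skewed, even):.3f} bits")
```

When the second distribution is equiprobable, the result equals the difference between the maximum entropy and the entropy of the first distribution. The measure is not symmetric, so swapping the arguments generally gives a different value.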

Another method to determine the complexity or entropy of a population uses so-called catch-and-release population statistics. One can estimate the number of different members in a population by observing a sample of the population and seeing how many individuals are unique. The occurrence of duplicates in a small sampling is an indication that the population is not very complex. Conversely, if one continues to find unique members even after large numbers of individuals in a population have been examined, this indicates that the population is very complex. The Good Equation (Good, 1953) quantifies this concept in terms of the percentage 'coverage', C:

$C = (1 - n/N) \times 100$   Equation 6

where $n$ is the number of unique samples and $N$ is the total number of samples. Thus, if one sees only two identical members after analyzing 100 members of a population, then $n = 99$; $N = 100$; $C = 1\%$. An estimate of the total number of unique members $U$ in the library is then:

$U = (N / C) \times 100$   Equation 7

which, in this example, is 10,000. We can then approximate Shannon's entropy as:

$H \approx -\log_2 (1/10,000)$   Equation 8

or about 13 bits.
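The coverage arithmetic can be sketched in Python as follows. This is an illustration only. It assumes that the total number of unique members is estimated as the sample size divided by the fractional coverage, which reproduces the 10,000 figure in the example, and that the entropy estimate treats all members as equiprobable. The function names are hypothetical.

```python
import math

def good_coverage(n_unique, n_total):
    """Percentage coverage C from the Good equation (Eq. 6)."""
    return (1 - n_unique / n_total) * 100

def estimated_unique_members(n_total, coverage_percent):
    """Estimated total unique members (assumed: sample size / fractional coverage)."""
    if coverage_percent <= 0:
        raise ValueError("no duplicates observed; coverage cannot be estimated")
    return n_total / (coverage_percent / 100)

def equiprobable_entropy_bits(n_members):
    """Maximum-entropy estimate for n equiprobable members."""
    return math.log2(n_members)

c = good_coverage(n_unique=99, n_total=100)
u = estimated_unique_members(100, c)
h = equiprobable_entropy_bits(u)
print(f"C = {c:.1f}%, U = {u:.0f}, H = {h:.1f} bits")
```

If no duplicates are found in the sample, the coverage is zero and the estimate is undefined, which is why the sketch raises an error in that case instead of returning a number.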

We have applied this analysis to members of an actual microbial population. Members of a population can comprise microbes from any of the domains Bacteria, Archaea or Eucarya within a community. In fact, the 13-bit Shannon entropy value and the 10,000-member population value derived above are the actual indices derived from our analysis of an environmental sample of bacteria from a 'black mud' environment, wherein a total of 100 clones or purified PCR products were partially sequenced to analyze their 16S rRNA genes. Only two out of one hundred 16S rRNA gene sequences were found to be identical, probably indicating that they were derived from the same species. Coverage of the population of the black mud environment was determined to be 1% based on the Good equation (Eq. 6). By this estimate, therefore, there were at least 10,000 species present in the sample. As sequencing of new isolates proceeds and more duplicates are found, the Good equation yields a better estimate of the true diversity.

5. Nucleic Acid Extraction

There are a number of methods that can be used to extract random nucleic acid fragments from heterogeneous populations of microorganisms. The extraction procedure usually must be optimized for each type of sample, based on the organisms to be targeted and the physico-chemical composition of the source material (e.g., soil type, concentration of humic acids, etc.). A number of different protocols (including procedures for extracting nucleic acids from soil, which is one of the most difficult source materials) have been reviewed in the scientific literature (Zhou et al, 1996; Frostegard et al, 1999; Miller et al, 1999; Watts et al, 1999; Wery et al, 1999). For cells that have a strong cell wall or that contain large amounts of extracellular polysaccharide, such as yeast, filamentous fungi or certain types of bacteria, the extraction step can be coupled with or preceded by a mechanical disruption step. Generally, however, procedures which involve disruption, pipetting or extensive manipulation will shear the high molecular weight DNA into smaller pieces (shorter than about 10-50 kbp). These shorter fragments are still useful for sequencing, but for isolating much larger fragments, it generally is necessary to use agarose plugs or beads, as described below. The large cloned inserts can then be fragmented further, if desired, and used to create a subclone library.

Mechanical disruption of cells can be accomplished by, e.g., sonication, shearing in a French Press, homogenization with a Dounce or Polytron homogenizer (Brinkmann Instruments, Westbury, NY), repeated freeze-thaw cycles, liquid agitation (bead-milling) with small glass or plastic beads (for example, 1-2 minutes in a Bead-Beater device from BioSpec Products, Bartlesville, OK), or by freezing the cells in liquid nitrogen and pulverizing them with a pre-chilled mortar and pestle or a Freezer/Mill (Spex CertiPrep, Metuchen, NJ). Due to the strong collisional forces created during bead-milling, few cell types are able to avoid disruption. Bead-milling is therefore particularly useful for diversity and phylogenetic analysis because it extracts DNA from microorganisms that may be resistant to other disruption techniques (Bruce et al, 1999). The thoroughness of the disruption is important for obtaining an accurate representation of all the genomes in the population. In order to obtain very large genomic fragments, however, care must be taken when using any of the physical disruption techniques to minimize shearing of the DNA as the cells are broken.

Mechanical disruption can also be combined with enzymatic digestion to weaken the cell wall, spore coat, capsule or outer membrane surrounding various cell types. The type of enzymatic treatment employed depends on the type of organism that is targeted. Enzymes such as lysozyme and lysostaphin can be used on gram-negative and gram-positive bacteria, respectively. Combinations of enzymes which may include beta-glucanases, such as lyticase and zymolase, can be used on yeast and other fungi. This treatment can be further enhanced by adding Proteinase K or other proteases to digest gram-positive bacteria that are not sensitive to lysostaphin. A typical digestion mixture for lysozyme treatment contains lysozyme (about 10 mg/g of cells, or about 1 mg/ml), 50 mM NaCl, 50 mM EDTA (ethylenediaminetetraacetic acid, pH 8.0) and 50 mM sodium phosphate (pH 8.0). Digestion is carried out at 37°C for 2 hours. These enzymes are available from a variety of commercial suppliers, including Sigma-Aldrich (St. Louis, MO) and Ambion (Austin, TX). Non-enzymatic nucleic acid extraction kits for yeast are also available from Epicentre (Madison, WI). Mechanical/enzymatic disruption can also be used in combination with chemical disruption by guanidinium isothiocyanate to extract nucleic acids. An RNA extraction kit employing this procedure is available from Ambion (Austin, TX).

Following the optional steps of fractionation and mechanical/enzymatic disruption, a complex population of microorganisms is thoroughly lysed and extracted using chemical treatment to obtain the nucleic acids. Methods for extracting genomic DNA from fungal mycelia are described in Ashktorab & Cohen (1992) and Randall & Judelson (1999). In addition to the specific procedures described below, general methods for nucleic acid extraction and purification are described in Sambrook et al. (1989), Birren et al. (1999a,b) and Ausubel et al (2000). For direct extraction from soil material without any pretreatment, a solution of 1-2% (w/v) sodium dodecyl sulfate (SDS) in 50 mM Na-EDTA (pH 8.0), 100 mM NaCl and 50 mM sodium phosphate (pH 8.0) can be used to dissolve the cell membranes and denature proteins. The sample is incubated with occasional inversion for 2 hours at 60°C. Organic solvent extraction using buffer-equilibrated phenol (pH 8.0) or chloroform-isoamyl alcohol (24:1, v/v) improves the DNA yield when combined with detergent lysis. Treatment with these lysis/extraction reagents can also be combined with brief bead-milling. These treatments can also be performed on cells that have been separated from the soil matrix. In a typical soil extraction without pretreatment to separate the cells, between 0.5 and 5 g of soil material is used. When employing phenol extraction, DNA yields and purity can be enhanced by using Phase Lock Gel (Eppendorf Scientific, Hamburg, Germany) to create a solid barrier between the organic and aqueous phases when extracting the upper aqueous phase. An additional extraction step with 1% CTAB (hexadecyltrimethylammonium bromide) and PVPP (polyvinylpolypyrrolidone) can also be employed after the cell lysis step to remove polysaccharides and humic substances that might interfere with later enzymatic processing (Zhou et al, 1996). The NaCl concentration should be maintained above 0.5 M in the presence of CTAB to prevent DNA precipitation.
After lysis/extraction, any remaining debris is removed from the aqueous suspension by centrifugation at 5,000 x g for 5 minutes. Crude DNA is then precipitated by adding sodium acetate to a final concentration of 0.3 M followed by 2.5 volumes of ice-cold ethanol. After mixing, the DNA is pelleted by centrifugation at 14,000 x g for 10 minutes. The pellet is then resuspended in 1 ml of 10 mM Tris (pH 8.0) and 1 mM EDTA. RNA can be hydrolyzed by adding RNase A to a final concentration of 0.1 mg/ml and incubating for 30 min at 37°C. The DNA is then extracted again with buffered phenol/chloroform/isoamyl alcohol (24:24:1). The DNA can be precipitated again by adding 0.6 ml of isopropanol, incubating for 1 hour, and centrifuging the sample at 14,000 x g for 15 minutes. The DNA pellet is then washed with cold 70% ethanol, dried, and resuspended in a minimal amount (typically about 100 microliters) of sterile, ultra-pure water. DNA can be further purified by column chromatography and agarose gel electrophoresis as described in Miller et al. (1999) and Frostegard et al. (1999), or by using columns and reagents contained in the UltraClean Soil DNA Kit from Mo-Bio Laboratories (Solana Beach, CA). Methods for extracting bacterial plasmid DNA are similar to those described above and are reviewed in Crosa et al (1994) and Mai & Wiegel (1999). The purity of the DNA can be checked by measuring an aliquot in a spectrophotometer and calculating the A₂₆₀/A₂₈₀ ratio, where the subscript denotes the measuring wavelength. The ratio should be at least about 1.0 and preferably at least about 1.5. DNA fragment size can be determined by electrophoresis in 0.3% SeaKem Gold Agarose (FMC BioProducts, Rockland, ME) in 40 mM Tris-acetate and 1 mM EDTA (TAE buffer), pH 8.0, with appropriate DNA size standards (Sambrook et al, 1989).
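The precipitation and purity-check arithmetic in the preceding paragraph can be sketched in Python. Several details are assumptions made for the illustration and are not stated in the text: a 3 M sodium acetate stock, ethanol volumes measured relative to the sample after the acetate has been added, and the labels "preferred", "acceptable" and "repurify".

```python
def precipitation_volumes(sample_ul, acetate_stock_molar=3.0,
                          acetate_final_molar=0.3, ethanol_volumes=2.5):
    """Return (sodium acetate, ethanol) volumes in microliters.

    The acetate volume V1 satisfies C1 * V1 = C2 * (sample + V1).
    The 3 M stock is an assumed value, not taken from the text.
    """
    acetate_ul = (sample_ul * acetate_final_molar
                  / (acetate_stock_molar - acetate_final_molar))
    ethanol_ul = ethanol_volumes * (sample_ul + acetate_ul)
    return acetate_ul, ethanol_ul

def purity_grade(a260, a280):
    """Classify the absorbance ratio against the 1.0 and 1.5 thresholds in the text."""
    ratio = a260 / a280
    if ratio >= 1.5:
        return "preferred"
    if ratio >= 1.0:
        return "acceptable"
    return "repurify"

acetate, ethanol = precipitation_volumes(900)
print(f"add {acetate:.0f} ul sodium acetate, then {ethanol:.0f} ul ethanol")
print(purity_grade(a260=0.30, a280=0.18))
```

For a 900-microliter sample this gives about 100 microliters of acetate stock and 2,500 microliters of ethanol. A different stock concentration changes only the first figure.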

The purified population of genomic DNA fragments can be further fractionated, if desired, by equilibrium centrifugation in the presence of bis-benzimidazole (Holben & Harris, 1995; Nusslein & Tiedje, 1998), a technique which separates DNA fragments based on their relative G + C content. This technique is useful for separating pooled genomic DNA, since the G + C content of genomic DNA tends to differ among taxonomic groups. The DNA can also be size-fractionated by gel electrophoresis using low melting point agarose with DNA molecular weight size standards. DNA fragments of the appropriate size are excised from the gel, and the gel slice is melted in a small volume of buffer to release the DNA. The DNA is then subjected to repeated washings in a Microcon-100 microcentrifugation column (Amicon Corp.; Beverly, MA).

6. Isolation of Very Large Genomic DNA Fragments

For cloning into bacterial artificial chromosome (BAC) vectors (Shizuya et al, 1992) and P1 bacteriophage-derived artificial chromosome (PAC) vectors (Ioannou et al, 1994), it is desirable to have very large genomic DNA fragments. The requirement for a large average fragment size necessitates the use of extremely gentle DNA extraction procedures and minimal handling to avoid excessive shearing. The preferred method to isolate high molecular weight DNA fragments and avoid unnecessary shearing is first to separate and concentrate the microbial cells and embed them in an agarose plug (2% (w/v) SeaPlaque GTG agarose from FMC BioProducts (Rockland, ME)). The DNA is then extracted by lysing the cells in the plug with a combination of lytic enzymes and detergent (0.5-1% N-lauroylsarcosine sodium salt and 0.2% sodium deoxycholate from Sigma-Aldrich, St. Louis, MO) and separating the DNA fragments by pulsed-field gel electrophoresis (PFGE). Regions of the gel containing the large DNA fragments are excised. The DNA can be digested with restriction enzymes in the gel slice and ligated into an appropriately digested vector, such as an artificial chromosome vector (see below). Methods for preparing high molecular weight DNA are described in Reithman et al. (1999).

A procedure for obtaining high molecular weight genomic DNA from bacterial cells embedded in agarose plugs using lytic enzymes and detergent is described below. It is a modification of the procedure contained in the Instruction Manual for the CHEF Genomic DNA Plug Kits from Bio-Rad Laboratories (Hercules, CA):

1) Obtain several hundred ml of mud slurry or other material containing a high density of bacteria. Approximately 500 ml should be sufficient to obtain more than about 5 x 10⁹ cells.

2) Optionally, filter the mud slurry through a series of increasingly fine mesh screens (e.g., the Cellector from Thermo EC; Holbrook, NY) to remove large organisms, debris and soil and mineral particles. Mesh size 10 = 1.91 mm opening; mesh size 500 = 0.0254 mm opening.

3) Vortex aliquots of the mud slurry for about 2 minutes to detach adherent cells.

4) For larger volumes of mud, spin the material in a sterile Falcon blue-cap 2070 tube at 1,200 rpm (259 x g) in an Eppendorf Model 5804R tabletop centrifuge for 5 min. For small volumes (less than 2 ml), spin the material in a sterile plastic microfuge tube in an Eppendorf 5415D microcentrifuge for about 8 s (until the speed indicator reaches 10,000 rpm, or about 8,200 x g).

5) Discard the pellet containing soil particles and debris. Transfer the turbid supernatant containing the bacteria to a clean tube. Repeat the centrifugation, if necessary, until soil particles have been removed.

6) For larger volumes of cell suspension, spin the material in a Falcon blue-cap 2070 tube at 5,000 rpm (4,500 x g) in an Eppendorf Model 5804R tabletop centrifuge for 5 min. For small volumes (less than 2 ml), spin the material in a plastic microfuge tube in an Eppendorf 5415D microcentrifuge at 13,200 rpm (14,200 x g) for 3 min. Pellets will appear dark orange-brown to black. Combine pellets from several tubes or repeat the process with just one or two tubes to build up the cell mass.

7) Repeat the centrifugations until enough slurry has been processed to provide a desired number of cells. Resuspend the cell pellet in 1 ml of sterile filtered supernatant from the original mud slurry or an appropriate isotonic buffer.

8) Remove a 20-microliter aliquot from each sample and add 1 microliter of 2% crystal violet (Sigma-Aldrich, St. Louis, MO) in isopropanol to stain the bacteria. Measure the cell concentration in a hemocytometer as described in the Bio-Rad manual. Cells may first need to be diluted to make counting easier.

9) Remove an aliquot from the cell suspension that contains a desired number of cells. Resuspend them to an appropriate volume in Bio-Rad Cell Suspension buffer.

10) Prepare agarose plugs in 2% agarose using the method described in the Bio-Rad manual. It may be desirable to make sets of plugs at various cell concentrations (e.g., 5 x 10⁷, 5 x 10⁸, 5 x 10⁹ cells per ml of plugs).

11) Remove the plugs from their molds after the agarose has solidified. The plugs can be pushed into the first lytic solution that is part of the kit (for example, the Bio-Rad lysozyme solution). However, for agarose plugs of environmental bacteria and those containing high concentrations of cells, it is preferable to pre-incubate the plugs with additional lytic enzymes prior to beginning the lysis steps that are described as part of the Bio-Rad kit. For example, both lysostaphin and mutanolysin can be used to enhance the lysis of gram-positive bacteria. It may also be useful in some cases to treat the plugs with elevated concentrations of the enzymes (DNase-free) that are used as part of the Bio-Rad bacterial DNA module (e.g., lysozyme). It may also be necessary to extend the incubation times to improve the extent of cell lysis and DNA yield. About 5-10 ml of solution can be used for ten 100-microliter plugs.

12) For improved lysis of highly diverse populations of bacteria, the plugs containing the cells can first be suspended in buffer (20 mM Tris-HCl, pH 7.5, 100 mM NaCl, 0.1 mM EDTA in sterile-filtered distilled water) containing 0.5 mg/ml of lysostaphin (Sigma-Aldrich, St. Louis, MO). Incubate the plugs at 37°C for 1 hour with gentle agitation.

13) Carefully decant the lysostaphin solution and replace it with buffer (20 mM Tris-HCl, pH 8.0, 100 mM NaCl, 10 mM EDTA in sterile-filtered distilled water) containing 100 Units/ml of mutanolysin and 10 mg/ml lysozyme (Sigma-Aldrich, St. Louis, MO). Incubate the plugs at 37°C for 3 hours with gentle agitation.

14) Carefully decant the mutanolysin/lysozyme solution and replace it with the Bio-Rad lysozyme solution provided with the Bacterial Module of the CHEF Genomic DNA Plug Kit.

15) Proceed with the remainder of the Bio-Rad protocol (i.e., lysozyme and proteinase K treatments in the presence of detergent, followed by PMSF treatment and washing to inactivate the protease). Incubate the plugs in the Proteinase K solution for 12-24 hours.

16) Store the washed plugs in 1x Wash Buffer at 4°C.

7. RNA Extraction

There are also a number of well-known methods for recovering mRNA from eucarya, such as yeast and filamentous fungi. Methods for isolating mRNA from fungi using freeze-thawing, grinding and extraction with guanidinium isothiocyanate are described in Dalboge & Heldt-Hansen (1994). This reference also describes methods for cDNA library construction and expression cloning of the fungal genes in yeast. The MasterPure RNA isolation kit from Epicentre (Madison, WI) is also useful for isolating RNA from yeast for the purpose of synthesizing cDNA libraries. A method for extracting RNA from fungi using a modification of an SDS-based procedure of Ohi & Short (1980) is described in U.S. Pat. No. 5,393,670. SDS/phenol extraction, particularly when coupled with brief bead-milling, is useful for recovering rare messenger RNA. However, the extraction step must be performed rapidly and immediately after harvesting the cells from the source material or medium. SDS/phenol extraction of RNA is described in Mauchline et al (1999) and Ausubel et al. (2000).

After the RNA has been extracted and purified, the eucaryotic poly(A)⁺-containing mRNA is selectively recovered by affinity purification. This can be accomplished using either an oligo(dT)-cellulose column or oligo(dT)-containing magnetic beads. Methods for oligo(dT)-cellulose chromatography are described in Ausubel et al. (2000), and a pre-packaged column kit is available from Amersham Pharmacia Biotech (Piscataway, NJ). Oligo(dT) magnetic beads are available from Promega (Madison, WI). Methods for synthesizing double-stranded cDNA libraries from purified mRNA are well known in the art, and are described in Cowell & Austin (1997) and Ausubel et al. (2000). A kit for synthesizing cDNA libraries (the Universal RiboClone cDNA Synthesis System) is available from Promega (Madison, WI).

8. Further Fragmentation of Genomic DNA

To facilitate shotgun subcloning of large inserts for sequencing (or if the average measured size of the DNA fragments is too large to permit efficient cloning into a particular vector), the fragment length can be further reduced by mechanical shearing or restriction endonuclease digestion of the DNA. Smaller fragments containing incomplete genes may also be preferred in some cases for genes that cannot be cloned in complete form because their products are toxic to E. coli or other host organisms. Mechanical shearing has the advantage of creating breaks that are sequence-independent and randomly distributed. Mechanical shearing techniques include sonication (Favello et al, 1995; Fleischmann et al, 1995) and nebulization (Fraser et al, 1995). For sonication, the DNA is diluted to 10-200 ng/microliter in buffer containing 30% glycerol. If the optimal treatment time is not precisely known, the sample must be divided into several aliquots, and several of the aliquots are treated for varying amounts of time in plastic microcentrifuge tubes or Corex tubes. The sonication time for each tube is stepped up by 10 s intervals to a maximum of 1 min.
The tubes are kept in constant contact with an ice water bath and are allowed to cool for 1 min between each 10 s treatment. The DNA is sonicated according to the above protocol at the lowest power setting using a Branson Ultrasonics Corp. (Danbury, CT) Model S-450A Sonifier equipped with a 3 mm probe tip. A model S-150D can be used for small-volume samples. For microliter volumes of sample, a cup horn sonicator (Heat Systems model XL2015 equipped with a CL4 cup-horn probe) is preferred. Details of this method are described in Wilson & Mardis (1999b). Agarose gel electrophoresis is then used to determine the optimal conditions, and the remaining DNA is sheared using those conditions. After the sheared DNA is precipitated and resuspended, any remaining staggered ends must be repaired. This repair can be done in several ways. For example, the DNA can be treated with bacteriophage T4 DNA polymerase to fill in 5' overhangs and remove 3' overhangs. The Klenow fragment of DNA polymerase I is next used to repair any remaining 5' overhangs. Finally, bacteriophage T4 polynucleotide kinase is used to phosphorylate the 5' ends (Wilson & Mardis, 1999b). Another method involves digesting an aliquot with BAL-31 nuclease (New England BioLabs, Beverly, MA), as described in Fleischmann et al. (1995), followed by DNA polymerase repair to create uniform blunt ends. The simplest method, however, is to use Mung Bean nuclease (New England BioLabs; Beverly, MA), which removes the staggered ends and polishes them in one step.

For nebulization, 1-2 ml of buffered DNA solution (at 10-50 microgram/ml) in 25% glycerol may be passed through an AeroMist Nebulizer from IPI Medical Products (Chicago, IL) at 10-20 psi of nitrogen for 60 s. Alternatively, a plastic nebulizer that is supplied as part of the TOPO Shotgun Cloning kit from Invitrogen (Carlsbad, CA) may be employed according to the manufacturer's instructions. The DNA sample is maintained in an ice water bath at all times during the shearing process. An aliquot is subsequently examined by electrophoresis to check the size distribution. The DNA is precipitated, resuspended in buffer and treated with nuclease as described above to generate clean blunt ends for cloning.

Genomic DNA can also be randomly fragmented by partial cleavage with a restriction endonuclease. Small aliquots of the DNA are first tested by incubation with the enzyme for various time intervals, and the products are then examined by electrophoresis to determine the optimal digestion conditions. Although many different restriction enzymes can be used for this purpose, the enzyme Sau3A I, which generates overhangs that are compatible with BamH I overhangs, is often chosen. When the digestion reaction is complete, the enzyme is inactivated by heating the sample at 70°C for 15 min. Protocols for DNA fragmentation by partial restriction enzyme digestion are well known in the art, and details are provided in Rosteck et al (1999). Digestion of high molecular weight DNA embedded in agarose plugs is described in Birren et al. (1999c).

9. Cloning Vectors for Library Construction

The choice of vector for creating a library depends on the size of the fragments to be cloned. It is therefore important to size fractionate the DNA by gel electrophoresis in a preparative sizing gel and to purify the bands having a desired length prior to ligation into the vector (Birren et al., 1999c). Small fragments in the range of 0.1-12 kb can be carried by typical plasmid vectors such as pBR322, pUC, Litmus (New England BioLabs; Beverly, MA) and pBluescript (Stratagene; La Jolla, CA), or by phage vectors such as M13. Larger inserts in the range of 10-20 kb can be efficiently packaged by bacteriophage lambda vectors, available as part of the Gigapack kit from Stratagene. Inserts of 35-45 kb can be cloned into cosmid (Hohn, 1979) or fosmid (Kim et al, 1992) vectors. Slightly larger inserts (30-90 kb) can be carried by the P1 cloning system (Sternberg, 1990). The largest inserts, in the range of 30-300 kb, are preferably cloned into PACs (Ioannou & de Jong, 1996) or BACs (Shizuya et al, 1992). Fosmids and BACs are preferred for inserts larger than about 30 kb because they are derived from the single-copy F-factor of E. coli. This feature allows the library to be more stably maintained during propagation than it would be in a cosmid.
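The size ranges in the preceding paragraph can be collected into a simple lookup, sketched here in Python. The table values are taken from the text, while the table name, function name and vector-class labels are hypothetical.

```python
# Insert-size ranges (kb) for each vector class, as listed in the text.
VECTOR_RANGES_KB = [
    ("plasmid or M13 phage", 0.1, 12),
    ("bacteriophage lambda", 10, 20),
    ("cosmid or fosmid", 35, 45),
    ("P1", 30, 90),
    ("PAC or BAC", 30, 300),
]

def candidate_vectors(insert_kb):
    """Return every vector class whose stated range includes the insert size."""
    return [name for name, low, high in VECTOR_RANGES_KB
            if low <= insert_kb <= high]

for size in (5, 11, 25, 40, 150):
    print(f"{size} kb: {candidate_vectors(size) or 'no class listed'}")
```

Several ranges overlap, so a 40 kb insert matches three classes and the choice then rests on the stability considerations described above. Sizes between 20 and 30 kb fall outside every stated range.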

A population of small blunt-end DNA fragments shorter than about 5 kb can be cloned into standard plasmid vectors, such as the pCR4Blunt-TOPO plasmid from Invitrogen (Carlsbad, CA), which is part of a commercial TOPO Shotgun Cloning kit. Methods for preparing the insert and vector for blunt-end cloning are described in Sambrook et al (1989) and Ausubel et al (2000).

If blunt-end fragments are to be ligated into a cloning vector that requires protruding ends, one can attach linkers or adapters that contain appropriate restriction sites such as, for example, EcoR I. Linkers, adapters and methylases are available from New England BioLabs (Beverly, MA). For attaching linkers, the genomic DNA or cDNA inserts are first methylated to prevent cleavage of the insert sequence during subsequent digestion (Hoheisel et al, 1989). Phosphorylated, unmethylated EcoR I linkers are then ligated onto the ends of the insert fragments. The constructs are digested with EcoR I to create the proper overhangs and remove multiple linkers (concatemers). The excess linkers are removed by chromatography on a Sepharose CL-4B column (Amersham Pharmacia Biotech; Piscataway, NJ). Alternatively, adapters, which do not require methylation of the insert DNA, can be used. Protocols for attaching linkers and adapters are described in Ausubel et al (2000). This protocol can be used to clone fragments of any size.
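The reason for methylating the insert before linker digestion can be illustrated by locating internal EcoR I recognition sites (GAATTC) in an insert sequence. The Python sketch below is illustrative only; the function name and the example sequence are hypothetical.

```python
ECORI_SITE = "GAATTC"  # EcoR I recognition sequence (palindromic)

def find_sites(sequence, site=ECORI_SITE):
    """Return the 0-based start position of every occurrence of the site."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

# Hypothetical insert carrying two internal EcoR I sites.
insert = "AAGAATTCTTGAATTC"
sites = find_sites(insert)
print(f"{len(sites)} internal site(s) at positions {sites}")
```

Because the site is palindromic, scanning one strand is sufficient. Any internal site reported here would be cut during linker digestion unless the insert had been methylated first.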

Large, double-stranded DNA fragments, such as cDNAs or genomic DNA, can also be cloned into vectors using lambda phage packaging extracts. For example, the Gigapack packaging extract (Stratagene; La Jolla, CA) can be used to clone a library of environmental DNA fragments into a fosmid vector for hybridization screening (Stein et al, 1996). This method has the advantage that the packaging extract automatically selects for large inserts without having to size fractionate them. Populations of fragments can also be cloned into lambda phage for creating lambda libraries using, for example, the Lambda FIX II or Lambda DASH II vector kits for genomic cloning or the ZAP vector kits for cDNA cloning. These kits are available from Stratagene (La Jolla, CA).

Size-fractionated fragments smaller than about 10 kb that have 'sticky' ends (i.e., overhangs created by restriction endonuclease digestion) can be cloned into standard plasmid or phage vectors using techniques that are well known in the art (Sambrook et al, 1989; Ausubel et al, 2000). For larger inserts, fosmid or BAC vectors are preferred. The procedure for efficient cloning of fragments into BAC vectors comprises the following steps: (1) The BAC vector DNA (e.g., pBeloBAC11 or pBACe3.6), which has been purified to remove any contaminating E. coli chromosomal DNA, is digested with the appropriate restriction enzyme (i.e., it is linearized) and then dephosphorylated with alkaline phosphatase. Tests are performed (comprising self-ligation or ligation with dummy inserts, transformation and blue/white colony screening) after both the linearization and dephosphorylation steps to ensure that (a) the vector has been thoroughly digested; (b) the cloning sites have not been damaged; and (c) the dephosphorylation of the vector is complete enough to inhibit self-ligation. [White colonies on X-gal plates indicate BAC clones with inserts; blue colonies indicate clones lacking inserts.] (2) The insert DNA is ligated into the vector with a molar excess of vector (e.g., at a ratio between 2:1 and 10:1) and at low DNA concentration (e.g., 20 ng of insert and 1-5 ng of vector). Ligation tests are performed at various ratios of vector DNA to size-selected insert DNA to determine the optimal ratio. (3) After salt removal by dialysis, transformation-competent E. coli cells are transformed with the ligation mixture. Electroporation is a convenient method to use for transformation. However, because the efficiency of transformation is greatly biased in favor of smaller inserts, care must be taken to ensure that the ligation mixture is free of small inserts (i.e., those smaller than about 30 kb). The E. coli strain DH10B is preferred for this procedure because it can be efficiently transformed with large plasmids and has the appropriate genotype in terms of recombination, restriction, and DNA modification. Electrocompetent DH10B cells (ElectroMAX DH10B) can be purchased from Life Technologies (Rockville, MD). These cells are also suitable for transformation with cDNA vector libraries and small-insert plasmid libraries. (4) The transformed cells are spread on LB (Luria-Bertani) plates containing chloramphenicol, X-gal and IPTG. (5) The plates are incubated for at least 18 hours at 37°C so that the color indicator (indigo) has sufficient time to develop in those colonies whose vector DNA does not contain an insert. (6) White colonies are picked and used to inoculate liquid cultures. These cultures are used to create frozen stocks of each clone for long term storage, and to provide cells for preparing BAC DNA for sequencing. Prior to full-scale plating of the library, a small-scale plating is performed to determine the appropriate plating density, the proportion of colonies lacking inserts, and the average insert size. BAC DNA for insert analysis, sequencing and downstream cloning can be obtained via alkaline lysis minipreps of E. coli cultures (e.g., the R.E.A.L. Prep 96 System from Qiagen; Valencia, CA). BAC clones containing inserts can be further analyzed by restriction digests and PFGE fingerprinting to determine insert sizes and to screen for possible spurious rearrangements of the DNA sequence. Methods for cloning large DNA inserts into BACs and analyzing the inserts are described in Birren et al. (1999c).

10. Sequencing DNA Inserts

Methods for fluorescence-based sequencing are well known in the art (Wilson & Mardis, 1999a), with cycle sequencing being the most widely used. The process uses a modification of the Sanger method of dideoxy (ddNTP)-mediated chain termination, wherein fluorescent dye is incorporated into the DNA that is synthesized in vitro by DNA polymerase. The entire process of preparing the templates and the sequencing reactions can be automated for high-throughput processing of multiple clones. Fluorescent dye-labeled primers can be used for routine sequencing.

Alternatively, fluorophore-labeled terminator nucleotides can be used to simplify the preparation of multiplexed sequencing reactions (since only one reaction is required for each template, rather than a separate reaction for each fluorophore). Using dye-labeled terminators also has the advantage of improving the read in compressed regions of the gel. Additional methods for reading through difficult sequences, such as those containing repeats, are described in Wilson & Mardis (1999b). Sequencing reagent kits are available from Applied Biosystems (Foster City, CA) and Amersham-Pharmacia Biotech (Piscataway, NJ). Automated sequencing instruments for performing electrophoresis and gel reading are available from a number of manufacturers, including Applied Biosystems (Foster City, CA), Molecular Dynamics/Amersham-Pharmacia Biotech (Piscataway, NJ) and LI-COR (Lincoln, NE). Once the laser scanner and computer have converted the fluorescence signals into a digital format, computer software such as Sequencher (Gene Codes; Ann Arbor, MI) is used to perform the base-calling operations, which convert the signals into chromatograms and corresponding nucleotide sequences. The finished nucleotide sequence can then be analyzed using the methods described below.

An alternative method to Sanger sequencing, which involves enzymatic synthesis of a detectable product, is sequencing by hybridization. In this method, a microchip is covered with DNA probes consisting of short oligonucleotides of known sequence that are chemically immobilized on the chip surface. Each different oligonucleotide sequence is placed at a defined position on the chip. Hybridization of fluorescently labeled DNA to the immobilized oligonucleotides generates a fluorescent spot. Hybridization with DNA fragments can discriminate among perfect duplexes, duplexes containing single internal mismatches, and major parts of duplexes containing terminal mismatches. The hybridization signals can be read by a fluorescence microscope with a charge-coupled device (CCD) camera connected to a computer. By analyzing the signals created by the combinatorial binding of the fragment to various spots on the chip, the sequence of the entire fragment can be reconstructed. A universal sequencing chip (the HyChip system) is available from Hyseq (Sunnyvale, CA).

Large DNA inserts, such as those cloned into BACs, PACs or fosmids, can be directly sequenced by the Sanger method using primers that target the flanking sequence of the vector or any known sequences within the large insert (e.g., a 16S rRNA gene). Primer-directed sequencing can also be used to fill in gaps between sequenced contigs in a random shotgun strategy (see below). However, for end-sequencing of very large inserts, reading in from the flanking vector regions will only allow one to sequence the 5' and 3' ends of the insert. To obtain internal sequences by the direct approach, primer walking can be used. The primer walking approach uses a sequencing primer designed to hybridize to the 3' end of a segment that has previously been sequenced. The primer is designed to anneal to a region that lies about 50-100 bp upstream of the 3' end of the known sequence so that there is recognizable overlap. The sequence is then extended from this primer using the same template. This new sequence is used to design the next primer, and the process is repeated. This procedure is most efficient for relatively small inserts (approximately 1-5 kb).
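The primer placement step of primer walking can be illustrated with a short sketch. The function names, the 20-nt primer length and the rule of preferring a GC content near 50% are assumptions made for illustration only; they are not taken from the method described above, and a practical primer design would also check melting temperature, secondary structure and uniqueness.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def next_walking_primer(known_seq, primer_len=20, min_offset=50, max_offset=100):
    """Pick the next primer from already-sequenced DNA (hypothetical helper).

    The primer's 3' end is placed between min_offset and max_offset bases
    upstream of the 3' end of the known sequence, so the new read overlaps
    the old one. Among the candidates, the one whose GC fraction is closest
    to 0.5 is returned (an illustrative criterion, not part of the source).
    Returns (start_index, primer_sequence).
    """
    known_seq = known_seq.upper()
    if len(known_seq) < max_offset + primer_len:
        raise ValueError("known sequence too short for the requested window")
    best = None
    for offset in range(min_offset, max_offset + 1):
        end = len(known_seq) - offset      # primer 3' end on the known strand
        start = end - primer_len
        candidate = known_seq[start:end]
        balance = abs(gc_fraction(candidate) - 0.5)
        if best is None or balance < best[0]:
            best = (balance, start, candidate)
    return best[1], best[2]
```

Each round of walking would append the newly read bases to `known_seq` and call the function again to place the following primer.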

Direct sequencing of cDNA clones in bacteriophage lambda vectors is problematic because it is difficult to purify sufficient quantities of the template. It is therefore preferable to generate a PCR product using primers directed to the flanking vector sequence or to use the Lambda ZAP vector (Stratagene; La Jolla, CA) to recover the insert in plasmid form via the lambda autoexcision mechanism. Shotgun sequencing of lambda clones is also inefficient because of the large amount of vector sequence present. It is therefore preferable to subclone the insert into a smaller vector prior to shotgun sequencing (see below). These subclones can also be analyzed by primer walking or transposon sequencing, as described below.

For de novo sequencing of large inserts (greater than about 5 kb), a random approach, such as shotgun sequencing, is preferred. In this method, the DNA is fragmented into smaller segments which are used to generate a subclone library. Mechanical shearing, described above, is the preferred method because it creates the most random fragments. The ends of the sheared DNA are enzymatically repaired, and the DNA is size-fractionated to remove very large and very small fragments. The purified, size-selected fragments are cloned into M13 or plasmid vectors, and the ligation products are used to transform E. coli. Individual colonies are picked and used to inoculate liquid cultures. DNA is harvested from each culture by standard miniprep methods. The purified DNA from each clone is used as a template for the sequencing reactions, and the reaction products are analyzed by gel electrophoresis, as described above. The inserts in the subclone library are sequenced using primers that anneal to the flanking vector sequences. Regions that cannot be reached by end-sequencing can be sequenced by primer walking.

For approximately 95% coverage of a large insert, the subclones are typically sequenced so as to achieve four- to six-fold redundancy. This means that a 40 kb insert typically requires sequencing about 600 fragments averaging about 400 bp in length. Strings of overlapping sequence fragments are identified by computer analysis of the individual sequence fragments, as described below. These can then be assembled electronically into contigs to reconstruct the full sequence of the original insert. Contigs are defined as contiguous, or uninterrupted, DNA sequences generated from a set of overlapping sequences of shorter length. In some cases wherein the species entropy of the source DNA is not too great, it is possible to use this reconstruction to determine the entire genomic sequence of an unculturable organism or a significant fraction thereof. Note, however, that the utility of the present invention does not necessarily require complete sequencing of the cloned DNA, since sequences that are useful for biotechnological applications can be obtained from almost any size fragment.
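The redundancy arithmetic above can be checked with a few lines of code. The sketch below also estimates the fraction of the insert covered using the idealized Lander-Waterman (Poisson) model, which is an assumption introduced here and is not cited in the text; real subclone libraries are not perfectly random, which is one plausible reason for sequencing to a higher redundancy than this idealized model alone would suggest.

```python
import math


def fold_coverage(n_reads, read_len, insert_len):
    """Average redundancy: total bases sequenced divided by insert length."""
    return n_reads * read_len / insert_len


def reads_needed(insert_len, read_len, coverage):
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(coverage * insert_len / read_len)


def expected_fraction_covered(coverage):
    """Idealized Poisson estimate of the fraction of bases read at least once."""
    return 1.0 - math.exp(-coverage)


# The example from the text: 600 reads of about 400 bp on a 40 kb insert.
print(fold_coverage(600, 400, 40_000))   # six-fold redundancy
print(reads_needed(40_000, 400, 4))      # reads for four-fold redundancy
print(expected_fraction_covered(3.0))    # idealized coverage at three-fold
```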

A third method for sequencing large inserts employs transposons. Random insertion of a transposon sequence into a large cloned insert can be used to generate primer binding sites, since the sequence of the transposon itself is known. A library of inserts containing random transposon insertions can then be sequenced and analyzed to generate a set of contigs. In vitro kits are available for inserting transposons (containing selectable antibiotic resistance markers) into plasmids, cosmids and BACs. The GPS-1 Genome Priming System based on the Tn7 transposon is available from New England BioLabs (Beverly, MA) and the EZ::TN Insertion kit based on the Tn5 transposon is available from Epicentre (Madison, WI). Once a large number of subclones have been sequenced and analyzed so that contigs can be identified, there remains the problem of filling in any remaining sequence gaps. Sequence gaps can be filled by subclone-directed closure, wherein particular subclones are resequenced to obtain extended reads or wherein reverse primers are designed based on the known sequence, and these are used to cover the gap. Alternatively, a form of primer walking can be performed (primer-directed closure) wherein a new forward primer is designed to anneal to the 3' end of the sequence obtained from a given subclone, and this starting point is used to extend the read. A third approach is simply to design forward and reverse primers that flank the gap and use PCR amplification of the original large-insert template to fill in the gap. Note that PCR can also be used to obtain flanking sequence of a gene or other sequence that has been cloned but is incomplete (truncated). This can be done by designing a specific primer designed to anneal near the truncated end of the sequence. The original source DNA for the library (i.e., the complex DNA isolated from the population of organisms) can then be used as a template for primer walking. 
The final stage of sequence finishing is to resolve repeats and ambiguities. Regions that contain tandem repeats or other segments that are difficult to sequence can be analyzed by deletional sequencing, which involves digesting the template DNA with a restriction enzyme, treating the products with exonuclease III and Mung Bean nuclease, re-ligating the digestion products, transforming E. coli with the ligation mix, isolating DNA from the transformant colonies, and sequencing the deletion subclones (Wilson & Mardis, 1999a). Interspersed repeats that are longer than the typical read length (i.e., greater than about 400 bp) can be resolved by resequencing with the reverse primer so that the fragment can be properly oriented within the set of contigs. This strategy can also be used to locate flanking sequence on long repeats or multiple copies of a tandem repeat. Short inverted repeats can be sequenced by finding an appropriate subclone that contains unique sequence and extending it. PCR amplification of the repeated region can also be used, followed by direct sequencing of the fragment. Long inverted repeats (greater than about 500 bp) separated by intervening sequence can be analyzed by designing PCR primers that will anneal to unique sequence on either side of the repeat. These can be used to screen the subclone library to find those that contain only one copy of the repeat and the intervening sequence.

Other small-scale sequencing ambiguities and artifacts can be resolved by various means. Compression artifacts caused by small hairpins can be resolved by using fluorescently-labeled terminator nucleotides (rather than labeled primers) or by employing a nucleotide analog, such as deazaguanine. Polymerase stops caused by DNA structures that block extension by the polymerase can be resolved by designing sequencing primers with higher melting temperatures that bind close to the problem area. Elevating the annealing temperature during the cycle sequencing reaction allows the primer to anneal to the region and allows better access by the polymerase.

11. DNA Sequence Assembly

Once a sufficient amount of DNA sequence information is obtained from a set of clones or subclones, the process of assembling the complete sequence of the insert can begin. Particularly when one is employing shotgun sequencing, this reconstruction process relies on creating ordered sets of contigs. A number of commercial software packages are available for this stage of the analysis, depending on the computer platform to be used (workstation or PC) and the number of contigs that must be simultaneously analyzed. If a very large (e.g., BAC) insert is being sequenced, any software limitations on the number of contigs that can be processed can usually be overcome by subcloning the large insert into smaller segments prior to shotgun sequencing, thereby reducing the total number of subclones/contigs that must be analyzed at any one time. There are numerous programs available for contig assembly and editing. For example, the Sequencher package from Gene Codes (Ann Arbor, MI) contains a contig assembly module that can be used to sequence entire microbial genomes. Similarly, there are assembly programs such as the GCG assembly tools (GelStart, GelMerge, GelAssemble, GelStatus, GelPicture, etc.) from the Genetics Computer Group (Madison, WI). The Phred/Phrap/Consed package for base-calling, assembly, editing and finishing is available from the University of Washington. An optimized version of Phrap designed by Southwest Parallel Software is also available for Windows-NT and hardware-accelerated computers through TimeLogic Corporation (Incline Village, NV). These programs can be used to import the sequences, remove vector sequence contamination, align the contigs, edit the contigs, indicate sequencing errors and ambiguities (e.g., frameshift errors; Fichant & Quentin, 1995), indicate the size and distribution of gaps in the sequence, and fuse the sequences into a final consensus sequence that can be stored in a database.

12. DNA Sequence Analysis

After the sequencing phase has been partially or completely finished for a given set of clones, the sequence database is analyzed by a number of algorithms, depending on the biological source of the sequence and the goal of the search (Sterky & Lundeberg, 2000). The first objective is generally to identify open reading frames (ORFs) based on the presence of a start codon and a stop codon in the same reading frame, separated by a length greater than a certain threshold. These can be checked in all six reading frames. ORFs longer than about 400 nucleotides are more likely to represent true genes (Borodovsky et al, 1999). Note that it is not always necessary to have a long, finished stretch of sequence before applying these functional genomics tools. Individual segments can be analyzed to determine whether they contain potentially useful genes. In some cases, this information can be used to decide whether to proceed with full sequencing of the insert. In addition to finding ORFs, computer searching using Hidden Markov Models and other algorithms (Krogh et al, 1994; Pedersen et al, 1996; Yada et al, 1998) can also be employed to identify useful intergenic sequences. These sequences may contain, for example, regulatory elements, such as promoters, Shine-Dalgarno sequences, binding sites for transcriptional regulators and the like.
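The ORF search described above (a start codon and a stop codon in the same frame, checked in all six reading frames, with a length threshold) can be sketched as follows. This is a minimal illustration, not the software used in the invention: it assumes ATG as the only start codon and the three standard stop codons, ignores alternative starts such as GTG and TTG, and reports coordinates on the strand that was scanned. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=400):
    """Return ORFs as (strand, frame, start, end) tuples.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon (stop included). Only ORFs of at least min_len nucleotides are
    kept; the default mirrors the ~400 nt guideline in the text.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3                    # close it at the stop codon
                    if end - start >= min_len:
                        orfs.append((strand, frame, start, end))
                    start = None
    return orfs
```

For single-pass reads, a lower `min_len` may be more appropriate, since many reads are themselves only about 400 nucleotides long.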

ORFs (and their putative protein translation products) can be further analyzed to determine their similarity to known gene sequences and hypothetical gene sequences contained in various genetic databases. Public sequence databases include GenBank, the nucleotide sequence database at EMBL (the European Molecular Biology Laboratory) and the DDBJ (the DNA Database of Japan). Other databases specializing in protein-related sequences and whole genome analysis include Swiss-Prot, SRS (Lion Bioscience Ltd., Cambridge, U.K.), TrEMBL and InterPro.

ORFs can be analyzed by a variety of homology-searching tools. Methods for predicting gene structure have been reviewed in Burset & Guigo (1996), Fickett (1996), Brutlag (1998), Burge & Karlin (1998) and Skolnick & Fetrow (2000). The relative advantages and disadvantages of several sequence comparison algorithms, such as Smith-Waterman (Smith & Waterman, 1981), FASTA (Lipman & Pearson, 1985; Pearson & Lipman, 1988; Pearson, 1996; Pearson, 1998) and BLAST (Altschul et al, 1990; Altschul et al, 1997) have been reviewed in Shpaer et al. (1996), Anderson & Brass (1998) and Brenner et al. (1998). Tools based on these sequence alignment algorithms can be used to find genes in genomic DNA sequences from both archaeabacteria and eucarya. They can also be used to find homologs to cDNA sequences. Parallelized versions of these algorithms and specialized hardware implementations have also been created to achieve faster data processing rates (Brutlag et al, 1993; Chen et al, 1993; Julich, 1995; Hughey, 1996). Commercial hardware and software systems for running these enhanced versions or other similar programs are available from Paracel (Pasadena, CA), Integrated Genomics (Chicago, IL) and TimeLogic (Incline Village, NV).

These search algorithms compare the query sequence from the plethogenomic database against a second annotated database by performing a pairwise alignment of the sequences to find the closest matches. Similarly, genomic or cDNA sequences from the plethogenomic database can also be compared against other plethogenomic sequences from the same database. The search is usually performed using translations of the DNA sequence into protein sequence to eliminate the problem of redundancy in the genetic code. The pairwise sequence similarity search algorithms are useful for genome analysis because they use an extreme value distribution to calculate the probability that the observed similarity scores could arise simply by chance. Probability analysis is preferred over simple percent identity matching, which is more prone to error. The statistical information is therefore important for discarding potentially spurious matches. Statistically significant similarity can be assumed to indicate homology between the two sequences (i.e., shared evolutionary ancestry). This homology is a strong indicator that a protein, for example, encoded by the query sequence has the same or similar function as the matching sequence in the database. In addition, other contextual information can be employed to make a functional assignment. For example, if the putative gene occurs as part of an operon or cluster of genes, it may be possible to make a reasonable inference about its function as part of a metabolic pathway or multisubunit protein (Overbeek et al., 1999; 2000).

Another set of alignment tools and databases, such as PROSITE (Hofmann et al, 1999), BLOCKS (Henikoff et al, 2000), PRINTS (Attwood et al, 2000), EMOTIF (Nevill-Manning et al, 1998), Pfam (Bateman et al, 2000) and ProDom (Corpet et al, 2000) compare sequences based on specific motifs known to be present in particular classes of proteins or DNA sequences. These targeted searches can be used to identify particular classes of protein or DNA sequences based on functional knowledge. However, care must be taken when interpreting the results of such searches due to the likelihood of false-positive hits. These search algorithms are useful for analyzing an ORF for which the similarity searching programs such as Smith-Waterman, FASTA and BLAST do not produce any significant homology.

A third class of gene-finding algorithm employs Hidden Markov Models (HMMs). HMMs were first applied to analyzing protein and DNA sequences by Krogh et al. (1994). These algorithms can be used to find genes, identify conserved motifs in alignments of similar sequences, locate insertions and deletions in raw sequence data and predict secondary structure from sequence (Eddy, 1996; Amitai, 1998). Rather than using pairwise sequence comparison, the HMM method takes into account position-specific information about groups of sequences based on known structural and functional features. Thus, for example, certain residues may be more evolutionarily conserved than others, or some sequence positions may be more likely to accept insertions or deletions. By constructing a profile for a given type of sequence, it is possible to use this profile (expressed as an HMM) to find similar sequences within a target database. For example, HMM-based software algorithms have been developed for identifying promoters and regulatory regions from eucarya and bacteria (Pedersen et al, 1996; Crowley et al, 1997; Ohler et al, 1999; Scherf et al, 2000).

HMMs can also be employed for gene-finding. Gene-finding using HMMs for archaeal and bacterial sequence is aided by the fact that Shine-Dalgarno/ribosome binding sites can be used to locate the 5' translation initiation codon for many genes (Shine & Dalgarno, 1974; Saito & Tomita, 1999; Osada et al, 1999). This feature has been exploited to develop HMM-based software with improved accuracy in identifying the 5' start codons in various genomes (Krogh et al, 1994; Hirosawa et al, 1997; Lukashin & Borodovsky, 1998; Besemer & Borodovsky, 1999; Borodovsky et al, 1999; Hannenhalli et al, 1999; Shmatkov et al, 1999; Yada et al, 1999). However, the difficulty in analyzing bacterial genomic sequence is that a fraction of the genes are overlapping, and thus some of the true gene sequences may be overlooked by the initial search because their translation initiation sites are difficult to detect. More recent versions of the GeneMark.hmm gene-finding software have been improved to correct for this problem (Shmatkov et al, 1999). Gene-finding programs for eucaryal sequence are capable of identifying coding regions, splice sites, introns and intergenic regions. Software programs such as GRAIL (Uberbacher et al, 1996), HMMGene (Krogh, 1997; 2000), Genie (Kulp et al, 1996), GENSCAN (Burge & Karlin, 1997) and VEIL (Henderson et al, 1997) have been designed for searching this type of target. In addition, GeneMark.hmm has also been adapted for searching for eucaryal genes (Besemer & Borodovsky, 1999).

Transcriptional regulatory regions in the DNA sequence can also be identified by programs that analyze conserved noncoding sequences (Crowley et al, 1997; Yada et al, 1998; Hardison, 2000; Fickett & Wasserman, 2000). This data can be used to find locations in the sequence that provide potential binding sites for transcription factors and other proteins.

The "sequence-to-function" algorithms described above can also be complemented by "sequence-to-structure-to-function" algorithms and protein-threading methods that exploit knowledge about the functional significance of protein structural domains and active site residues (Skolnick & Fetrow, 2000). This type of analysis can potentially identify functionally similar proteins that share only about 25-30% sequence identity (Yang & Honig, 2000). Databases available for this type of comparison include FSSP (Holm & Sander, 1998), CATH (Pearl et al, 2000) and SCOP (Lo Conte et al, 2000).

In addition to identifying potential genes, there are also programs that are effective at finding potential insertions and deletions (indels) in the plethogenomic sequencing data. This is a significant problem for single-pass DNA sequencing, where the error rate is relatively high. These indels create problems for finding and comparing gene sequences because they create frameshift errors when the DNA sequence is translated into protein sequence. Various programs have been developed to identify these potential errors in low-redundancy data, or to correct for them when they occur. Frameshift-tolerant sequence comparison programs, for example, have been developed to make homology searching possible even in the presence of a substantial number of frameshifts (Guan & Uberbacher, 1996; Zhang et al, 2000). The GeneMark.hmm program also identifies potential frameshifts as part of the analysis for gene-finding. Fragments containing potential errors can be re-sequenced to verify whether the read is correct.

Chimera artifacts created by the fusion of two unrelated sequences during manipulation of the DNA can be detected by analyzing the base composition of the gene (Mooers & Holmes, 2000) and the codon usage, which is highly non-random in some species (Kreitman & Comeron, 1999). Mismatches can also become apparent during similarity searching. Abrupt shifts in these parameters or characteristics may indicate the presence of such an artifact. The sequence can be re-analyzed to check for chimerism by PCR-amplifying the region from the original DNA fragment.

The most widely used family of tools, known as BLAST (Basic Local Alignment Search Tool), is provided by the National Center for Biotechnology Information (NCBI; Baxevanis et al, 1999). The program can be accessed via e-mail or as part of a network-client server over the World-Wide Web, or it can be downloaded to run on a local computer. Information on using BLAST and related programs is available at http://www.ncbi.nlm.nih.gov . Sequences for BLAST searching are entered in FASTA format. Three embodiments of BLAST can be used for queries comprising nucleotides (BLASTN, BLASTX and TBLASTX), and two can be used for protein sequences (BLASTP and TBLASTN). NCBI BLAST programs employ as a default the gapped-BLAST algorithm (Altschul et al, 1997), which is designed to allow gaps in the alignments (insertions and deletions) and thus find evolutionarily significant matches that might otherwise be missed. In addition to these initial searching algorithms, the Position-Specific Iterated BLAST (PSI-BLAST) is a profile-based program that can be used to search for homologs in protein sequences (see below). This is useful for identifying and creating families of related protein sequences for phylogenetic and functional analysis.

The BLAST search strategy employed depends on the nature and quality of the DNA sequence being used as the query. For example, single-pass sequence data is likely to contain frameshift errors, and therefore a BLASTN (nucleotide) search against the NCBI nr nucleotide database may sometimes be difficult to interpret. A follow-up BLASTX (nucleotide, six-frame translation) search against the nr protein database or the SWISS-PROT database is therefore likely to be more informative because it enhances the sensitivity toward subtle matches. This greater sensitivity arises because the amino acid code has 20 different characters, whereas the nucleotide code has only four. Thus, a match at the protein level is much less likely to occur by chance. In addition, evolutionary relationships can be inferred from the presence of conservative amino acid substitutions, wherein a functionally similar side chain has replaced a given residue. Another nucleotide-based program, TBLASTX, translates both the query and database sequences from nucleotides to protein in all six reading frames and then compares them at the protein level. This program can be used to search the expressed sequence tag (EST) databases, and is useful for analyzing single-pass genomic DNA, cDNAs and putative exon sequences. Quite often, low-complexity confounding sequences (Wootton & Federhen, 1993) are present in the query, and these may cause difficulty in interpreting the search results. Regions of low-complexity sequence can be masked (filtered) by using the SEG or DUST algorithms that are available as BLAST search options.
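The idea behind low-complexity masking can be illustrated with a sliding-window Shannon entropy filter. This sketch is deliberately simpler than, and not equivalent to, the SEG or DUST algorithms named above; the window size, the entropy threshold and the function names are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the characters in a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy falls below threshold with mask_char.

    A homopolymer run has entropy 0 and is masked; a window using all four
    bases equally has entropy 2 bits and is left untouched. Bases flanking
    a masked run can also be masked when they share a low-entropy window.
    """
    seq = seq.upper()
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = mask_char * window
    return "".join(masked)
```

A masked query keeps its original length, so alignment coordinates reported by a subsequent search still refer to positions in the unmasked sequence.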

Once the BLAST search results have been returned, the output can be evaluated by using the alignment scores and statistics provided by the software. The statistics are used to determine the degree of similarity based on the alignments. The degree of similarity can in turn be used to determine whether one is justified in inferring homology among the sequences. For local sequence comparisons, the statistical analysis used depends on whether a gapped or ungapped alignment is performed. A raw score is determined by summing the scores for each position in the alignment based on a substitution matrix (Altschul, 1991, 1993; States et al, 1991). The matrix employed depends on the expected degree of homology and the length of the query sequence. Short queries (which need to have relatively strong alignment scores in order to avoid the background noise) may benefit from a PAM (Percent Accepted Mutation) matrix (Schwartz & Dayhoff, 1978). Longer sequences with relatively weak similarity may benefit from a BLOSUM matrix (Henikoff & Henikoff, 1993). The default BLOSUM-62 matrix (calculated from alignment blocks in which sequences sharing more than 62% identity are clustered together) provides an optimal alignment for moderately diverged sequences. A search for more distant relatives may, however, require a BLOSUM-45 matrix or a combination of matrices. In alignments involving gaps, a gap penalty may also be applied to the raw score in addition to the pairwise alignment score. Local alignments without gaps are known as High Scoring Pairs (HSPs). The raw score (S) for a given alignment can then be converted into a normalized bit score (S') by using values for lambda and K, which represent the parameters used for the substitution matrix and gap penalties in a particular scoring system. Bit scores make it possible to compare alignment values calculated using different scoring systems. A higher bit score indicates a better alignment.
Ungapped alignments are assigned a P-value based on a Poisson distribution of random HSPs, i.e., the probability of finding at least one random HSP with a score equal to or greater than S. Significant scores thus have P-values approaching zero. Gapped alignments are assigned an E-value (expectation value), which indicates the number of HSPs with a score greater than or equal to S that are expected to occur by chance. A lower E-value indicates a more significant alignment. BLAST reports only E-values rather than P-values because the former are easier to compare.

Sequence comparison is used in the present invention to assign a tentative identity to a newly discovered gene or gene fragment. As large numbers of gene sequences are identified in this manner, it is useful to perform a second type of analysis, based on protein sequence comparison, to organize the sequences into functional clusters. Thus, putative gene products that appear to be enzymes can be grouped according to their functional class (hydrolase, lyase, etc.). Comparisons among members of various protein families and superfamilies can then be performed. Multiple protein alignments are created using programs such as ClustalW (Thompson et al, 1994), Hidden Markov Model algorithms (Amitai, 1998), PSI-BLAST, Pattern-Hit Initiated BLAST (PHI-BLAST) and concordance analysis (Bruccoleri et al, 1998).
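The relationships among the raw score S, the bit score S', the E-value and the P-value discussed above can be written out explicitly. The sketch below uses the standard Karlin-Altschul formulation, S' = (lambda*S - ln K) / ln 2, E = m*n*2^(-S') and P = 1 - exp(-E), where m and n are the query and database lengths. The lambda and K values in the example are illustrative placeholders only; actual values depend on the scoring system and are reported by the search software.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance HSPs scoring at least this well."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance HSP, assuming a Poisson distribution."""
    return -math.expm1(-e)   # equals 1 - exp(-e), accurate for small e


# Illustrative parameters (assumed, not taken from the text).
lam, k = 0.318, 0.134
bits = bit_score(80, lam, k)
e = e_value(bits, query_len=300, db_len=10_000_000)
print(bits, e, p_value(e))
```

For small E the two statistics are nearly identical (P is approximately E), which is why a single E-value cutoff is usually sufficient for discarding spurious matches.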

As additional sequences are discovered, it is also useful to classify the genes by their deduced evolutionary relationships (Gogarten & Olendzenski, 1999; Fitch, 2000; Galperin & Koonin, 2000). Genes can be broadly classified into groups known as homologs (wherein the genes are similar because they evolved from a common ancestor), orthologs (wherein the genes are homologs that evolved from a common ancestor but may no longer share the same function) and paralogs (wherein the genes are homologs that arose through duplication). Orthologs are typically the result of the process of speciation, and therefore comparison of orthologs can be used to compare different species. By creating families or clusters of orthologs (COGs), it is possible to make better functional predictions for various gene products (Koonin et al, 1998). Orthologs can be selected by a symmetric homology search, wherein "a query sequence from genome 1 has an ortholog in genome 2 if searching the query in genome 2 turns up the ortholog as the best match and searching the ortholog versus genome 1 turns up the query protein as the best match" (Marcotte, 2000). These comparisons can be used to construct phylogenetic trees, which provide a graphical representation of the putative evolutionary relationships among the sequences (Felsenstein, 1996) based, for example, on the conservation of residues that are important for biological activity. Software programs for analyzing COGs are available through the NCBI at http://www.ncbi.nlm.nih.gov/COG/ . Phylogeny is useful for assigning function to a new gene sequence because the known functions of its nearest neighbors can be incorporated into the evaluation. These techniques can also be used to construct metabolic pathways and understand the function of operons containing multiple genes. Predictions based on so-called nonhomology methods can also be implemented (Marcotte, 2000).
In this method, functional patterns are determined which are based on shared properties among groups of genes, and this information is used to infer the function of a given gene product, even if the functional role of a putative gene product cannot be determined by direct sequence or structural homology. Examples include the domain fusion method (Marcotte et al, 1999), wherein separate proteins found in one organism are observed to be fused together in a second organism. The fusion of the two domains in the second organism often indicates that there is a functional relationship between the two proteins in the first organism. Similarly, conservation of gene position, wherein a functional relationship between two genes can be inferred based on the fact that the two genes repeatedly appear as neighbors within the chromosomes of different organisms, can be used to identify components of operons and proteins that interact with each other (Tamames et al, 1997; Dandekar et al, 1998; Overbeek et al, 1999). Examples of the latter are members of signaling pathways or subunits of multimeric proteins. A third method analyzes phylogenetic profiles of multiple genes (co-inheritance). This technique relies on the fact that inheritance of functionally related proteins is often correlated (Pellegrini et al, 1999). The fitness advantage of inheriting a complete functional set of genes (encoding, for example, a biosynthetic pathway) may be a reason for the preservation of operons (Lawrence, 1999). Analysis of co-inheritance is complementary to the homology-based phylogenetic analysis described above. In the nonhomology case, a gene that cannot be matched in a BLAST search can be tentatively identified based on homologies between its co-inherited neighbors and similar groups from other genomes. Because sequence-based functional prediction is based on mathematical comparisons, care must be taken to avoid over- or under-prediction.
Sources of systematic error in this process have been reviewed in an on-line article by Galperin & Koonin (http://biotech.dbbm.unina.it/corsi/systematic_errors.htm).
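By way of illustration only, the symmetric homology search quoted above and the comparison of phylogenetic profiles can be sketched in a few lines of Python. The input format (a table of best matches with scores), the gene names and the presence/absence strings are assumptions made for this sketch; they are not taken from any particular search program or from the cited references.

```python
from typing import Dict, Tuple

# Assumed input format: best_hits[query] = (best_matching_subject, score),
# produced by searching every protein of one genome against the other genome.
BestHits = Dict[str, Tuple[str, float]]


def reciprocal_best_hits(g1_vs_g2: BestHits, g2_vs_g1: BestHits) -> Dict[str, str]:
    """Return putative ortholog pairs {gene_in_genome_1: gene_in_genome_2}."""
    pairs = {}
    for query, (subject, _score) in g1_vs_g2.items():
        back = g2_vs_g1.get(subject)
        # Keep the pair only if the reverse search returns the original query.
        if back is not None and back[0] == query:
            pairs[query] = subject
    return pairs


def profile_similarity(profile_a: str, profile_b: str) -> float:
    """Fraction of genomes in which two genes are both present or both absent.

    Each profile is a string of '1' (present) and '0' (absent), one character
    per genome, with the genomes listed in the same order in both profiles.
    """
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


if __name__ == "__main__":
    # Hypothetical best-match tables for two small genomes.
    g1_vs_g2 = {"a1": ("b1", 250.0), "a2": ("b2", 90.0), "a3": ("b2", 120.0)}
    g2_vs_g1 = {"b1": ("a1", 245.0), "b2": ("a3", 118.0)}
    print(reciprocal_best_hits(g1_vs_g2, g2_vs_g1))
    print(profile_similarity("110101", "110100"))
```

In this sketch, gene a2 is not paired with b2 because the reverse search from b2 returns a3 rather than a2. A high profile similarity between two genes would be treated only as a hint of a functional link, to be weighed with the other evidence described above.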

After a putative function has been assigned to an individual gene sequence, the gene can be further classified according to the biological role of the gene within the cell. These functional classes typically include amino acid biosynthesis; biosynthesis of cofactors, prosthetic groups and carriers; cell envelope proteins; proteins for cellular processes; central intermediary metabolism; energy metabolism; fatty acid and lipid metabolism; purine, pyrimidine, nucleoside and nucleotide metabolism; replication; transcription; translation; transport and binding proteins; and other processes. More generalized classes that apply to all three domains of life have also been proposed (Andrade et al., 1999). Although the genes discovered using the present invention are likely not derived from the same species, they nevertheless can be assigned to these classes because of the underlying unity of cellular biochemistry. The assembled data thus create a functional profile of the plethogenome that is analogous to the profiles previously created for single-species sequencing projects. Confirmation of functional assignments can be made by performing activity assays (described below), and this data can be used to supplement the information that is used to create functional groupings in the plethogenomic database. A gene can then be listed according to its known function (e.g., "confirmed alcohol dehydrogenase"). Creating a useful plethogenome database requires that the finished sequences, which have been functionally and structurally analyzed, be properly annotated and organized (Rouze et al, 1999). Graphical annotation of contig segments or other sequence fragments allows the database user to electronically retrieve various fragments based on a fragment ID number or other queries and simultaneously compare a number of parameters. 
The database management system contains a program that displays key features of the sequence, such as ORFs, G+C content, promoters, ribosome binding sites, regulatory regions (e.g., repressor and activator binding sites), transcription terminators, introns and exons, simple and complex repeats, sequence matches, functional motifs and homologs, structural predictions of the gene product, functional assignments, and the like. The user can simultaneously access both text and graphical data for analyzing the sequence. Thus, for example, one may search the plethogenomic database to find all genes or gene fragments that are associated with the term "beta-galactosidase." Alternatively, one may wish to enter a given sequence and find all of the most significant matches within the database. It is also advantageous to link the sequence and functional prediction information with software for displaying phylogenetic, structural and metabolic relationships. The system can be based on a relational database management system, with a set of customized tools for data entry and editing. Available annotation programs include Genotator (Harris, 2000), GALA (Bailey et al, 1998) and Imagene (Medigue et al, 1999).
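As a minimal sketch of how such a relational database might be queried by keyword, the following Python example uses the standard-library sqlite3 module. The table layout, the fragment ID numbers, the sequences and the function labels are all hypothetical and serve only to illustrate the idea of retrieving fragments associated with a term such as "beta-galactosidase."

```python
import sqlite3
from typing import List


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few made-up annotated fragments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fragment ("
        " fragment_id TEXT PRIMARY KEY,"
        " sequence TEXT NOT NULL,"
        " gc_content REAL NOT NULL,"
        " putative_function TEXT,"
        " confirmed INTEGER NOT NULL DEFAULT 0)"
    )
    # Hypothetical records: (fragment ID, sequence, putative function, confirmed flag).
    demo = [
        ("PG0001", "ATGACCATGATTACGGAT", "beta-galactosidase", 0),
        ("PG0002", "ATGGCGCGCGGTCTGTAA", "alcohol dehydrogenase", 1),
        ("PG0003", "ATGAAATTTAAAGGGTAA", "conserved hypothetical protein", 0),
    ]
    conn.executemany(
        "INSERT INTO fragment VALUES (?, ?, ?, ?, ?)",
        [(fid, seq, gc_content(seq), func, ok) for fid, seq, func, ok in demo],
    )
    return conn


def find_by_function(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return the IDs of fragments whose putative function contains the term."""
    rows = conn.execute(
        "SELECT fragment_id FROM fragment"
        " WHERE putative_function LIKE ? ORDER BY fragment_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(find_by_function(db, "beta-galactosidase"))
```

The confirmed column in this sketch corresponds to the distinction drawn above between a putative assignment and one that has been verified by an activity assay.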

13. Library Storage and Retrieval

The actual DNA clones may be organized into arrays and stored frozen in an archival library so that they can be conveniently accessed when needed. The purified DNA stocks can be stored in a sterile solution of 10 mM Tris-HCl, 1 mM EDTA (pH 8.0), although for longer term storage 10 mM EDTA and the addition of 1 M NaCl is preferred. DNA stocks can be stored at -20° C, but -80° C is preferred for long-term storage. Bacterial stocks are made by adding a final concentration of 7% DMSO or 15% glycerol to the medium, freezing the stocks, and storing them at -80° C. Yeast stocks are frozen in 20% glycerol and stored at -80° C. The DNA stocks and cell stocks can be conveniently stored by arraying them in covered 96-well or 384-well microplates or other similar holders while maintaining each clone in a separate, identifiable well (Dunham et al, 1999). This type of system allows rapid access to specific clones either manually or robotically. Clones are identified by computer-generated labels (e.g., bar codes) so that the arrays can be rapidly generated and the inventory can be easily and efficiently managed. This system also permits rapid replication of the library, if desired, and recovery of individual clones (Dunham et al, 1999). Thus, a DNA fragment from the plethogenomic library that is found to have desirable properties (based on the analysis described above) can be retrieved from the stored array for further manipulation (e.g., expression cloning). Retrieval can be performed robotically to increase the throughput.

14. Verifying the Function of Putative Genes

After a sequence has been assigned a putative function based on the analytical methods described above, it is useful to confirm its function by activity screening. In order to screen for biological activity, the sequenced gene must first be cloned and expressed in an appropriate host organism (Gross & Hauser, 1995).
Expression cloning of genes or sequences from a plethogenomic library requires the use of a heterologous host. Methods for expression cloning of genes in heterologous hosts are well known to those of skill in the art. Examples of heterologous hosts include Gram-negative bacteria, such as E. coli (Baneyx, 1999), Gram-positive bacteria (Brawner, 1994; Baltz, 1995; Binnie et al, 1997; Deb & Nath, 1999; Cereghino, J.L. & Cregg, 2000), archaea (Mai & Wiegel, 1999), yeasts (Cereghino, G.P. & Cregg, 1999) and filamentous fungi (Visser et al, 1995; Archer & Peberdy, 1997; van den Hombergh, 1997). Insect cells and cultured mammalian cells can also be used as hosts. The choice of an appropriate expression host involves many factors, including codon usage within the gene, the presence of introns, requirements for post-translational modifications or unusual cofactors, and the presence of similar genes within the host. In one embodiment, the sequenced insert is cloned directly into a donor vector, which can be used for further shuttling into various expression vectors. A donor vector is available as part of the Echo Cloning System from Invitrogen (Carlsbad, CA). In an alternative embodiment, the gene sequence can be PCR amplified from the DNA clone in which it was initially identified, using primers that are complementary to its 5' and 3' ends. The primers have additional sequence extensions that enable cloning of the final PCR product into an expression plasmid in the correct reading frame. For example, the gene can be amplified with primers that incorporate a unique restriction site on the 5' end of the gene immediately upstream of the initiation codon and a second unique restriction site on the 3' end of the gene immediately after the termination codon. After the PCR product is purified, it can be digested with the appropriate restriction endonucleases and ligated into a vector (e.g., an expression plasmid) that has been similarly cut and treated with alkaline phosphatase.
The gene is operably linked to a promoter contained on the vector. Expression of the gene is thus controlled by the heterologous promoter rather than its native promoter. The ligation product is then electroporated into competent E. coli cells and the transformants are plated on nutrient agar to form colonies. Vector DNA is extracted from individual cultures of several of the colonies and analyzed for correct insertion of the gene. Host cells containing a correct construct can then be induced to express the gene product encoded by the vector. The gene product can be accumulated in the cytosol or the periplasm, or it can be secreted into the growth medium. In addition to activity screening, an expression system can also be used for production-scale synthesis of useful biological products. Examples of plasmid expression vectors that can be employed for this purpose are pQE60 and pQE70, which can be expressed in E. coli strain M15[pREP4] and induced with isopropyl-beta-D-thiogalactopyranoside (IPTG). The plasmids and the host strain are available from Qiagen (Valencia, CA). High-throughput screening of ORFs is facilitated by PCR amplification of the insert as an "ATG-to-TAA" cassette, which can be cloned into an appropriate ATG-vector (Nishi et al, 1983; Amann & Brosius, 1985; Beernink & Tolan, 1992). Vectors can also be engineered that will accept genes having alternative initiation codons.
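The primer design step described above, in which a restriction site is placed immediately upstream of the initiation codon and a second site immediately after the termination codon, can be sketched as follows. The choice of BamH I (GGATCC) and Hind III (AAGCTT) sites, the three-base 5' clamp, the 18-base annealing length and the example ORF are assumptions made purely for illustration; an actual design would be dictated by the cloning sites and reading frame of the chosen expression vector.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def design_cloning_primers(
    orf: str,
    site_5: str = "GGATCC",   # BamH I recognition sequence (illustrative choice)
    site_3: str = "AAGCTT",   # Hind III recognition sequence (illustrative choice)
    anneal_len: int = 18,     # assumed length of the gene-specific portion
    clamp: str = "CGC",       # assumed extra bases so the enzyme can cut near the end
) -> Tuple[str, str]:
    """Return (forward, reverse) primers, both written 5' to 3'.

    The ORF is expected to run from the initiation codon through the
    termination codon ("ATG-to-TAA" style).
    """
    orf = orf.upper()
    if len(orf) < anneal_len:
        raise ValueError("ORF is shorter than the annealing length")
    # The sites must be unique, so neither may occur inside the gene itself.
    for site in (site_5, site_3):
        if site in orf or reverse_complement(site) in orf:
            raise ValueError(f"site {site} occurs within the ORF; choose another enzyme")
    forward = clamp + site_5 + orf[:anneal_len]
    reverse = clamp + site_3 + reverse_complement(orf[-anneal_len:])
    return forward, reverse


if __name__ == "__main__":
    demo_orf = "ATG" + "GCT" * 10 + "TAA"  # hypothetical 36-base ORF
    fwd, rev = design_cloning_primers(demo_orf)
    print(fwd)
    print(rev)
```

The sketch does not check melting temperature, secondary structure or reading frame relative to the vector, all of which would need attention in practice.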

For genes that are toxic to E. coli, it may be necessary to use a vector that provides extremely tight control of expression levels in the host cell, such as pBAD or P_L from Invitrogen (Carlsbad, CA). Proteins that may be difficult to express because of low solubility can be expressed via the ThioFusion system from Invitrogen (Carlsbad, CA), which fuses the gene product to the thioredoxin protein. The thioredoxin serves as a carrier protein.

A similar procedure can be used for expression cloning using yeast (Saccharomyces cerevisiae) as a host. For example, the pYES2.1 expression vector and any Gal+ or galactose-utilizing yeast host strain (e.g., INVSc1) can be used. These are available commercially from Invitrogen (Carlsbad, CA). Protocols for expression cloning in yeast are described in Ausubel et al. (2000).

In addition to plasmid-based expression, sequenced genes can also be expressed by chromosomal integration. For example, a gene can be expressed in the methylotrophic yeast Pichia pastoris by using a vector with a selectable marker that integrates the gene of interest along with the marker gene into the host chromosome. Expression of the gene can be controlled by an inducible promoter, and the recombinant protein can be produced intracellularly or secreted from the cells. Pichia expression kits are available from Invitrogen (Carlsbad, CA). Methods for integrating gene cassettes into E. coli for metabolic pathway engineering are described in LaDuca et al (1999).

15. Assaying for Biological Activity

After a sequenced gene from the plethogenomic library has been cloned and expressed, an activity assay can be used to confirm its function. For single genes, this can be done by growing a culture of the host organism containing the expressed gene, extracting the recombinant protein, purifying the protein, and performing the appropriate biological activity assay in vitro. Activity can also be determined in vivo by performing, for example, functional complementation of a gene that has been deleted from the host. A gene from the plethogenomic library can be screened against a host strain library of different deletions to find which deletion is compensated. To verify the function of an operon encoding a metabolic pathway, the entire operon can be expressed in an appropriate host. If, for example, the pathway produces a natural product, synthesis of the natural product by the heterologous host can be measured to verify the function of the operon. If, for example, the operon encodes enzymes for catabolism, then degradation of the compound by the heterologous host can be monitored. Information about operons encoding enzymes for metabolic pathways is also important because it can be used to understand the nutritional requirements of unculturable organisms. This information can in turn be used to develop strategies for cultivating such organisms.

For large numbers of sequenced genes, high-throughput screening assays can be used. For example, a group of a thousand different sequenced genes that appear to encode DNA polymerase can be simultaneously assayed to determine which clones have the highest activity on a given substrate or are resistant to thermal inactivation, solvent conditions, and the like. This information can be used to determine which gene encodes an enzyme that is appropriate for a desired application.

Assays have been developed that employ absorbance, fluorescence, luminescence and fluorescence resonance energy transfer as detection modes (Gonzalez & Negulescu, 1998). Assay formats include microplates (Major, 1998; Kolb & Neumann, 1997; Sittapalam, 1997), microfluidics (Sundberg, 2000), fluorescence activated cell-sorting, or FACS (Wittrup & Bailey, 1988; Miao et al, 1993; Plovins et al, 1994; Haugland, 1995; Chung et al, 1995), and solid-phase methods (U.S. Pat. No. 5,914,245). Substrates for enzyme assays are available from Molecular Probes (Eugene, OR), Sigma-Aldrich (St. Louis, MO), Research Organics (Cleveland, OH), Worthington Biochemical (Lakewood, NJ), Biosynth (Naperville, IL) and Beckman-Coulter (Fullerton, CA).

It is also useful to screen sequenced genes from the plethogenomic library for protein-protein interactions and binding activity. Examples of these kinds of assays include one-hybrid screens for protein-DNA interactions, two-hybrid and three-hybrid screens for protein-protein interactions (Colas & Brent, 1998; Serebriiskii et al, 1999; Hu et al, 2000), cell surface display (Francisco & Georgiou, 1994; Boder & Wittrup, 1997), display cloning (Sche et al, 1999) and phage display (Scott & Smith, 1990; Rodi & Makowski, 1999). These assays can be performed in a high throughput format (Munder & Hinnen, 1999; Emili & Cagney, 2000). Kits for performing one-, two- and three-hybrid assays are available from Clontech (Palo Alto, CA) and Invitrogen (Carlsbad, CA). Kits for cell surface display in mammalian cells and yeast are available from Invitrogen (Carlsbad, CA). A T7Select kit for phage display of complete proteins is available from Novagen (Madison, WI). In addition to gene sequences encoding enzymes and structural proteins, sequences for promoters are valuable in biotechnology for analyzing transcriptional elements, constructing new regulatable expression vectors (Goldstein & Doi, 1995) and engineering biomarkers (Jansson & De Bruijn, 1999; Diaz & Prieto, 2000). It is therefore important to be able to screen promoter sequences and determine their inducers. Methods for assaying novel promoter sequences include enzymatic reporter assays using beta-lactamase (Zlokarnik et al, 1998), beta-galactosidase, or alkaline phosphatase. The green fluorescent protein (GFP) can also be used for detection. The pBlue-TOPO kit (for beta-galactosidase) and pGlow-TOPO kit (for GFP) are available from Invitrogen (Carlsbad, CA). Kits employing beta-galactosidase, alkaline phosphatase and GFP are also available from Clontech (Palo Alto, CA). 
These systems can be used to construct expression vectors for reporter assays by cloning the promoter region from the plethogenomic library into the vector and measuring the appropriate activity of the reporter gene.

There are also some cases in which the biological role of a sequenced gene can be better understood by complementing the sequence analysis and functional analysis with structural analysis of the gene product (Elofsson & Sonnhammer, 1999). Methods to combine structural, functional and sequence information have been developed (Sauder et al, 2000; Wilson, C.A. et al, 2000). Structural genomics can supply important information about the function of a protein for the following reasons (Kim, 2000; Eisenstein et al, 2000):

(1) Although over 10,000 three-dimensional protein structures are known, they can be grouped into as few as 600 fold families; and although similarities in structure do not necessarily imply a common ancestry, most folds have only one homologous family (and function) associated with them (Orengo et al, 1994; Murzin et al, 1995). Thus, knowledge of the structure can often lead to a correct functional assignment (Todd et al, 1999). This information can be used to group genes obtained by subsequent cloning and sequencing, since structure-sequence correlations will improve predictive modeling if applied cautiously (Sali & Kuriyan, 1999).

(2) High sequence similarity is strongly correlated with having the same three-dimensional fold and similar or related biological function. However, care must be taken not to extrapolate functional information from homolog to homolog without justification.

(3) Many enzyme mechanisms have been observed multiple times in proteins having different folds (by convergent evolution). These are known as functional analogs. If a significant fraction of the possible examples of a given mechanism are known, one can examine a new structure to see whether a common motif is present (Orengo et al, 1999). Although the different groups may have low sequence similarity, members of the same structural group are likely to have similar sequences.

(4) Remote homologs, which may have low or no sequence similarity, sometimes have the same three-dimensional fold and identical or similar biological function. However, as the number of sequences with assigned structures increases, it will become easier to assign new sequences to functional families via comparative genomics (Clayton et al, 1998).

(5) Binding of small-molecule ligands to the protein often provides an important indication of its function. It can also indicate the location of a catalytic center or binding motif in a protein whose function is unknown (Shapiro & Harris, 2000).

It is becoming increasingly practical to obtain high-resolution structural information (using X-ray crystallography and NMR) in a high-throughput mode because of advances in protein isolation, crystallization, and structure determination and analysis (Eisenstein et al, 2000). These techniques make it possible to annotate a large number of proteins whose functions are only partially assignable from the sequence information.

16. Mutagenesis and Directed Evolution

Sequences obtained from the plethogenomic library may encode proteins, RNA or regulatory elements whose functions are not optimal for a desired biotechnology application. It may therefore be necessary to mutagenize the sequence to create a more useful product. Protein engineering methods for random and site-directed mutagenesis are well known in the art, and have been described in McPherson (1991). Iterative rounds of random mutagenesis, expression, activity screening and sequencing (known as directed evolution) can be used to rapidly improve the function of a gene product or sequence element (Sutherland, 2000). Directed evolution methods include sequential random mutagenesis (Chen & Arnold, 1993), recursive ensemble mutagenesis (Arkin & Youvan, 1992; Delagrave et al, 1993), exponential ensemble mutagenesis (Delagrave & Youvan, 1993) and DNA shuffling (Stemmer, 1994a,b). Expression and screening methods are described above. Note that directed evolution techniques such as Recursive/Exponential Ensemble Mutagenesis and DNA shuffling do not require that all of the sequences used in the combinatorial event be complete. Some of the sequences that contribute to the genetic diversity can be incomplete, so long as they overlap the region that is being mutagenized in the parent sequence. This means that even partial sequences obtained by use of the present invention are highly useful for directed evolution.
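The iterative cycle of random mutagenesis, screening and selection can be illustrated with a toy simulation. This is only a sketch of sequential random mutagenesis under simplifying assumptions: the "activity" of a variant is represented by a hypothetical scoring function (the number of positions matching an arbitrary target string), whereas in practice the score would come from an activity assay on the expressed variants. The mutation rate, library size and number of rounds are likewise arbitrary.

```python
import random
from typing import Callable, List, Tuple

ALPHABET = "ACGT"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base with a randomly chosen base with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else base for base in seq)


def directed_evolution(
    parent: str,
    score: Callable[[str], float],
    rounds: int = 5,
    library_size: int = 200,
    rate: float = 0.05,
    seed: int = 0,
) -> Tuple[str, List[float]]:
    """Run rounds of mutagenesis and screening; return the best variant and score history."""
    rng = random.Random(seed)
    best = parent
    history = [score(best)]
    for _ in range(rounds):
        # Mutagenesis: build a library of variants from the current best sequence.
        library = [mutate(best, rate, rng) for _ in range(library_size)]
        # Keep the parent in the pool so the best score can never decrease.
        library.append(best)
        # Screening: the top-scoring variant becomes the parent for the next round.
        best = max(library, key=score)
        history.append(score(best))
    return best, history


def similarity_to(target: str) -> Callable[[str], float]:
    """Hypothetical stand-in for an activity assay: count positions matching a target."""
    return lambda seq: float(sum(a == b for a, b in zip(seq, target)))


if __name__ == "__main__":
    variant, scores = directed_evolution("AAAAAAAAAA", similarity_to("ACGTACGTAC"))
    print(variant, scores)
```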

Large numbers of individual genes in the plethogenomic library can also be mutagenized via PCR in a high-throughput mode (Lashkari et al, 1997). This is a massively parallel technique that can be used to study the function of a given gene or create a second library of new genes with new functions.

Fragments from partial or complete sequences can be fused with other sequences to create chimeras. Engineered chimeric proteins are useful for creating new enzyme activities, for example, by fusing functional domains from different proteins. Different gene sequences from the plethogenomic library can also be used to design, assemble and express artificial polycistronic operons (Hershberger et al, 1999). Artificial operons expressed in a heterologous host are useful for creating metabolic pathways with targeted functions, such as remediation of environmental toxins or the production of antibiotics, anti-cancer agents and other pharmaceuticals. Collections of genes, gene clusters and operons can be further integrated into higher-order structures to create artificial chromosomes containing large numbers of genes. These are useful building blocks for creating minimal artificial genomes (Goffeau, 1995; Mushegian & Koonin, 1996; McFadden et al, 1997; Mushegian, 1999) and designed microorganisms.

17. Use of Plethogenomic Sequences for Designing Primers and Probes

Although the sequences obtained from the plethogenomic libraries are of themselves useful for various applications, they can also be used to obtain other genes, cDNAs, regulatory elements and the like by methods previously known in the art. A novel plethogenomic sequence can be used as a hybridization probe, or it can be used to design PCR primers for obtaining and identifying similar sequences from a clone library. By taking advantage of the enormous amount of sequence information present in the plethogenomic database, large amounts of time and labor can be saved in subsequent gene prospecting efforts. Methods for performing probe hybridization screening are described in Ausubel et al. (2000) and Wolff & Gemmill (1999). Thus, for example, a gene encoding a novel oxidase obtained by utilizing the present invention can in turn be used to create a labeled hybridization probe for screening. The hybridization probe can be used to find clones having oxidase sequences in a cDNA library or genomic expression clone library from environmental microorganisms (U.S. Pat. No. 5,958,672). This type of screening can be performed without first sequencing the clone library. The hybridization probe can also be designed to have degeneracy at some positions in order to find homologous sequences within the library. Hybridization screening can also utilize a large number of plethogenomic ORFs for screening on microarrays (Lee & Lee, 2000).

Sequence analysis of genes from the plethogenomic library can also be used to create consensus sequences for families of apparently related genes. The consensus sequence can be used to design PCR primers that will amplify all similar sequences from any source of genetic material. Methods for obtaining genomic sequences by PCR are described in Lashkari et al. (1997) and Fanning & Gibbs (1999). A source of microbial DNA, for example, can be subjected to PCR amplification with these primers. The amplified DNA is cloned into a vector and used to transform E. coli. Transformant colonies are picked, their plasmid DNA is isolated and sequenced. The sequences can then be analyzed to find previously unknown homologs. This targeted PCR method makes it possible to find even rare homologs in a microbial population with minimal amounts of cloning and sequencing.
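As a rough sketch of how a consensus for a family of related sequences might be turned into a degenerate primer, the following Python example collapses each column of an ungapped nucleotide alignment into the corresponding IUPAC ambiguity code and reports the degeneracy of the result. The aligned sequences shown are invented for illustration, and the sketch assumes the alignment has already been made and contains no gaps.

```python
from typing import Iterable

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}
# Number of distinct bases each code stands for.
DEGENERACY = {code: len(bases) for bases, code in IUPAC.items()}


def degenerate_consensus(aligned: Iterable[str]) -> str:
    """Collapse each alignment column into a single IUPAC code."""
    seqs = [s.upper() for s in aligned]
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    consensus = []
    for column in zip(*seqs):
        bases = frozenset(column)
        if bases not in IUPAC:
            raise ValueError("only ungapped A/C/G/T columns are supported")
        consensus.append(IUPAC[bases])
    return "".join(consensus)


def degeneracy(primer: str) -> int:
    """Number of distinct oligonucleotides represented by a degenerate primer."""
    total = 1
    for code in primer.upper():
        total *= DEGENERACY[code]
    return total


if __name__ == "__main__":
    family = ["ATGGCA", "ATGGCG", "ATGACA"]  # hypothetical aligned gene fragments
    primer = degenerate_consensus(family)
    print(primer, degeneracy(primer))
```

A primer with very high degeneracy is generally less specific, so in practice one would likely choose the most conserved stretch of the alignment rather than an arbitrary region.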

Probes and primers that are designed based on plethogenomic information can also be used to identify microorganisms by fluorescence in situ hybridization (FISH) and PCR. Microbial identification by PCR includes both in situ and traditional in vitro methods. Methods for identifying organisms using these methods are described in a pending KAIROS patent application (Coleman et al, 2000). These techniques are particularly well suited to identifying uncultured microorganisms.

EXAMPLE 1: Results of Plethogenome Sequencing

In order to demonstrate the feasibility of using the methods of the present invention to find novel sequences from a complex mixture of microorganisms, genomic DNA was isolated and purified from marine sediment via bead-milling using a Mo-Bio DNA isolation kit. The DNA was purified by agarose gel electrophoresis, digested with Sau3A and ligated into a Litmus 28 plasmid vector (New England BioLabs) that had been cut with BamH I (which therefore has a compatible overhang). The ligation mix was used to transform competent E. coli. A number of transformant colonies containing DNA inserts were picked and used to inoculate individual liquid cultures. Plasmid DNA was isolated from each culture and the clones were sequenced using fluorescently labeled primers directed to the flanking vector sequence. Partial or complete ORFs from each clone were analyzed by the BLASTX program to find sequences in the nr database that produce significant (high-scoring) alignments. Alignments were analyzed for percent similarity, which includes both identical amino acid matches and conservative replacements. The highest-scoring alignments include matches to known genes and to conserved hypothetical genes. When the highest-scoring matches are grouped by functional category, the list (Figure 1) indicates that the distribution of putative genes by functional category is similar to the distribution found within the sequenced genomes of single species of bacteria.
This result indicates that plethogenomic sequencing is capable of finding gene sequences that are reasonably representative of all the individual genomes contained within the microbial population. Since the microorganisms in the 'black mud' marine sediment sample are uncultivated, these results also indicate that it is possible to retrieve identifiable gene sequences from such organisms.
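The compatibility between the Sau3A digest of the genomic DNA and the BamH I cut in the vector follows from the recognition sequences: Sau3A cuts before the G of GATC, and BamH I cuts between the two G residues of GGATCC, so both leave a 5'-GATC overhang. A minimal in silico digest of the top strand of a linear sequence can be sketched as follows. The example sequence is hypothetical, and the sketch assumes a complete digest, which may not reflect the conditions actually used to prepare the library.

```python
from typing import List

SAU3A_SITE = "GATC"    # Sau3A cuts immediately 5' of the G: ^GATC
BAMHI_SITE = "GGATCC"  # BamH I cuts after the first G: G^GATCC


def sau3a_fragments(seq: str) -> List[str]:
    """Return top-strand fragments from a complete Sau3A digest of a linear sequence."""
    seq = seq.upper()
    cuts = []
    position = seq.find(SAU3A_SITE)
    while position != -1:
        cuts.append(position)  # the cut falls just before the G of GATC
        position = seq.find(SAU3A_SITE, position + 1)
    bounds = [0] + cuts + [len(seq)]
    # Skip the empty fragment that would arise if the sequence begins with a site.
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


if __name__ == "__main__":
    print(sau3a_fragments("AAGATCTTTGATCC"))
    # Every BamH I site contains a Sau3A site, which is why the overhangs match.
    print(SAU3A_SITE in BAMHI_SITE)
```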

In Figure 2, however, these same data are plotted as a histogram distribution. The striking feature of the distribution of genes by similarity match is that there are very few genes in the black mud (only 6 out of 175) that possess strong similarity (>80%) to known sequences in the database. A significant fraction show a similarity in the range of 50-69%, which makes them useful as potential homologs or orthologs to known genes. A significant number (75/175) also have less than 40% similarity to sequences in the database; some of these may be unique genes with novel functions. These results indicate that a complex population of uncultured organisms from black mud contains a large percentage of novel sequences. Taken together, these results show that the methods of the present invention can extract sequences that are representative of the uncultured plethogenome, and that many of these sequences are highly novel.
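The kind of tally that underlies such a histogram can be sketched as follows. The bin boundaries are an assumption chosen to match the ranges quoted in the text, and the similarity values passed to the function in the example are invented; only the counts 6, 75 and 175 come from the results reported above.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed bins: (lower bound inclusive, upper bound exclusive, label).
BINS: List[Tuple[float, float, str]] = [
    (0.0, 40.0, "<40%"),
    (40.0, 50.0, "40-49%"),
    (50.0, 70.0, "50-69%"),
    (70.0, 80.0, "70-79%"),
    (80.0, 101.0, ">=80%"),
]


def bin_similarities(values: Iterable[float]) -> Dict[str, int]:
    """Count how many percent-similarity values fall into each bin."""
    counts = {label: 0 for _, _, label in BINS}
    for value in values:
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"percent similarity out of range: {value}")
        for low, high, label in BINS:
            if low <= value < high:
                counts[label] += 1
                break
    return counts


def percent_of_total(count: int, total: int) -> float:
    """Express a count as a percentage of the total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    # Hypothetical best-match similarities for five clones.
    print(bin_similarities([35.0, 55.0, 62.5, 85.0, 39.9]))
    # Counts reported in the text: 6 of 175 strongly similar, 75 of 175 below 40%.
    print(percent_of_total(6, 175), percent_of_total(75, 175))
```

Expressed as percentages, the reported counts correspond to roughly 3.4% of the genes having strong similarity to known sequences and roughly 42.9% having less than 40% similarity.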

REFERENCES

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Altschul, S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.

Altschul, S.F. (1993) A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 36:290-300.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

Amann, E. & Brosius, J. (1985) "ATG vectors" for regulated high-level expression of cloned genes in Escherichia coli. Gene 40:183-190.

Amitai, M. (1998) Hidden models in biopolymers. Science 282:1436-1437.

Anderson, I. & Brass, A. (1998) Searching DNA databases for similarities to DNA sequences: when is a match significant? Bioinformatics 14:349-356.

Andrade, M.A., Ouzounis, C., Sander, C., Tamames, J. & Valencia, A. (1999) Functional classes in the three domains of life. J. Mol. Evol. 49:551-557.

Archer, D.B. & Peberdy, J.F. (1997) The molecular biology of secreted enzyme production by fungi. Crit. Rev. Biotechnol. 17:273-306.

Arkin, A.P. (1992) An application of artificial intelligence and machine vision to protein engineering. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA.

Arkin, A.P. & Youvan, D.C. (1992) An algorithm for protein engineering: simulations of recursive ensemble mutagenesis. Proc. Natl. Acad. Sci. U.S.A. 89:7811-7815.

Ashktorab, H. & Cohen, R.J. (1992) Facile isolation of genomic DNA from filamentous fungi. Biotechniques 13:198-200.

Attwood, T.K., Croning, M.D., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P., Selley, J.N. & Wright, W. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28:225-227.

Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Seidman, J.G., Smith, J.A. & Struhl, K., eds. (2000) Current Protocols in Molecular Biology, John Wiley and Sons, New York.

Bailey, L.C., Jr., Fischer, S., Schug, J., Crabtree, J., Gibson, M. & Overton, G.C. (1998) GALA: framework annotation of genomic sequence. Genome Res. 8:234-250.

Baltz, R.H. (1995) Gene expression in recombinant Streptomyces. Bioprocess Technol. 22:309-381.

Baneyx, F. (1999) Recombinant protein expression in Escherichia coli. Curr. Opin. Biotechnol. 10:411-421.

Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L. & Sonnhammer, E.L. (2000) The Pfam protein families database. Nucleic Acids Res. 28:263-266.

Baxevanis, A.D., Boguski, M.S. & Ouellette, B.F.F. (1999) Computational analysis of DNA and protein sequences. In: Genome Analysis: A Laboratory Manual, Vol. 1: Analyzing DNA, (Birren, B., Green, E.D., Klapholz, S., Myers, R.M. & Roskams, J., eds.), pp. 533-586, Cold Spring Harbor Laboratory Press, Plainview, NY.

Beernink, P.T. & Tolan, D.R. (1992) Construction of a high-copy "ATG vector" for expression in Escherichia coli. Protein Expr. Purif. 3:332-336.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P., Jovanovich, S.B., Gates, C.M., Feldman, R.A., Spudich, J.L., Spudich, E.N. & DeLong, E.F. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289:1902-1906.

Besemer, J. & Borodovsky, M. (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 27:3911-3920.

Binnie, C., Cossar, J.D. & Stewart, D.I. (1997) Heterologous biopharmaceutical protein expression in Streptomyces. Trends Biotechnol. 15:315-320.

Birren, B., Green, E.D., Klapholz, S., Myers, R.M. & Roskams, J., eds. (1999a) Genome Analysis: A Laboratory Manual, Vol. 1: Analyzing DNA, Cold Spring Harbor Laboratory Press, Plainview, NY.

Birren, B., Green, E.D., Klapholz, S., Myers, R.M., Riethman, H. & Roskams, J., eds. (1999b) Genome Analysis: A Laboratory Manual, Vol. 3: Cloning Systems, Cold Spring Harbor Laboratory Press, Plainview, NY.

Birren, B., Mancino, V. & Shizuya, H. (1999c) Bacterial artificial chromosomes. In: Genome Analysis: A Laboratory Manual, Vol. 3: Cloning Systems, (Birren, B., Green, E.D., Klapholz, S., Myers, R.M., Riethman, H. & Roskams, J., eds.), pp. 241-295, Cold Spring Harbor Laboratory Press, Plainview, NY.

Boder, E.T. & Wittrup, K.D. (1997) Yeast surface display for screening combinatorial polypeptide libraries. Nat. Biotechnol. 15:553-557.

Borodovsky, M., Hayes, W.S. & Lukashin, A.V. (1999) Statistical predictions of coding regions in prokaryotic genomes by using inhomogeneous Markov models. In: Organization of the Prokaryotic Genome (Charlebois, R.L., ed.), pp. 11-33, American Society for Microbiology, Washington, D.C.

Brawner, M.E. (1994) Advances in heterologous gene expression by Streptomyces. Curr. Opin. Biotechnol. 5:475-481.

Brenner, S.E., Chothia, C. & Hubbard, T.J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. U.S.A. 95:6073-6078.

Bruccoleri, R.E., Dougherty, T.J. & Davison, D.B. (1999) Concordance analysis of microbial genomes. Nucleic Acids Res. 26:4482-4486.

Bruce, K.D. & Hughes, M.R. (1999) Fluorescent polymerase chain reaction/restriction fragment length polymorphism monitoring of genes amplified directly from bacterial communities in soils and sediments. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 127-138, Humana Press, Totowa, NJ.

Bruce, K.D., Strike, P. & Ritchie, D.A. (1999) DNA extraction from natural environments. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 97-107, Humana Press, Totowa, NJ.

Brutlag, D.L., Dautricourt, J.P., Diaz, R., Fier, J., Moxon, B. & Stamm, R. (1993) BLAZE: an implementation of the Smith-Waterman comparison algorithm on a massively parallel computer. Comput. Chem. 17:203-207.

Brutlag, D.L. (1998) Genomics and computational molecular biology. Curr. Opin.

Microbiol. 1:340-345. Bull, A.T., Ward, A.C. & Goodfellow, M. (2000) Search and discovery strategies for biotechnology: the paradigm shift. Microbiol. Mol. Biol. Rev. 64:573-606. Burge, C. & Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94. Burge, C.B. & Karlin, S. (1998) Finding the genes in genomic DNA. Curr. Opin.

Struct. Biol. 8:346-354. Burset, M. & Guigo, R. (1996) Evaluation of gene structure prediction programs. Genomics 34:353-367. Cereghino, G.P. & Cregg, J.M. (1999) Applications of yeast in biotechnology: protein production and genetic analysis. Curr. Opin. Biotechnol. 10:422-427. Cereghino, J.L. & Cregg, J.M. (2000) Heterologous protein expression in the methylotrophic yeast Pichia pastoris. FEMS Microbiol. Rev. 24:45-66. Chen, E.S., Asano, C. & Davison, D.B. (1993) Parallel alignment of DNA sequences on the Connection Machine CM-2. Comput. Appl. Biosci. 9:375. Chen, K. & Arnold, F.H. (1993) Tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin E for catalysis in dimethylformamide. Proc. Natl. Acad. Sci. U.S.A. 90:5618-5622. Chung, J.D., Conner, S. & Stephanopoulos, G. (1995) Flow cytometric study of differentiating cultures of Bacillus subtilis. Cytometry 20:324-333. Clayton, R.A., White, O. & Fraser, C.M. (1998) Findings emerging from complete microbial genome sequences. Curr. Opin. Microbiol. 1:562-566. Colas, P. & Brent, R. (1998) The impact of two-hybrid and related methods on biotechnology. Trends Biotechnol. 16:355-363.

Coleman, W.J., Tanner, M.A., Silva, C.M., Bylina, E.J., Robles, S.J., Dilworth, M.R.,

Youvan, D.C. & Yang, M.M. (2000) Multispectral taxonomic identification. U.S.

Patent Application. Corpet, F., Servant, F., Gouzy, J., Kahn, D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res.

28:267-269. Cowell, I.G. & Austin, C.A., eds. (1997) Methods in Molecular Biology, Vol. 69: cDNA Library Protocols. Humana Press, Totowa, NJ. Crosa, J.H., Tolmasky, M.F., Actis, L.A. & Falkow, S. (1994) Plasmids. In: Methods for General and Molecular Bacteriology (Gerhardt, P., Murray, R.G.E., Wood,

W.A. & Krieg, N.R., eds), pp. 365-386. American Society for Microbiology,

Washington, D.C. Crowley, E.M., Roeder, K., Bina, M. (1997) A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol. 268:8-14. Dalboge, H. & Heldt-Hansen, H.P. (1994) A novel method for efficient expression cloning of fungal enzyme genes. Mol. Gen. Genet. 243:253-260. Dandekar, T., Snel, B., Huynen, M. & Bork, P. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23:324-328. Deb, J.K. & Nath, N. (1999) Plasmids of corynebacteria. FEMS Microbiol. Lett. 175:11-20.

Delagrave, S. & Youvan, D.C. (1993) Searching sequence space to engineer proteins: exponential ensemble mutagenesis. Biotechnology (N Y) 11:1548-1552. Delagrave, S., Goldman, E.R. & Youvan, D.C. (1993) Recursive ensemble mutagenesis. Protein Eng. 6:327-331. Diaz, E. & Prieto, M.A. (2000) Bacterial promoters triggering biodegradation of aromatic pollutants. Curr. Opin. Biotechnol. 11:467-475. Dojka, M.A., Harris, J.K. & Pace, N.R. (2000) Expanding the known diversity and environmental distribution of an uncultured phylogenetic division of bacteria.

Appl. Environ. Microbiol. 66:1617-1621. Dunham, I., Dewar, K., Kim, U.-J. & Ross, M.T. (1999) Bacterial cloning systems.

In: Genome Analysis: A Laboratory Manual, Vol. 3: Cloning Systems, (Birren, B.,

Green, E.D., Klapholz, S., Myers, R.M., Riethman, H. & Roskams, J., eds.), pp. 1-

86, Cold Spring Harbor Laboratory Press, Plainview, NY. Eddy, S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol 6:361-365. Eisenstein, E., Gilliland, G.L., Herzberg, O., Moult, J., Orban, J., Poljak, R.J.,

Banerjei, L., Richardson, D. & Howard, A.J. (2000) Biological function made crystal clear - annotation of hypothetical proteins via structural genomics. Curr.

Opin. Biotechnol. 11:25-30. Elofsson, A. & Sonnhammer, E.L. (1999) A comparison of sequence and structure protein domain families as a basis for structural genomics. Bioinformatics 15:480-

500. Emili, A.Q. & Cagney, G. (2000) Large-scale functional analysis using peptide or protein arrays. Nat. Biotechnol. 18:393-397. Fanning, S. & Gibbs, R.A. (1999) PCR in genome analysis. In: Genome Analysis: A

Laboratory Manual, Vol. 1: Analyzing DNA, (Birren, B., Green, E.D., Klapholz,

S., Myers, R.M. & Roskams, J., eds.), pp. 249-299, Cold Spring Harbor

Laboratory Press, Plainview, NY. Farrelly, V., Rainey, F.A. & Stackebrandt, E. (1995) Effect of genome size and rrn gene copy number on PCR amplification of 16S rRNA genes from a mixture of bacterial species. Appl. Environ. Microbiol. 61:2798-2801. Favello, A., Hillier, L. & Wilson, R.K. (1995) Genomic DNA sequencing methods.

Methods Cell Biol. 48:551-569. Felsenstein, J. (1996) Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266:418-427.

Fichant, G.A. & Quentin, Y. (1995) A frameshift error detection algorithm for DNA sequencing projects. Nucleic Acids Res. 23:2900-2908. Fickett, J.W. (1996) Finding genes by computer: the state of the art. Trends Genet.

12:316-320. Fickett, J.W. & Wasserman, W.W. (2000) Discovery and modeling of transcriptional regulatory regions. Curr. Opin. Biotechnol. 11:19-24. Fitch, W.M. (2000) Homology: a personal view on some of the problems. Trends

Genet. 16:227-231. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., et al

(1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512. Fletcher, M. (1999) Biofilms and biocorrosion. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 704-714, ASM Press, Washington, D.C.

Fogel, G.B., Collins, C.R., Li, J. & Brunk, C.F. (1999) Prokaryotic genome size and

SSU rDNA copy number: estimation of microbial relative abundance from a mixed population. Microb. Ecol. 38:93-113. Francisco, J.A. & Georgiou, G. (1994) The expression of recombinant proteins on the external surface of Escherichia coli. Biotechnological applications. Ann. N. Y.

Acad. Sci. 745:372-382. Fraser, C.M., Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A., Fleischmann,

R.D., Bult, C.J., Kerlavage, A.R., Sutton, G., Kelley, J.M., et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397-403. Frostegard, A., Courtois, S., Ramisse, V., Clerc, S., Bernillon, D., Le Gall, F.,

Jeannin, P., Nesme, X. & Simonet, P. (1999) Quantification of bias related to the extraction of DNA directly from soils. Appl. Environ. Microbiol. 65:5409-5420. Galperin, M.Y. & Koonin, E.V. (2000) Who's your neighbor? New computational approaches for functional genomics. Nat. Biotechnol. 18:609-613. Goffeau, A. (1995) Life with 482 genes. Science 270:445-446. Gogarten, J.P. & Olendzenski, L. (1999) Orthologs, paralogs and genome comparisons. Curr. Opin. Genet. Dev. 9:630-636. Goldstein, M.A. & Doi, R.H. (1995) Prokaryotic promoters in biotechnology.

Biotechnol. Annu. Rev. 1:105-128. Gonzalez, J.E. & Negulescu, P.A. (1998) Intracellular detection assays for high-throughput screening. Curr. Opin. Biotechnol. 9:624-631. Good, I.J. (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40:237-264. Gross, G. & Hauser, H. (1995) Heterologous expression as a tool for gene identification and analysis. J. Biotechnol. 41:91-110.

Guan, X. & Uberbacher, E.C. (1996) Alignments of DNA and protein sequences containing frameshift errors. Comput. Appl. Biosci. 12:31-40. Guerrero, R., Pedrós-Alió, C., Schmidt, T.M. & Mas, J. (1985) A survey of buoyant density of microorganisms in pure cultures and natural samples. Microbiologia 1:53-65.

Hall-Stoodley, L., Rayner, J.C., Stoodley, P. & Lappin-Scott (1999) Establishment of experimental biofilms using the modified Robbins device and flow cells. In:

Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria

(Edwards, C., ed.), pp. 307-319, Humana Press, Totowa, NJ. Handelsman, J., Rondon, M.R., Brady, S.F., Clardy, J. & Goodman, R.M. (1998)

Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem. Biol. 5:R245-249. Hannenhalli, S.S., Hayes, W.S., Hatzigeorgiou, A.G. & Fickett, J.W. (1999) Bacterial start site prediction. Nucleic Acids Res. 27:3577-3582. Hardison, R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 16:369-372. Harris, N. (2000) Annotating sequence data using Genotator. Methods Mol. Biol.

132:259-276. Haugland, R.P. (1995) Detecting enzymatic activity in cells using fluorogenic substrates. Biotechnic Histochem. 70:243-251.

Head, I.M. (1999) Recovery and analysis of ribosomal RNA sequences from the environment. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 139-174, Humana Press, Totowa, NJ. Henderson, J., Salzberg, S. & Fasman, K.H. (1997) Finding genes in DNA with a hidden Markov model. J. Comput. Biol. 4:127-141.

Henikoff, S. & Henikoff, J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins 17:49-61. Henikoff, J.G., Greene, E.A., Pietrokovski, S. & Henikoff, S. (2000) Increased coverage of protein families with the blocks database servers. Nucleic Acids Res. 28:228-230.

Hentschel, U., Steinert, M. & Hacker, J. (2000) Common molecular mechanisms of symbiosis and pathogenesis. Trends Microbiol. 8:226-231. Hershberger, C., Best, E.A., Sterner, J., Frye, C., Menke, M. & Verderber, E.L.

(1999) Design and assembly of polycistronic operons in Escherichia coli. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. &

Davies, J.E., eds), pp. 539-550, ASM Press, Washington, D.C. Hirosawa, M., Sazuka, T. & Yada, T. (1997) Prediction of translation initiation sites on the genome of Synechocystis sp. strain PCC6803 by hidden Markov model.

DNA Res. 4:179-184. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res. 27:215-219. Hoheisel, J.D., Nizetic, D. & Lehrach, H. (1989) Control of partial digestion combining the enzymes dam methylase and MboI. Nucleic Acids Res. 17:9571-9582.

Hohn, B. (1979) In vitro packaging of lambda and cosmid DNA. Methods Enzymol.

68:299-309. Holben, W.E. & Harris, D. (1995) DNA-based monitoring of total bacterial community structure in environmental samples. Mol. Ecol. 4:627-631. Holm, L. & Sander, C. (1998) Touring protein fold space with Dali/FSSP. Nucleic

Acids Res. 26:316-319. Holt, J.G. & Krieg, N.R. (1994) Enrichment and isolation. In: Methods for General and Molecular Bacteriology (Gerhardt, P., Murray, R.G.E., Wood, W.A. & Krieg,

N.R., eds), pp. 179-215. American Society for Microbiology, Washington, D.C. Hu, J.C., Kornacker, M.G. & Hochschild, A. (2000) Escherichia coli one- and two-hybrid systems for the analysis and identification of protein-protein interactions.

Methods 20:80-94. Hughes, D.S., Felbeck, H. & Stein, J.L. (1997) A histidine protein kinase homolog from the endosymbiont of the hydrothermal vent tubeworm Riftia pachyptila. Appl. Environ. Microbiol. 63:3494-3498.

Hughey, R. (1996) Parallel hardware for sequence comparison and alignment.

Comput. Appl. Biosci. 12:473-479. Hunter-Cevera, J.C. (1998) The value of microbial diversity. Curr. Opin. Microbiol

1:278-285. Hunter-Cevera, J.C. & Belt, A. (1999) Isolation of cultures. In: Manual of Industrial

Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp.

3-20, ASM Press, Washington, D.C. Ioannou, P.A. & de Jong, P.J. (1996) Construction of bacterial artificial chromosome libraries using the modified P1 (PAC) system. In: Current Protocols in Human Genetics (Dracopoli, N.C., Haines, J.L., Korf, B.R., Moir, D.T., Morton, C.C.,

Seidman, C.E., Seidman, J.G. & Smith, D.R., eds), pp. 5.15.1-5.15.24. John

Wiley & Sons, New York. Ioannou, P.A., Amemiya, C.T., Garnes, J., Kroisel, P.M., Shizuya, H., Chen, C.,

Batzer, M.A. & de Jong, P.J. (1994) A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nat. Genet. 6:84-89.

Jansson, J.K. & de Bruijn, F.J. (1999) Biomarkers and bioreporters to track microbes and monitor their gene expression. In: Manual of Industrial Microbiology and

Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 651-665, ASM

Press, Washington, D.C. Jaspers, E. & Overmann, J. (1997) Separation of bacterial cells by isoelectric focusing, a new method for analysis of complex microbial communities. Appl. Environ.

Microbiol. 63:3176-3181. Julich, A. (1995) Implementations of BLAST for parallel computers. Comput. Appl.

Biosci. 11:3-6. Kim, S.H. (2000) Structural genomics of microbes: an objective. Curr. Opin. Struct.

Biol 10:380-383. Kim, U.J., Shizuya, H., de Jong, P.J., Birren, B. & Simon, M.I. (1992) Stable propagation of cosmid sized human DNA inserts in an F factor based vector.

Nucleic Acids Res. 20:1083-1085. Kolb, A. & Neumann, K. (1997) Beyond the 96-well microplate: instruments and assay methods for the 384-well format. J. Biomol. Screening 2:103-109. Koonin, E.V., Tatusov, R.L. & Galperin, M.Y. (1998) Beyond complete genomes: from sequence to structure and function. Curr. Opin. Struct. Biol. 8:355-363. Kreitman, M. & Comeron, J.M. (1999) Coding sequence evolution. Curr. Opin.

Genet. Dev. 9:637-641. Krogh, A., Mian, I.S. & Haussler, D. (1994) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22:4768-4778. Krogh, A. (1997) Two methods for improving performance of an HMM and their application for gene finding. Ismb 5:179-186.

Krogh, A. (2000) Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res. 10:523-528. Kubitschek, H.E. (1987) Buoyant density variation during the cell cycle in microorganisms. Crit. Rev. Microbiol. 14:73-97. Kulp, D., Haussler, D., Reese, M.G. & Eeckman, F.H. (1996) A generalized hidden

Markov model for the recognition of human genes in DNA. Ismb 4:134-142. LaDuca, R.J., Berry, A., Chotani, G., Dodge, T.C., Gosset, G., Valle, F., Liao, J.C.,

Yong-Xiao, J. & Power, S.D. (1999) Metabolic pathway engineering of aromatic compounds. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 605-615, ASM Press, Washington, D.C.

Lam, J.S. & Mutharia, L. (1994) Antigen-antibody reactions. In: Methods for General and Molecular Bacteriology (Gerhardt, P., Murray, R.G.E., Wood, W.A. & Krieg,

N.R., eds), pp. 104-132. American Society for Microbiology, Washington, D.C. Lashkari, D.A., McCusker, J.H. & Davis, R.W. (1997) Whole genome analysis: experimental access to all genome sequenced segments through larger-scale efficient oligonucleotide synthesis and PCR. Proc. Natl. Acad. Sci. U.S.A.

94:8945-8947. Lawrence, J. (1999) Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr. Opin. Genet. Dev. 9:642-648. Lee, P.S. & Lee, K.H. (2000) Genomic analysis. Curr. Opin. Biotechnol. 11:171-175. Lipman, D.J. & Pearson, W.R. (1985) Rapid and sensitive protein similarity searches.

Science 227:1435-1441. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G. & Chothia, C.

(2000) SCOP: a structural classification of proteins database. Nucleic Acids Res. 28:257-259.

Lukashin, A.V. & Borodovsky, M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26:1107-1115. Mai, N. & Wiegel, J. (1999) Recombinant DNA applications in thermophiles. In:

Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 511-519, ASM Press, Washington, D.C.

Mauchline, M., Davis, T.O. & Minton, N.P. (1999) Clostridia. In: Manual of

Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies,

J.E., eds), pp. 475-490, ASM Press, Washington, D.C. Major, J. (1998) Challenges and opportunities in high throughput screening: implications for new technologies. J. Biomol. Screening 3:13-17.

Marcotte, E.M. (2000) Computational genetics: finding protein function by nonhomology methods. Curr. Opin. Struct. Biol. 10:359-365. Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. & Eisenberg, D. (1999)

A combined algorithm for genome-wide prediction of protein function. Nature 402:83-86. McFadden, G.I., Gilson, P.R., Douglas, S.E., Cavalier-Smith, T., Hofmann, C.J. &

Maier, U.G. (1997) Bonsai genomics: sequencing the smallest eukaryotic genomes. Trends Genet. 13:46-49. Medigue, C., Rechenmann, F., Danchin, A. & Viari, A. (1999) Imagene: an integrated computer environment for sequence annotation and analysis. Bioinformatics 15:2-

15. Miao, F., Todd, P. & Kompala, D.S. (1993) A single-cell assay of beta-galactosidase in recombinant Escherichia coli using flow cytometry. Biotechnol. Bioeng.

42:708-715. Miller, D.N., Bryant, J.E., Madsen, E.L. & Ghiorse, W.C. (1999) Evaluation and optimization of DNA extraction and purification procedures for soil and sediment samples. Appl. Environ. Microbiol. 65:4715-4724. Millikan, D.S., Felbeck, H. & Stein, J.L. (1999) Identification and characterization of a flagellin gene from the endosymbiont of the hydrothermal vent tubeworm Riftia pachyptila. Appl. Environ. Microbiol. 65:3129-3133.

Mooers, A.O. & Holmes, E.C. (2000) The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15:365-369. Moran, N.A. & Baumann, P. (2000) Bacterial endosymbionts in animals. Curr. Opin.

Microbiol. 3:270-275. Moran, N.A. & Wernegreen, J.J. (2000) Lifestyle evolution in symbiotic bacteria: insights from genomics. Trends Ecol. Evol. 15:321-326. Munder, T. & Hinnen, A. (1999) Yeast cells as tools for target-oriented screening.

Appl. Microbiol. Biotechnol. 52:311-320. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536-540. Mushegian, A. (1999) The minimal genome concept. Curr. Opin. Genet. Dev. 9:709-

714. Mushegian, A.R. & Koonin, E.V. (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl. Acad. Sci. U.S.A.

93:10268-10273. Nevill-Manning, C.G., Wu, T.D. & Brutlag, D.L. (1998) Highly specific protein sequence motifs for genome analysis. Proc. Natl. Acad. Sci. U.S.A. 95:5865-5871. Nishi, T., Sato, M., Saito, A., Itoh, S., Takaoka, C. & Taniguchi, T. (1983) Construction and application of a novel plasmid "ATG vector" for direct expression of foreign genes in Escherichia coli. DNA 2:265-273. Nusslein, K. & Tiedje, J.M. (1998) Characterization of the dominant and rare members of a young Hawaiian soil bacterial community with small-subunit ribosomal DNA amplified from DNA fractionated on the basis of its guanine and cytosine composition. Appl. Environ. Microbiol. 64:1283-1289.

Ohi, S. & Short, J. (1980) A general procedure for preparing messenger RNA from eukaryotic cells without using phenol. J. Appl. Microbiol 2:398-413. Ohler, U., Harbeck, S., Niemann, H., Noth, E. & Reese, M.G. (1999) Interpolated

Markov chains for eukaryotic promoter recognition. Bioinformatics 15:362-369. Orengo, C.A., Jones, D.T. & Thornton, J.M. (1994) Protein superfamilies and domain superfolds. Nature 372:631-634. Orengo, C.A., Todd, A.E. & Thornton, J.M. (1999) From protein structure to function. Curr. Opin. Struct. Biol. 9:374-382. Osada, Y., Saito, R. & Tomita, M. (1999) Analysis of base-pairing potentials between 16S rRNA and 5' UTR for translation initiation in various prokaryotes. Bioinformatics 15:578-581.

Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D. & Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. U.S.A.

96:2896-2901.

Overbeek, R., Larsen, N., Pusch, G.D., D'Souza, M., Selkov, E. Jr., Kyrpides, N., Fonstein, M., Maltsev, N. & Selkov, E. (2000) WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res. 28:123-125.

Pace, N.R. (1997) A molecular view of microbial diversity and the biosphere. Science 276:734-740.

Paerl, H.W. & Pinckney, J.L. (1996) A mini-review of microbial consortia: their roles in aquatic production and biogeochemical cycling. Microb. Ecol. 31:225-247. Pearl, F., Todd, A.E., Bray, J.E., Martin, A.C., Salamov, A.A., Suwa, M., Swindells, M.B., Thornton, J.M. & Orengo, C.A. (2000) Using the CATH domain database to assign structures and functions to the genome sequences. Biochem. Soc. Trans. 28:269-275.

Pedersen, A.G., Baldi, P., Brunak, S. & Chauvin, Y. (1996) Characterization of prokaryotic and eukaryotic promoters using hidden Markov models. Ismb 4:182-191.

Pearson, W.R. (1996) Effective protein sequence comparison. Methods Enzymol 266:227-258.

Pearson, W.R. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol 276:71-84.

Pearson, W.R. & Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448.

Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. & Yeates, T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U.S.A. 96:4285-4288.

Pimbley, D.W., Patel, P.D. & Robertson, C.J. (1999) Dielectrophoresis. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 35-53, Humana Press, Totowa, NJ.

Plovins, A., Alvarez, A.M., Ibanez, M., Molina, M. & Nombela, C. (1994) Use of fluorescein-di-beta-D-galactopyranoside (FDG) and C12-FDG as substrates for beta-galactosidase detection by flow cytometry in animal, bacterial, and yeast cells. Appl. Environ. Microbiol 60:4638-4641.

Porter, J. (1999a) Flow cytometry and cell sorting: rapid analysis and separation of individual bacterial cells from natural environments. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 55-73, Humana Press, Totowa, NJ. Porter, J. (1999b) Specific detection, viability assessment, and macromolecular staining of bacteria for flow cytometry. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 237-249, Humana Press, Totowa, NJ.

Porter, J. & Pickup, R. (1999) Magnetic particle-based separation techniques for monitoring bacteria from natural environments. In: Methods in Biotechnology, Vol. 12: Environmental Monitoring of Bacteria (Edwards, C., ed.), pp. 75-96,

Humana Press, Totowa, NJ. Putzer, K.P., Buchholz, L.A., Lidstrom, M.E. & Remsen, C.C. (1991) Separation of methanotrophic bacteria by using Percoll and its application to isolation of mixed and pure cultures. Appl. Environ. Microbiol. 57:3656-3659.

Randall, T.A. & Judelson, H.S. (1999) Construction of a bacterial artificial chromosome library of Phytophthora infestans and transformation of clones into

P. infestans. Fungal Genet. Biol. 28:160-170. Riethman, H., Birren, B. & Gnirke, A. (1999) Preparation, manipulation, and mapping of HMW DNA. In: Genome Analysis: A Laboratory Manual, Vol. 1: Analyzing

DNA, (Birren, B., Green, E.D., Klapholz, S., Myers, R.M. & Roskams, J., eds), pp. 83-248, Cold Spring Harbor Laboratory Press, Plainview, NY. Rinker, K.D., Han, C.J., Adams, M.W.W. & Kelly, R.M. (1999) Cultivation of hyperthermophilic and extremely thermoacidophilic microorganisms. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies,

J.E., eds), pp. 119-136, ASM Press, Washington, D.C. Rodi, D.J. & Makowski, L. (1999) Phage-display technology-finding a needle in a vast molecular haystack. Curr. Opin. Biotechnol. 10:87-93. Rondon, M.R., Goodman, R.M. & Handelsman, J. (1999) The Earth's bounty: assessing and accessing soil microbial diversity. Trends Biotechnol. 17:403-409.

Rondon, M.R., August, P.R., Bettermann, A.D., Brady, S.F., Grossman, T.H., Liles,

M.R., Loiacono, K.A., Lynch, B.A., MacNeil, I.A., Minor, C., Tiong, C.L.,

Gilman, M., Osburne, M.S., Clardy, J., Handelsman, J. & Goodman, R.M. (2000)

Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl. Environ. Microbiol 66:2541-2547.

Rosteck, P.R., DeHoff, B.S., Norris, F.H. & Rockey, P.K. (1999) Bacterial genomics and genome informatics. In: Manual of Industrial Microbiology and

Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 493-500, ASM

Press, Washington, D.C. Rouze, P., Pavy, N. & Rombauts, S. (1999) Genome annotation: which tools do we have for it? Curr. Opin. Plant Biol. 2:90-95. Saito, R. & Tomita, M. (1999) Computer analyses of complete genomes suggest that some archaebacteria employ both eukaryotic and eubacterial mechanisms in translation initiation. Gene 238:79-83. Sali, A. & Kuriyan, J. (1999) Challenges at the frontiers of structural biology. Trends

Cell Biol. 9:M20-24. Sambrook, J., Fritsch, E.F. & Maniatis, T., eds. (1989) Molecular Cloning: A

Laboratory Manual, 2nd ed., Cold Spring Harbor Laboratory Press, Plainview,

NY. Sauder, J.M., Arthur, J.W. & Dunbrack, R.L., Jr. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 40:6-22. Sche, P.P., McKenzie, K.M., White, J.D. & Austin, D.J. (1999) Display cloning: functional identification of natural product receptors using cDNA-phage display. Chem. Biol. 6:707-716.

Scherf, M., Klingenhoff, A. & Werner, T. (2000) Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol. 297:599-606. Schwartz, R.M. & Dayhoff, M.O. (1978) Matrices for detecting distant relationships.

In: Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, (Dayhoff, M.O., ed), pp. 353-358, National Biomedical Research Foundation, Washington, D.C. Scott, J.K. & Smith, G.P. (1990) Searching for peptide ligands with an epitope library. Science 249:386-390.

Serebriiskii, I., Khazak, V. & Golemis, E.A. (1999) A two-hybrid dual bait system to discriminate specificity of protein interactions. J. Biol. Chem. 274:17080-17087. Shapiro, L. & Harris, T. (2000) Finding function through structural genomics. Curr.

Opin. Biotechnol. 11:31-35. Shine, J. & Dalgarno, L. (1974) The 3' terminal sequence of E. coli 16S ribosomal

RNA: complementarity to nonsense triplets and ribosome binding sites. Proc.

Natl. Acad. Sci. U.S.A. 71:1342-1346. Shizuya, H., Birren, B., Kim, U.J., Mancino, V., Slepak, T., Tachiiri, Y. & Simon, M.

(1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl. Acad. Sci.

USA. 89:8794-8797. Shmatkov, A.M., Melikyan, A.A., Chernousko, F.L. & Borodovsky, M. (1999)

Finding prokaryotic genes by the 'frame-by-frame' algorithm: targeting gene starts and overlapping genes. Bioinformatics 15:874-886. Short, J.M. (1997) Recombinant approaches for accessing biodiversity. Nat.

Biotechnol. 15:1322-1323. Sittampalam, G., Kahl, S. & Janzen, W. (1997) High throughput screening: advances in assay technologies. Curr. Opin. Chem. Biol. 1:384-391. Skolnick, J. & Fetrow, J.S. (2000) From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol.

18:34-39. Smith, T.F. & Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197. Shpaer, E.G., Robinson, M., Yee, D., Candlin, J.D., Mines, R. & Hunkapiller, T. (1996) Sensitivity and selectivity in protein similarity searches: a comparison of

Smith-Waterman in hardware to BLAST and FASTA. Genomics 38:179-191. States, D.J., Gish, W. & Altschul, S.F. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods 3:66-70. Stein, J.L., Marsh, T.L., Wu, K.Y., Shizuya, H. & DeLong, E.F. (1996) Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J. Bacteriol.

178:591-599. Stemmer, W.P. (1994a) Rapid evolution of a protein in vitro by DNA shuffling.

Nature 370:389-391. Stemmer, W.P. (1994b) DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution. Proc. Natl. Acad. Sci. U.S.A.

91:10747-10751. Sterky, F. & Lundeberg, J. (2000) Sequence analysis of genes and genomes. J.

Biotechnol. 76:1-31. Sternberg, N. (1990) Bacteriophage P1 cloning system for the isolation, amplification, and recovery of DNA fragments as large as 100 kilobase pairs. Proc. Natl. Acad.

Sci. USA. 87:103-107. Sundberg, S.A. (2000) High-throughput and ultra-high-throughput screening: solution- and cell-based approaches. Curr. Opin. Biotechnol. 11:47-53. Sutherland, J.D. (2000) Evolutionary optimization of enzymes. Curr. Opin. Chem.

Biol. 4:263-269. Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. (1997) Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol. 44:66-73. Thompson, J.D., Higgins, D.G. & Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids

Res. 22:4673-4680. Tiedje, J.M. & Stein, J.L. (1999) Microbial biodiversity: strategies for its recovery. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. &

Davies, J.E., eds), pp. 682-692, ASM Press, Washington, D.C. Todd, A.E., Orengo, C.A. & Thornton, J.M. (1999) Evolution of protein function, from a structural perspective. Curr. Opin. Chem. Biol. 3:548-556. Torsvik, V., Daae, F.L., Sandaa, R.A. & Ovreas, L. (1998) Novel techniques for analysing microbial diversity in natural and perturbed environments. J.

Biotechnol. 64:53-62. Uberbacher, E.C., Xu, Y. & Mural, R.J. (1996) Discovering and understanding genes in human DNA sequence using GRAIL. Methods Enzymol. 266:259-281. van den Hombergh, J.P., van de Vondervoort, P.J., Fraissinet-Tachet, L. & Visser, J. (1997) Aspergillus as a host for heterologous protein production: the problem of proteases. Trends Biotechnol. 15:256-263. Vergin, K.L., Urbach, E., Stein, J.L., DeLong, E.F., Lanoil, B.D. & Giovannoni, S.J.

(1998) Screening of a fosmid library of marine environmental genomic DNA fragments reveals four clones related to members of the order Planctomycetales. Appl. Environ. Microbiol. 64:3075-3078.

Visser, J., Bussink, H.J. & Witteveen, C. (1995) Gene expression in filamentous fungi. Expression of pectinases and glucose oxidase in Aspergillus niger.

Bioprocess Technol. 22:241-308. Wagner-Dobler, I., Bennasar, A., Vancanneyt, M., Strompl, C., Brammer, I., Eichner, C., Grammel, I. & Moore, E.R. (1998) Microcosm enrichment of biphenyl-degrading microbial communities from soils and sediments. Appl. Environ.

Microbiol. 64:3014-3022. Watts, J.E.M., Huddleston-Anderson, A.S. & Wellington, E.M.H. (1999)

Bioprospecting. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 631-641, ASM Press, Washington,

D.C. Wery, J., Verdoes, J.C. & van Ooyen, A.J.J. (1999) Genetics of non-Saccharomyces industrial yeasts. In: Manual of Industrial Microbiology and Biotechnology, 2nd ed., (Demain, A.L. & Davies, J.E., eds), pp. 447-459, ASM Press, Washington, D.C.

Wilson, R.K. & Mardis, E.R. (1999a) Fluorescence-based DNA sequencing. In: Genome Analysis: A Laboratory Manual, Vol. 1: Analyzing DNA, (Birren, B., Green, E.D., Klapholz, S., Myers, R.M. & Roskams, J., eds), pp. 301-395, Cold Spring Harbor Laboratory Press, Plainview, NY.

Wilson, R.K. & Mardis, E.R. (1999b) Shotgun sequencing. In: Genome Analysis: A Laboratory Manual, Vol. 1: Analyzing DNA, (Birren, B., Green, E.D., Klapholz, S., Myers, R.M. & Roskams, J., eds), pp. 397-454, Cold Spring Harbor Laboratory Press, Plainview, NY.

Wilson, C.A., Kreychman, J. & Gerstein, M. (2000) Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297:233-249.

Wittrup, K.D. & Bailey, J.E. (1988) A single-cell assay of beta-galactosidase activity in Saccharomyces cerevisiae. Cytometry 9:394-404.

Wolff, R. & Gemmill, R. (1999) Purifying and analyzing genomic DNA. In: Genome Analysis: A Laboratory Manual, Vol. 1: Analyzing DNA, (Birren, B., Green, E.D., Klapholz, S., Myers, R.M. & Roskams, J., eds), pp. 1-81, Cold Spring Harbor Laboratory Press, Plainview, NY.

Wootton & Federhen (1993) Statistics of low complexity in amino acid sequences and sequence databases. Comput. Chem. 17:149-163.

Yada, T., Totoki, Y., Ishikawa, M., Asai, K. & Nakai, K. (1998) Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences. Bioinformatics 14:317-325.

Yada, T., Nakao, M., Totoki, Y. & Nakai, K. (1999) Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics 15:987-993.

Yang, A.S. & Honig, B. (2000) An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol. 301:679-689.

Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.

Zhou, J., Bruns, M.A. & Tiedje, J.M. (1996) DNA recovery from soils of diverse composition. Appl. Environ. Microbiol. 62:316-322.

Zlokarnik, G., Negulescu, P.A., Knapp, T.E., Mere, L., Burres, N., Feng, L., Whitney, M., Roemer, K. & Tsien, R.Y. (1998) Quantitation of transcription and clonal selection of single living cells with beta-lactamase as reporter. Science 279:84-88.

Claims

1. A method for determining a functional nucleic acid sequence from a population of microorganisms, comprising the steps of:

(a) extracting nucleic acid fragments from a complex population of microorganisms;

(b) cloning the nucleic acid fragments into a vector, and without prior characterization of the nucleic acid fragments by activity screening of an expression product, by hybridization screening with an oligonucleotide probe, or by screening by amplification of a fragment containing a known sequence;

(c) sequencing a plurality of the cloned nucleic acid fragments; and

(d) analyzing the sequences to identify a functional region.

2. The method of claim 1, wherein the nucleic acid is DNA.

3. The method of claim 1, wherein the nucleic acid is RNA.

4. The method of claim 3, wherein the RNA is mRNA.

5. The method of claim 2 or claim 3, wherein the identified functional region is a gene.

6. The method of claim 1, wherein the nucleic acid fragments extracted from the complex population of microorganisms comprise less than 1 mole percent nucleic acid from the domain Eucarya.

7. The method of claim 1, further comprising the step of calculating a diversity index for the complex population of microorganisms, prior to carrying out claim 1 step (b).

8. The method of claim 7, wherein the diversity index is a measure of species entropy.

9. The method of claim 8, wherein the species entropy is at least 6 bits.

10. The method of claim 1, wherein the average size of the nucleic acid fragments is at least 60 kb.

11. The method of claim 5, wherein the identified gene sequence is incomplete, and additional sequence of the gene is obtained by at least one repetition of claim 1 steps (b) and (c).

12. The method of claim 5, further comprising the steps of:

(e) inserting a nucleic acid fragment encoding the identified gene into an expression vector to generate an expression construct;

(f) transforming a host organism with the expression construct; and

(g) expressing the gene product encoded by the expression construct.

13. The method of claim 12, further comprising the step of: (h) assaying the gene product for biological activity.

14. The method of claim 1, wherein the cloned nucleic acid fragments are organized in an archival library to facilitate retrieval of a desired clone.

15. The method of claim 4, wherein the complex population of microorganisms comprises microorganisms from the domain Eucarya, and step (b) comprises reverse transcribing the mRNA into cDNA and cloning the cDNA into the vector.

16. The method of claim 15, further comprising the step of calculating a diversity index for the complex population of microorganisms.

17. The method of claim 16, wherein the diversity index is a measure of species entropy.

18. The method of claim 17, wherein the species entropy is at least 6 bits.

19. The method of claim 15, wherein the sequence of the identified functional region is incomplete, and additional sequence of the identified functional region is obtained by at least one repetition of claim 1 steps (b) and (c).

20. The method of claim 15, further comprising the steps of:

(e) inserting a nucleic acid fragment encoding the identified functional region into an expression vector to generate an expression construct;

(f) transforming a host organism with the expression construct; and

(g) expressing the identified functional region encoded by the expression construct.

21. The method of claim 20, further comprising the step of: (h) assaying the expressed identified functional region for biological activity.

22. The method of claim 15, wherein the cloned nucleic acid fragments are organized in an archival library to facilitate retrieval of a desired clone.
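
Illustrative note (not part of the claims): the claims reciting a diversity index expressed as a species entropy of at least 6 bits can be made concrete with a short calculation. The sketch below assumes that "species entropy" means the Shannon entropy of the relative species abundances, taken with a base-2 logarithm so that the result is in bits; that reading, the function names, and the example abundance counts are all assumptions made for illustration and are not taken from the claims.

```python
import math


def species_entropy_bits(counts):
    """Shannon entropy H = -sum(p_i * log2(p_i)) of species abundances, in bits.

    `counts` holds one non-negative abundance value per species
    (hypothetical input format, e.g. clone counts per species).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one observation is required")
    # Species with zero abundance contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def meets_entropy_threshold(counts, min_bits=6.0):
    """Return True if the population reaches the given entropy threshold."""
    return species_entropy_bits(counts) >= min_bits


# Hypothetical population: 64 species at equal abundance gives log2(64) = 6 bits.
even_population = [10] * 64
print(species_entropy_bits(even_population))      # 6.0
print(meets_entropy_threshold(even_population))   # True

# Hypothetical population dominated by one species: same 64 species, lower entropy.
skewed_population = [1000] + [1] * 63
print(meets_entropy_threshold(skewed_population))  # False
```

Under this reading, a 6-bit threshold corresponds to at least 64 equally abundant species, and an unevenly distributed population needs more species than that to reach the same value.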