WO1998026096A1

WO1998026096A1 - Method for rapid gap closure

Info

Publication number: WO1998026096A1
Application number: PCT/US1997/022655
Authority: WO
Inventors: Jeffrey L. Mooney; Christine Marie Debouck
Original assignee: Smithkline Beecham Corporation
Priority date: 1996-12-12
Filing date: 1997-12-11
Publication date: 1998-06-18
Also published as: JP2001505780A; EP0958381A1

Abstract

Methods for gap closure and contiguous sequence assembly of an organism using high density grids from a series of random genomic libraries prepared from isolated DNA of the selected organism are provided.

Description

METHOD FOR RAPID GAP CLOSURE

RELATED APPLICATIONS

This application claims benefit of U.S. Provisional Application 60/032,555, filed December 12, 1996.

FIELD OF THE INVENTION

The present invention relates to a simple and cost effective method for the closure of gaps generated during whole genome random sequencing through the use of high-density arrays, or grids, of genomic libraries. The method is also useful for the rapid isolation of full length genomic sequences obtained from partial gene sequences. Such genomic sequences will therefore comprise full length coding regions. The method of the present invention is also useful for the confirmation of computer generated assemblies, that is, the confirmation of the order or assembly of contiguous sequences. This method provides an alternative to chromosome walking.

BACKGROUND OF THE INVENTION

Advances in DNA sequencing technology and computational methodologies have drastically altered the rate at which genome sequencing projects can proceed. In particular, Fleischmann et al. (Science, 1995, 269:496-512) have demonstrated the ability to perform whole genome random sequencing and assembly of a complete living organism in just a few months. The strategy of this random "shotgun" approach predicts a certain number of sequence gaps between the assembled contiguous sequences based upon the insert size of the sequenced library, the genome size of the organism, the number of clones sequenced, and the total number of bases sequenced. These gaps correspond to regions within or between genes that have not been identified through random sequencing (see, e.g., E.S. Lander and M.S. Waterman, "Genomic mapping by fingerprinting random clones: a mathematical analysis", Genomics, (1988), 2: 231-239). Ordering of contiguous sequences and completion of gap closures is typically performed by genomic PCR based on primers designed against every combination of physical gap ends. However, this procedure is both very time consuming and labor intensive and can take 5-10 times longer than the random sequencing itself. Accordingly, there exists a need for a more efficient method of ordering of contiguous sequences and completing gap closure in whole genome random sequencing of any organism.
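The two quantities discussed above, the number of gaps predicted for a shotgun project and the number of PCR reactions needed to test every combination of gap ends, can be illustrated with a minimal Python sketch. It assumes the simplest form of the Lander-Waterman estimate (expected contigs = N * exp(-c * sigma), with c the fold coverage and sigma = 1 - T/L for a minimum detectable overlap T). The function names and the example numbers are hypothetical and are not taken from this specification.

```python
import math


def expected_contigs(num_reads, read_len, genome_len, min_overlap=0):
    """Lander-Waterman estimate of the number of contigs (islands).

    For a circular genome with two or more contigs, the number of gaps
    is the same as the number of contigs.
    """
    coverage = num_reads * read_len / genome_len      # fold coverage c
    sigma = 1.0 - min_overlap / read_len              # usable fraction of a read
    return num_reads * math.exp(-coverage * sigma)


def pairwise_pcr_reactions(num_contigs):
    """Reactions needed to try every pairing of ends from different contigs."""
    ends = 2 * num_contigs
    # All unordered pairs of ends, minus the pair formed by the two ends
    # of one and the same contig.
    return ends * (ends - 1) // 2 - num_contigs


if __name__ == "__main__":
    # Hypothetical project: 2 Mb genome, 500-base reads.
    for n in (8000, 16000, 24000):
        print(n, round(expected_contigs(n, 500, 2_000_000), 1))
    print(pairwise_pcr_reactions(50))
```

The quadratic growth of `pairwise_pcr_reactions` with the number of contigs is one way to see why the combinatorial PCR approach becomes labor intensive.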

SUMMARY OF THE INVENTION

In one aspect, the invention provides a method for high throughput sequencing and gap closure and contiguous sequence assembly or clone ordering in genome sequencing projects using high density grids of genomic libraries. The method involves constructing a series of random genomic libraries for a selected organism and preparing a grid for each library, each grid having a surface on which is immobilized at predefined regions on said surface a plurality of clones derived from the libraries. This gap closure provides complete sequence for partial genes or genes not found in the original random sequencing step. After contiguous sequence assembly of the partial gene sequences, hybridization probes are generated which correspond to the non-overlapping ends of the known sequence. The probes are then hybridized to a gridded library to identify nucleotide sequences which span the non-overlapping ends of the assembled nucleotide sequence.

Other objects, features, advantages and aspects of the present invention will become apparent to those of skill in the art from the following description. It should be understood, however, that the following description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only. Various changes and modifications within the spirit and scope of the disclosed invention will become readily apparent to those skilled in the art from reading the following description and from reading the other parts of the present disclosure.

DETAILED DESCRIPTION OF THE INVENTION

Identification, sequencing and characterization of genes is a major goal of modern scientific research. By identifying genes, determining their sequences and characterizing their biological function, it is possible to employ recombinant technology to produce large quantities of valuable gene products, e.g., proteins and peptides. Additionally, knowledge of gene sequences can provide a key to diagnosis, prognosis and treatment in a variety of disease states in plants and animals which are characterized by inappropriate expression and/or repression of selected genes or by the influence of external factors, e.g., carcinogens or teratogens, on gene function. Methods now exist for whole genome random sequencing and assembly of a complete living organism. However, the methods required to complete genome sequence gap closures and ordering of contiguous sequences are both time consuming and labor intensive.

The present invention provides a method for high throughput gap closure and contiguous sequence assembly which is useful in whole genome random sequencing. This method uses a plurality of high density grids prepared from genomic libraries of a selected organism to perform sequence reactions, gap closure and contiguous sequence assembly. The method of the present invention provides a more rapid and cost effective means to sequence the whole genome of an organism. The method also provides rapid means to obtain the full length genomic sequence for genes for which only a partial sequence is obtained through random sequencing.

I. Definitions

Several words and phrases used throughout this specification are defined as follows:

As used herein, the term "gene" refers to the genomic nucleotide sequence from which a cDNA sequence is derived. The term gene classically refers to the genomic sequence, which upon processing, can produce different cDNAs, e.g., by splicing events. However, for ease of reading, any full-length counterpart cDNA sequence will also be referred to by shorthand herein as gene. "Isolated" means altered "by the hand of man" from its natural state; i.e., that, if it occurs in nature, it has been changed or removed from its original environment, or both. For example, a naturally occurring polynucleotide or a polypeptide naturally present in a living animal in its natural state is not "isolated," but the same polynucleotide or polypeptide separated from the coexisting materials of its natural state is "isolated", as the term is employed herein. For example, with respect to polynucleotides, the term isolated means that it is separated from the chromosome and cell in which it naturally occurs.

By "organism" it is meant to include any living organism such as, but not limited to, bacterium (including both gram negative and gram positive species), viruses, lower eukaryotic cells such as fungi, yeast and molds, simple multicellular organisms (e.g., slime molds) and complex multicellular organisms including man.

As used here, the term "solid support" refers to any known substrate which is useful for the immobilization of a plurality of defined materials derived from a genomic library by any available method to enable detectable hybridization of the immobilized polynucleotide sequences with other polynucleotides in the sample. Among a number of available solid supports, one desirable example is the supports described in International Patent Application No. WO91/07087, published May 30, 1991. Examples of other useful supports include, but are not limited to, nitrocellulose, nylon, glass, silica and Pall BIODYNE C. It is also anticipated that improvements yet to be made to conventional solid supports may also be employed in this invention.

The term "grid" means any generally two-dimensional structure on a solid support to which the defined materials of a genomic library are attached or immobilized.

As used herein, the term "predefined region" refers to a localized area on a surface of a solid support on which is immobilized one or multiple copies of a particular clone and which enables hybridization of that clone at the position, if hybridization of that clone to a sample polynucleotide occurs. By "immobilized", it is meant to refer to the attachment of the genes to the solid support. Means of immobilization are known and conventional to those of skill in the art, and may depend on the type of support being used.

II. Compositions of the Invention

The present invention is based upon the use of high density arrays of genomic libraries as a means for high throughput gap closure, including full length genomic sequences and contiguous sequence assembly in genomic sequencing.

A. Preparation of genomic libraries

For this analysis a series of random genomic libraries for a selected organism are prepared, each library comprising fractionated and ligated genomic DNA of a selected insert size range. To construct these libraries, genomic DNA from the selected organism is first isolated using standard procedures for molecular biology such as those disclosed by Sambrook et al., MOLECULAR CLONING, A LABORATORY MANUAL, 2nd Ed.; Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 1989. The isolated DNA is then randomly sheared (e.g., by sonication, partial restriction endonuclease digestion, partial DNase digestion, etc.), modified and ligated in a plasmid or phage vector in accordance with the procedures described by Fleischmann et al., Science, 1995, 269:496-512. For example, in one embodiment, a small insert library is prepared by fractionating and ligating the genomic DNA into a plasmid or phage based vector so that the average insert size is between 1.0 and 5.0 kb. Examples of vectors useful in constructing this small insert library include, but are not limited to, pBluescript, Lambda ZAP II (Stratagene, La Jolla, CA), pUC19 and M13 mp18/19 (New England BIOLABS, Beverly, MA).

In this embodiment, a large insert library is also constructed in a cosmid vector so that the insert size averages between 10 and 100 kb. Examples of cosmid vectors useful in constructing this large insert library include, but are not limited to, pLorist, pWE15 (Stratagene, La Jolla, CA), lambda DASH2 (Stratagene) and lambda GEM-12 (Promega, Madison, WI). In addition, a medium insert library having an insert size range averaging approximately 5.0-10 kb can also be prepared by fractionating the genomic DNA into a cosmid vector.
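The insert size ranges above, together with a target depth of coverage, fix roughly how many transformants each library needs. The following Python sketch shows that arithmetic under simple assumptions: coverage is taken as (number of clones x average insert size) / genome size, and clones are stored in 96-well microtiter plates. The genome size used in the example and the function names are hypothetical illustrations, not values from this specification.

```python
import math


def clones_for_coverage(genome_len_bp, avg_insert_bp, fold_coverage):
    """Number of clones whose combined insert length gives the requested
    depth of coverage of the genome."""
    return math.ceil(fold_coverage * genome_len_bp / avg_insert_bp)


def plates_needed(num_clones, wells_per_plate=96):
    """Microtiter plates required to hold one clone per well."""
    return math.ceil(num_clones / wells_per_plate)


if __name__ == "__main__":
    genome = 2_800_000  # hypothetical bacterial genome size in bp
    libraries = {"small": 5_000, "medium": 10_000, "large": 40_000}
    for name, insert in libraries.items():
        for fold in (4, 6):
            n = clones_for_coverage(genome, insert, fold)
            print(name, fold, n, plates_needed(n))
```

Because larger inserts reach a given coverage with fewer clones, a large insert library occupies far less grid space than a small insert library of the same depth.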

The final ligation products are electroporated into a bacterium such as E. coli so that the final number of transformants for each library reaches 4 to 6 fold depth of coverage as predicted by the Lander-Waterman theory (E.S. Lander and M.S. Waterman, Genomics, (1988), 2:231-239). Transformants are placed into microtiter plates and grown overnight under standard in vitro culture conditions recognized as normal for the particular bacterium. Replicate plates are made and used for either gridding or template production for sequencing.

B. Preparation of Grids

Each of the insert libraries is gridded onto solid supports, with the nucleic acid of each transformed host cell (containing the insert and amplified by PCR), or clone, being placed onto a predefined position within a high density array.
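Placing each clone at a predefined position means that a positive hybridization signal at a grid coordinate can be traced back to one microtiter well. The Python sketch below shows one possible bookkeeping scheme. The layout is purely an assumption for illustration: 96-well plates (8 x 12) are interleaved so that each well position becomes a 4 x 4 block of spots, one spot per plate, giving 16 plates per grid. The specification does not prescribe this or any other layout.

```python
PLATE_ROWS, PLATE_COLS = 8, 12   # assumed 96-well microtiter plate
BLOCK = 4                        # assumed 4 x 4 spots per well position


def grid_position(plate, well_row, well_col):
    """Map (plate, well) to a (row, column) coordinate on the grid."""
    if not (0 <= plate < BLOCK * BLOCK
            and 0 <= well_row < PLATE_ROWS
            and 0 <= well_col < PLATE_COLS):
        raise ValueError("plate or well index out of range")
    # The plate number selects the offset inside the 4 x 4 block.
    return (well_row * BLOCK + plate // BLOCK,
            well_col * BLOCK + plate % BLOCK)


def clone_at(grid_row, grid_col):
    """Inverse mapping: recover (plate, well_row, well_col) from a spot."""
    plate = (grid_row % BLOCK) * BLOCK + (grid_col % BLOCK)
    return plate, grid_row // BLOCK, grid_col // BLOCK


if __name__ == "__main__":
    spot = grid_position(plate=5, well_row=2, well_col=7)
    print(spot, clone_at(*spot))
```

Any one-to-one mapping would serve equally well; the point is only that the mapping is fixed before hybridization so that signals can be decoded afterwards.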

Numerous conventional methods are employed for immobilizing the clones to surfaces of a variety of solid supports. See, e.g., Affinity Techniques, Enzyme Purification: Part B, Methods in Enzymology, Vol. 34, ed. W.B. Jakoby, M. Wilchek, Acad. Press, NY (1971); Immobilized Biochemicals and Affinity Chromatography, Advances in Experimental Medicine and Biology, Vol. 42, ed. R. Dunlap, Plenum Press, NY (1974); U.S. Patent No. 4,762,881; U.S. Patent No. 4,542,102; European Patent Publication No. 391,608 (October 10, 1990); or U.S. Patent No. 4,992,127 (November 21, 1989).

One desirable method for attaching the clones to a solid support is described in International Application No. PCT/US90/06607 (published May 30, 1991).

Briefly, this method involves forming predefined regions on a surface of a solid support, where the predefined regions are capable of immobilizing the clones. The method makes use of binding substrates attached to the surface which enable selective activation of the predefined regions. Upon activation, these binding substances become capable of binding and immobilizing the clones derived from the genomic library. Any of the known solid substrates suitable for binding nucleotide sequences at predefined regions on the surface thereof for hybridization and methods for attaching nucleotide sequences thereto may be employed by one of skill in the art according to the invention. Similarly, known conventional methods for making hybridization of the immobilized clones detectable, e.g., fluorescence, radioactivity, photoactivation, energy transfer dyes, biotinylation, solid state circuitry, and the like may be used in this invention.

III. The Methods of the Invention

The present invention employs the compositions described above in methods for gap closure following high throughput sequencing, for confirmation of computer generated assemblies, and to generate full length coding sequences.

A. High throughput Sequencing

In the present invention, a small insert library can be used for high throughput sequencing. In a preferred embodiment, sequencing reactions are performed until 2 to 3 fold depth of coverage is obtained by standard sequencing methods (see, e.g., Fleischmann et al., Science, (1995), 269:496). The resulting sequences are assembled using standard computational programs such as the GELMERGE Assembler available from Genetics Computer Group, Inc. of Madison, WI (UWGCG). The Assembler identifies non-overlapping regions, or gaps, between the assemblies. These gaps occur because of the statistical consequence of incomplete sampling and non-randomness in the collection of sequence fragments due to deletion of clones from the library.

B.1 Gap Closure

In one embodiment for gap closure, primer pairs are prepared from the non-overlapping end of each assembled contiguous sequence and used in individual PCR reactions against total genomic DNA isolated from the selected organism. Three types of probes can be prepared from the known non-overlapping ends of assembled sequences: (1) a clone which contains the end of an assembled sequence can be labeled (i.e., can use the clone directly as a probe); (2) a PCR fragment can be amplified using a primer pair from the end of the assembled sequence; and (3) an oligonucleotide from the end of the assembled sequence can be used as a hybridization probe. Products of these reactions are detectably labeled, preferably with a radioisotope or a fluorescent label, and used as probes in separate hybridization reactions against the gridded libraries.
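As a minimal sketch of the first computational step behind probe types (2) and (3), the Python function below extracts the terminal region of each end of every assembled contiguous sequence, which is the region from which a primer pair or an oligonucleotide would then be chosen. The default length of 250 bases mirrors the fragment size used in Example 1; the function name, the dictionary layout and the example sequence are assumptions made for illustration only, and primer design itself is not attempted.

```python
def end_probe_regions(contigs, probe_len=250):
    """Return the terminal sequence at each end of every contig.

    contigs: dict mapping contig name -> assembled sequence (str)
    Result:  dict mapping (contig name, "left" | "right") -> end sequence
    """
    if probe_len <= 0:
        raise ValueError("probe_len must be positive")
    regions = {}
    for name, seq in contigs.items():
        n = min(probe_len, len(seq))   # short contigs yield the whole sequence
        regions[(name, "left")] = seq[:n]
        regions[(name, "right")] = seq[len(seq) - n:]
    return regions


if __name__ == "__main__":
    # Hypothetical contig used only to exercise the function.
    demo = {"contig_1": "ACGT" * 200}
    for key, region in end_probe_regions(demo).items():
        print(key, len(region))
```

Each returned region corresponds to one probe and hence to one separate hybridization reaction against the gridded libraries.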

B.2 Isolation of clones containing full length coding region.

Adjacent contiguous sequences are identified by positively hybridizing probes and confirmed by sequence analysis and computational assembly. In addition, when probes from two separate gap ends hybridize to the same gridded clone, the region spanning a gap is immediately identified. Primer walks are performed until the entire clone is finished or the gap is closed. New probes and hybridizations are performed as needed.

C. Contiguous Sequence Assembly or Ordering

As will be recognized by those skilled in the art upon reading this disclosure, arrayed clones can be placed in approximate order prior to sequencing. For example, in one embodiment, 100 kb cosmid inserts can be hybridized against a small insert library to identify contiguous clones prior to sequencing. More precise ordering of clones is performed by digesting genomic DNA with rare cutting enzymes (such as NotI, PacI or SpeI for E. coli; the appropriate enzyme varies depending on the base composition of the organism, and preferably cuts once every 10-100 kb), or by partial digestion with a more frequent cutter (preferably cutting once every 10-100 kb), separating and isolating the large DNA fragments resulting from the digestion by pulsed field gel electrophoresis, and hybridizing the isolated fragments against the grids.
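Hybridization results of the kind described in sections B and C can be tabulated as a set of positive grid clones per probe. Two probes that share a positive clone are then candidates for adjacency, and when the probes come from ends of different contiguous sequences the shared clone is a candidate gap-spanning clone. The Python sketch below illustrates this bookkeeping for contig-end probes. The data layout, the clone identifiers and the function name are hypothetical, and a real analysis would still confirm each link by sequencing as the text describes.

```python
from itertools import combinations


def gap_spanning_clones(hits):
    """Find gridded clones hit by probes from ends of different contigs.

    hits: dict mapping (contig name, end label) -> set of positive clone IDs
    Returns a dict mapping (end_a, end_b) -> set of shared clone IDs.
    """
    links = {}
    # Sorting the probes makes the orientation of each reported pair stable.
    for (end_a, clones_a), (end_b, clones_b) in combinations(sorted(hits.items()), 2):
        if end_a[0] == end_b[0]:
            continue               # two ends of one contig do not define a gap
        shared = clones_a & clones_b
        if shared:
            links[(end_a, end_b)] = shared
    return links


if __name__ == "__main__":
    # Hypothetical hybridization results.
    demo_hits = {
        ("contig_1", "right"): {"grid_012", "grid_040"},
        ("contig_2", "left"): {"grid_040"},
        ("contig_3", "left"): {"grid_007"},
    }
    for pair, clones in gap_spanning_clones(demo_hits).items():
        print(pair, sorted(clones))
```

Chaining the reported links gives an approximate order of the contiguous sequences, which is the information the combinatorial PCR approach would otherwise have to supply.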

D. Other Methods of the Invention

As is obvious to one of skill in the art upon reading this disclosure, the compositions and methods of the invention may also be used for other similar purposes. For example, in one embodiment, first pass sequencing of a large insert library is performed using both universal forward and reverse primers directed against the cosmid vector sequence. Universal primers, such as those directed against M13 sequences, are well known in the art. A small number of sequencing runs will begin to array these large insert clones into contiguous sequences using standard computational approaches such as the GELMERGE Assembler. Other computer-assisted assembly of nucleotide data is known in the art (see, e.g., "Automated DNA Sequencing and Analysis", Adams et al. eds., Academic Press (1995) (E.W. Myers presents a discussion of software systems for fragment assembly in Chapter 32); and J.D. Kececioglu et al., "Combinatorial Algorithms for DNA Sequence Assembly", Algorithmica, 13:7-51 (1995)). Appropriate clones (i.e., non-overlapping with one another) can then be selected for use as probes against a small insert library grid. Identified clones are sequenced using the universal primers and the sequence data is assembled. In this embodiment, the expense and time for designing and constructing primers for secondary walks in a large insert library are eliminated.

The following nonlimiting examples are provided to further illustrate the present invention.

EXAMPLES

Example 1:

A genomic library of Staphylococcus aureus was constructed in the vector, lambda ZAP II. The average insert size of this library is approximately 5 kb.

Primary sequence analysis was performed on approximately 10,000 clones from this library using universal forward and reverse primers. The derived DNA sequences were then assembled using an assembler functionally similar to the GELMERGE Assembler, except that the preprocessing steps of annotating and grouping occurred prior to final assembly.

Annotating is a process of identifying regions of partial gene sequences and putative gene assemblies that may cause two unlike sequences to be considered alike or otherwise produce inaccurate results in the grouping or assembly processes. These regions are likely to interfere with the correctness of the subsequent grouping and assembly steps of the method of the invention. The remaining unidentified regions are considered to contain useful information (for the purpose of grouping and assembly) and are used in the subsequent grouping and assembly steps. Regions identified as likely to interfere with subsequent steps are ignored in those steps. Examples of regions which can be identified in the annotating step are sequences from species other than the one of interest and nucleic acids or DNA from cellular structures such as ribosomes and mitochondria. Low information regions which occur multiple times in a sequence such as polynucleotide runs, simple tandem repeats (STRs) and genomic repetitive sequences, such as Alu, can also be identified. Further, ambiguous regions and regions resulting from experimental error or artifacts are also identified.

After annotation, the annotated partial gene sequences are grouped with other annotated partial gene sequences. The step of grouping the annotated partial gene sequences is based on determining association relationships between an annotated partial gene sequence and other existing annotated partial gene sequences, some of which may be components of previously identified putative gene assemblies. This process begins by ignoring the annotated regions from the partial gene sequences and previously identified putative gene assemblies. The partial gene sequences, with the annotated regions ignored, are then compared with the consensus sequence of previously identified putative gene assemblies, with the annotated regions ignored. The partial gene sequences are also compared with each other, ignoring the annotated regions. The partial gene sequences are placed in groups based on the similarities found in these comparisons. Resulting groups thereby contain a collection of partial gene sequences that would appear to belong together, i.e., the grouping step produces a group of partial gene sequences that are thought to assemble together.

For each group from the previous step, the positional ordering of the partial gene sequences relative to one another is taken as a group on the assumption that all partial gene sequences belong to the same putative gene assembly. One of the consequences of the ordering may be that more than one putative gene assembly may result should the ordering step uncover inconsistencies among the group of partial gene sequences.
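The annotating and grouping steps can be sketched in a few lines of Python. This is an illustration under strong simplifying assumptions and is not the assembler used in the Example: only one kind of low information region (single-base polynucleotide runs) is annotated, similarity is reduced to sharing at least one exact k-mer outside the annotated regions, only one strand is considered, and groups are formed with a union-find structure. All names, thresholds and sequences are hypothetical.

```python
import re


def mask_runs(seq, min_run=8):
    """Annotate polynucleotide runs of at least min_run bases by replacing
    them with 'N' so that later comparisons ignore them."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group()), seq)


def informative_kmers(seq, k):
    """All k-mers of seq that contain no masked position."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if "N" not in seq[i:i + k]}


def group_sequences(seqs, k=12, min_run=8):
    """Group sequences that share an informative k-mer (union-find)."""
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    first_owner = {}                        # k-mer -> first sequence seen with it
    for name, seq in seqs.items():
        for kmer in informative_kmers(mask_runs(seq, min_run), k):
            if kmer in first_owner:
                parent[find(name)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = name

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    core = "ACGTTGCAAGCTAGCT"
    demo = {
        "s1": "GGATCC" + core,
        "s2": core + "TTGACA",
        "s3": "CATGCATGACTGAC" + "A" * 12,
        "s4": "A" * 12 + "GTCAGTCAGGTACC",
    }
    print(group_sequences(demo))                # runs annotated and ignored
    print(group_sequences(demo, min_run=100))   # effectively no annotation
```

In the demonstration data, s3 and s4 have nothing in common except a run of A residues. With annotation they stay in separate groups; without it they are grouped together, which is the kind of false association the annotating step is meant to prevent.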

Once positional ordering has been completed for each putative gene assembly, a consensus sequence is generated by a variety of contig assembly programs known to those of ordinary skill in the art. Exemplary is GELMERGE (UWGCG, Madison, WI). Upon completion of the annotating, grouping, and assembling steps, the putative gene assemblies are stored in a database. Putative gene assemblies may be characterized on the basis of their sequence, structure, biological function or other related characteristics. Once categorized, the database can be expanded with information linked to the putative gene assemblies regarding their potential biological function, structure or other characteristics. For example, one method of characterizing putative gene assemblies is by homology to other known genes. Shared homology of a putative gene assembly with a known gene may indicate a similar biological role or function.

Another exemplary method of characterizing putative gene assemblies is on the basis of known sequence motifs. Certain sequence patterns are known to code for regions of proteins having specific biological characteristics such as signal sequences, transmembrane domains, SH2 domains, etc.

After the assembly process, non-overlapping contiguous sequence ends were identified. A representative number of the sequenced clones from this library were gridded onto predefined locations on the surface of a solid support to prepare a high density array.

Two clones, 2AU2142 and 2AU0165, were observed to contain non-overlapping sequences that reside in the contiguous sequence ends. PCR primers were designed to amplify a 250 bp fragment for each of these non-overlapping contiguous sequence ends and used in separate PCR reactions against Staphylococcus aureus genomic DNA. The PCR products were purified, radiolabeled, and used in separate hybridization reactions against the high density grid. The PCR fragment for clone 2AU2142 was found to hybridize against three clones. Restriction digest analysis demonstrated that these three clones possessed overlapping sequences. Sequence analysis confirmed this result and further showed an additional 450 bp extending from the end of the contiguous sequence. The 2AU0165 derived fragment was shown to hybridize against two clones. Restriction fragment and sequence analysis also confirmed the overlapping nature of these clones.

It will be apparent to those skilled in the art that various modifications can be made to the present method for analyzing partial gene sequences without departing from the scope or spirit of the invention.

Claims

WHAT IS CLAIMED IS:

1. A method for gap closure of a genome following high throughput sequencing and contiguous sequence assembly of a selected organism comprising:

(a) preparing a first random genomic library of a selected organism;

(b) sequencing the library of step (a) to identify contiguous nucleotide sequences;

(c) assembling the contiguous sequences and identification of non-overlapping ends, or gaps;

(d) generating labeled hybridization probes corresponding to the non-overlapping ends of an assembled nucleotide sequence from the selected organism;

(e) preparing a second random genomic library of the selected organism;

(f) providing a grid comprising a surface on which is immobilized at predefined regions on said surface a plurality of defined materials derived from the genomic library of step (e);

(g) hybridizing the probes of step (d) against the grid to identify sequences spanning the non-overlapping ends;

(h) determining the nucleotide sequence of those sequences spanning the non-overlapping ends; and

(i) if the sequences spanning the non-overlapping ends do not close the gaps between the assembled contiguous sequences, prepare additional labeled hybridization probes corresponding to newly identified non-overlapping ends identified by step (h), and repeat the hybridization step (g) against the grid of step (f) with said additional probes, and determining the nucleotide sequence of said sequences spanning the non-overlapping ends.

2. The method of claim 1 wherein the first random genomic library comprises a small insert library.

3. The method of claim 1 wherein the second random genomic library comprises a medium insert library.

4. The method of claim 1 wherein the second random genomic library comprises a large insert library.

5. The method of claim 1, wherein step (g) comprises:

(g1) hybridizing the probes of step (d) against the grid to identify sequences spanning the non-overlapping ends;

(g2) detectably labeling the sequences identified in step (g1) which span the non-overlapping ends;

(g3) providing a second grid comprising a surface on which is immobilized at predefined regions on said surface a plurality of defined materials derived from the genomic library of step (a); and

(g4) hybridizing the sequences identified in step (g2) against the second grid to identify sequences spanning the non-overlapping ends;

and step (h) comprises:

(h1) determining the nucleotide sequence of those sequences identified in step (g4) which span the non-overlapping ends; and

(h2) assembling the contiguous sequences.

6. The method of claim 1 wherein the hybridization probe of step (d) comprises material from genomic library (a) having a non-overlapping end.

7. The method of claim 1 wherein the hybridization probe of step (d) comprises a PCR amplified fragment from primer pairs to the end of the assembled sequence having a non-overlapping end.

8. The method of claim 1 wherein the hybridization probe of step (d) comprises an oligonucleotide that corresponds to the end of the assembled sequence having a non-overlapping end.

9. An isolated full length gene sequence identified by the method of claim 1.

10. An isolated protein produced by expression of the gene sequence of claim 9.

11. A therapeutic compound capable of modulating expression of the gene sequence of claim 9 for use in the treatment of a disease associated with growth of an organism.

12. A therapeutic compound capable of modulating activity of a protein of claim 10 for use in the treatment of a disease associated with growth of an organism.

13. A diagnostic composition useful for the diagnosis of a disease or infection comprising a reagent capable of detectably targeting a gene sequence of claim 9.