WO2002016642A1

WO2002016642A1 - Methods and compositions for directed molecular evolution using dna-end modification

Info

Publication number: WO2002016642A1
Application number: PCT/US2001/025788
Authority: WO
Inventors: Vaughn Smider
Original assignee: Integrigen, Inc.
Priority date: 2000-08-18
Filing date: 2001-08-17
Publication date: 2002-02-28
Also published as: EP1311709A1; WO2002016642A9; JP2004507247A; CA2419961A1; AU2001286528A1; EP1311709A4

Abstract

Methods for directed evolution, as depicted in Figure 3, are described in which genetic elements are randomly cleaved to permit the deletion or addition of polynucleotides, or both, to create a library of related genetic elements with additions or deletions. Corresponding library populations are also described. These processes allow a significant sampling of sequence space, which is necessary for directed evolution of genes. Further described are methods for effecting very small nucleotide deletions in genetic elements of interest.

Description

METHODS AND COMPOSITIONS FOR DIRECTED MOLECULAR EVOLUTION USING DNA-END MODIFICATION

FIELD OF THE INVENTION

[01] The invention relates to directed evolution, which encompasses methods that can be applied to genetic engineering and protein engineering. Directed evolution is used to evolve gene sequences with the goal of improving or altering gene or protein function. Directed evolution can be applied to many areas including, but not limited to, pharmaceutical development, bioremediation, bioleaching, and the chemical industry.

BACKGROUND OF THE INVENTION

[02] Recently, attempts have been made to simulate the process of evolution in vitro, thereby inducing genetic changes in specific genes to alter or improve their functions. Although techniques that alter genes have been known for several years, these methods generally required detailed knowledge of the encoded protein's structure and function to be successful. The technique of DNA shuffling overcame this barrier to a certain extent, and has been applied to evolve several genes successfully in the past few years [Minshull & Stemmer, U.S. Patent # 5,837,458 (1998)].

[03] Natural evolution occurred over millions of years for genes in the environment. In vitro evolution attempts to mimic the natural process in days or weeks. In order for an in vitro strategy to succeed, several facets of evolutionary theory must be understood. First, the concept of sequence space defines the total number of possible sequences of a protein of a given length [Kauffman, (1993)] . Thus,

$$S = 20^N$$

where S, the sequence space, is the number of possible sequences, and N is the length of the protein. In in vitro evolution experiments, it would be optimal to search S sequences of a given protein in order to identify the fraction of those with the most improved or altered activity. It can quickly be seen that a protein with a modest 50 amino acids has an S of 20⁵⁰ possible different sequences, a number which is virtually infinite in terms of analysis with current molecular biology techniques. Second, it is clear that most amino acid changes are deleterious to proteins. These changes may render the protein inactive, disrupt proper folding, or destabilize the protein or mRNA in vivo. It has been estimated that the ratio of advantageous to deleterious mutations on average is 1 in 10⁵ [Radman et al., Ann. N. Y. Acad. Sci. 870: 146-55 (1999)]. In this regard, mutation rate is an important parameter when mutating genes to improve their function. If the mutation rate is too high, deleterious mutations will occur in cis with the advantageous mutations, a condition which makes the genes with advantageous mutations impossible to identify, since the resulting proteins that contain them will be inactive due to concomitant deleterious mutations. Third, in order to overcome the consequences of higher mutation rates, homologous recombination may be utilized to remove deleterious mutations through double crossover events. Fourth, any in vitro evolution technique requires a selection screen in order to identify those sequences which improve or alter the function of the protein.
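To give a sense of scale for the sequence-space figure quoted above, the following minimal Python sketch evaluates S = 20^N for N = 50. The function name `sequence_space` is a hypothetical helper introduced here for illustration only.

```python
# Size of protein sequence space, S = 20^N, for a protein of N residues.
AMINO_ACIDS = 20  # number of possible amino acids at each position


def sequence_space(n_residues: int) -> int:
    """Return S = 20**N, the number of possible sequences of length N."""
    return AMINO_ACIDS ** n_residues


s50 = sequence_space(50)
# Python integers are arbitrary precision, so the value is exact.
print(f"S for N = 50 is about {s50:.3e}")
```

For N = 50 the value is on the order of 10⁶⁵, far beyond what any library can physically contain.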

[04] A major barrier to current molecular evolution is the inability to efficiently search sequence space for a given protein. In this regard, of critical importance is the ability to generate and identify sequences that differ at more than one residue, wherein those differences may have an additive effect upon protein function. This additive effect may be described as amino acid interdependence. For example, a protein with a single mutation at residue i may not have any detectable increase in function unless a concomitant mutation at a second residue, j, is also present. In this case, in order for evolution to be successful, all possible two-mutant variants of the target sequence should be sampled and tested for improved function. In general, the number of R-mutant variants of a protein of length N is given by:

$$S_R = \frac{20^R \, N!}{(N-R)! \, R!}$$

where R is the number of positions mutated, and 20 denotes the number of amino acids possible at each position. Thus, for a protein of length 50, there are 490,000 different two-mutant variants.
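The 490,000 figure can be checked directly. The sketch below assumes the count is 20^R multiplied by the binomial coefficient N!/((N-R)! R!), which is the reading that reproduces the number stated in the text; `r_mutant_variants` is a hypothetical helper name.

```python
from math import comb


def r_mutant_variants(n: int, r: int, alphabet: int = 20) -> int:
    """Number of variants with r mutated positions in a protein of length n.

    comb(n, r) counts the ways to choose the mutated positions;
    alphabet**r counts the residues allowed at those positions
    (following the text, all 20 amino acids are counted at each position).
    """
    return alphabet ** r * comb(n, r)


two_mutants = r_mutant_variants(50, 2)
print(two_mutants)
```

With N = 50 and R = 2, the binomial term is 1,225 and the residue term is 400, giving 490,000.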

[05] In these statistical analyses of sequence space, a critical value is the length of the protein (i.e. the number of R-mutant variants depends on the length of the protein). However, in nature the length of any given protein may not be as critical for its function as the arrangement of amino acid residues in three dimensional space. Indeed, a hypothetical concept of "catalytic task space" has been proposed to account for this principle (Kauffman, 1993). Alterations in amino acid residues without altering protein length, N, may not affect the three dimensional structure of the protein in the same ways that increasing or decreasing N would. Alternatively, changing N may not change the biological function of the protein at all. Analysis of virtually any family of homologous proteins reveals that members have different lengths, sometimes with substantial insertions or deletions, but may retain indistinguishable biological function. Thus, the above formula probably does not give an accurate view of the various R-mutant variants that should be screened when searching for improved or altered biological function. In the laboratory it would be optimum to search all of the R-mutant neighbors of a protein plus a number of variants with nucleotides either added or deleted at every position.

[06] In the case of deletions, the number of D-mutant deletions would be given by,

$$S_D = \frac{N!}{(N-D)! \, D!}$$

where N is the initial length of the protein and D is the number of positions where a deletion occurs. In the case of amino acid additions, a similar formula for all possible additions accounts for the fact that any of the 20 amino acids may be added at any position:

$$S_A = \frac{20^A \, N!}{(N-A)! \, A!}$$

where A is the number of positions where an addition occurs. In the cases of the addition and deletion mutants, these formulas both assume that only one amino acid is added or deleted at each position. In terms of in vitro molecular evolution, however, it would be optimal to search all 1, 2, 3,...C number of amino acids added or deleted at each position. Thus, for deletion mutants, the number of sequences with variable amino acids deleted at each position is given by:

$$S_{CD} = \frac{C_D \, N!}{(N-D)! \, D!}$$

where C_D denotes the number of amino acids deleted at each position and D is the number of positions where deletions occur. For addition mutants, the formula becomes:

$$S_{CA} = \frac{C_A \, 20^A \, N!}{(N-A)! \, A!}$$

where C_A is the number of amino acids added at each position, and A is the number of positions where additions occur.

[07] Since current molecular biology techniques only allow a fraction of the total space to be generated and sampled for a given protein, a formula which describes an experimental space to be generated can be defined. This expression will also allow the monitoring of improvements in library construction techniques and allow analysis of the space which is relevant to protein function. A total space to be searched experimentally can be defined as,

$$S_{EX} = S_R \, S_{CD} \, S_{CA}$$

where amino acids are mutated to other residues (S_R), are deleted (S_CD), or added (S_CA) in various combinations and permutations. Certainly current molecular biology techniques would allow a library to be created where R = 1 so S_R = N, D = 1 so S_CD = N, and A = 1 so S_CA = 20 N. For a protein of N = 50 then, this hypothetical library would contain 20 * N³ = 2.5 x 10⁶ different sequences where all permutations of one-position changes, deletions, and additions are represented.

[08] The above discussion on the sequence space relevant to protein evolution may be applied in different ways to the in vitro engineering of evolved sequences. In nature, the evolution of different catalytic activities in families of enzymes can be grouped into two broad categories: 1) those where the active site amino acids remain the same, but differences in the structural folds cause the enzymes to have different substrate specificities [Perona & Craik, J Biol Chem 272: 29987-90 (1997)], and 2) those where the enzyme structures are the same, but differences in the active site residues cause the enzymes to catalyze different reactions [Babbitt & Gerlt, J Biol Chem 272: 30591-4 (1997)]. An example of the former is the serine protease family, and of the latter is the enolase superfamily.

[09] Although the differences between these categories might seem trivial, they have important implications for the method of molecular evolution and the concept of sequence space. For the enzymes in families with similar structural folds, a molecular evolution approach which samples sequence space throughout the length of the protein would likely be the optimal strategy to alter an enzyme's specificity, since the catalytic active site will likely require the same residues for the catalytic mechanism. However, for the second type of enzymes, increasing sequence space searching through the entire length of the protein is probably not necessary. More likely, increasing the sequence space sampling of the key catalytic domains will optimize the molecular evolution process. In this regard, it would be better to alter five key amino acids to each of the twenty amino acids and sample this more limited space (20⁵) rather than to sample a sequence space of 20⁵ spread out over the entire gene sequence. Additionally, sampling as many of the possible addition or deletion mutants in key regions would also contribute to possible success of the in vitro evolution protocol. Thus, a method which would optimize molecular evolution of a gene belonging to the second type of enzyme family would be a very important and robust technique.

[10] The swapping of genetic domains is an efficient means to evolve new or improved function in biomolecules. While the alteration of single nucleotide residues can affect gene and protein function, the wholesale exchange of multiple residues in a gene can have dramatic effects on protein function. For example, E. coli and Salmonella are highly related bacterial species; however, the differences in their genetic content are due almost entirely to genetic swapping events as opposed to single residue changes. Additionally, swapping events where exchanges of large chunks of DNA create genes are thought to have occurred several times in pathways such as the clotting cascade, as well as to create novel transcription cassettes through transposition [Bell, (1997); Patthy, (1999)].

[11] A well known example of molecular evolution occurring naturally is that which underlies the production of antibodies in the immune system. In mammalian pre-lymphocytes natural molecular evolution occurs successfully on a daily basis. Antibodies are capable of binding a bewildering array of different antigens, yet have similar amino acid sequences and secondary structures. Antibody genes are arranged in the germline as gene segments (Figure 2). During lymphocyte maturation these segments (named variable or "V", diversity or "D", and junctional or "J") are juxtaposed to one another in the process termed V(D)J recombination to create a functional antibody or T-cell receptor gene. Multiple V, D, and J segments allow a substantial amount of diversity, and hence different antigen binding specificities, to be created in the final repertoire of lymphocytes. The diversity created by this mechanism is referred to as combinatorial diversity. Another type of diversity is also created during V(D)J recombination, which is as important as combinatorial diversity [Davis & Bjorkman, Nature 334: 395-402 (1988)]. This diversity is termed junctional diversity, and is created when nucleotides are lost or gained at the joints of the gene segments. Importantly, these joints encode the regions of the antibody molecule which contact antigen, so this type of diversity is critical to creating a diverse, but functional immune system.

[12] The two types of diversity utilized by the immune system might be characterized in the following way with regard to the practice of molecular evolution. Generation of combinatorial diversity in immunoglobulin genes allows a sampling of the total sequence space by providing multiple functional V, D, and J gene segments, each member of which is slightly different in sequence but still homologous to other members of the family of segments. In this respect, the combinatorial rearrangement of V, D, and J gene segments functions as a "domain swapping" event in order to generate novel antibody genes. Generation of junctional diversity allows a greater local sampling of sequence space at the critical residues for contacting antigen through the mechanism of adding or deleting random nucleotides at the ends of the DNA that are to be ligated.

[13] Due to the aforementioned issues regarding genetic evolution, namely the difficulty in searching a vast sequence space, a preponderance of deleterious mutations in random mutagenesis, and amino acid interdependence, it has been difficult to devise robust methods for searching functional sequence space in the laboratory. Current methods in widespread use for creating mutant proteins in a library format are error-prone polymerase chain reaction [Caldwell & Joyce, (1992); Gram et al., Proc Natl Acad Sci 89: 3576-80 (1992)] and cassette mutagenesis [Arkin & Youvan, Proc Natl Acad Sci 89: 7811-5 (1992); Hermes et al., Proc Natl Acad Sci 87: 696-700 (1990); Oliphant et al., Gene 44: 177-83 (1986); Stemmer & Morris, Biotechniques 13: 214-20 (1992)], in which the specific region to be optimized is replaced with a synthetically mutagenized oligonucleotide. Alternatively, mutator strains of host cells have been employed to add mutational frequency [Greener et al., Mol Biotechnol 7: 189-95 (1997)]. In each case, a "mutant cloud" [Kauffman, (1993)] is generated around certain sites in the original sequence.

[14] Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. Error-prone PCR can also be used to mutagenize a mixture of fragments of unknown sequence. Error-prone PCR can randomly mutate genes by altering the concentrations of the respective dNTPs in the presence of dITP [Caldwell & Joyce, (1992); Leung & Miyamoto, Nucleic Acids Res 17: 1177-95 (1989); Spee et al., Nucleic Acids Res 21: 777-8 (1993)].

[15] However, computer simulations have suggested that point mutagenesis alone may often be too gradual to allow the block changes that are required for continued sequence evolution. The published error-prone PCR protocols are generally unsuited for reliable amplification of DNA fragments greater than 0.5 to 1.0 kb, limiting their practical application. Further, repeated cycles of error-prone PCR lead to an accumulation of neutral mutations, which, for example, may make a protein immunogenic.

[16] In oligonucleotide-directed mutagenesis, a short sequence is replaced with a synthetically mutagenized oligonucleotide. This approach does not generate combinations of distant mutations and is thus not significantly combinatorial. The limited library size relative to the vast sequence length means that many rounds of selection are unavoidable for protein optimization. Mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round followed by grouping into families, arbitrarily choosing a single family, and reducing it to a consensus motif, which is resynthesized and reinserted into a single gene followed by additional selection. This process constitutes a statistical bottleneck; it is labor intensive and not practical for many rounds of mutagenesis.

[17] Methods of saturation mutagenesis utilizing random or partially degenerate primers that incorporate restriction sites have also been described [Hill et al., Methods Enzymol 155: 558-68 (1987); Oliphant et al., Gene 44: 177-83 (1986); Reidhaar-Olson et al., Methods Enzymol 208: 564-86 (1991)].

[18] "Cassette" mutagenesis is another method for creating libraries of mutant proteins [Bock et al., U.S. Patent # 5,830,720 (1995); Christou & McCabe, U.S. Patent # 5,830,728 (1998); Hill et al., Methods Enzymol 155: 558-68 (1987); Miller et al., U.S. Patent # 5,830,740 (1998); Shiraishi & Shimura, Gene 64: 313-9 (1988); Stemmer & Crameri, U.S. Patent # 5,830,721 (1998)]. Cassette mutagenesis typically replaces a sequence block of a template with a partially randomized sequence. The maximum information content that can be obtained is thus limited statistically to the number of random sequences in the randomized portion of the cassette.

[19] A protocol has also been developed by which synthesis of an oligonucleotide is "doped" with non-native phosphoramidites, resulting in randomization of the gene section targeted for random mutagenesis [Wang & Hoover, J Bacteriol 179: 5812-9 (1997)]. This method allows control of position selection, while retaining a random substitution rate.

[20] Zaccolo and Gherardi (1999) describe a method of random mutagenesis utilizing pyrimidine and purine nucleoside analogs [Zaccolo & Gherardi, J Mol Biol 285: 775-83 (1999)]. This method was successful in achieving substitution mutations which conferred on β-lactamase an increased catalytic rate against the cephalosporin cefotaxime. Crea describes a "walk through" method, wherein a predetermined amino acid is introduced into a targeted sequence at pre-selected positions [Crea, U.S. Patent # 5,798,208 (1998)].

[21] Methods for mutating a target gene by insertion and/or deletion mutations have also been developed. It has been demonstrated that insertion mutations could be accommodated in the interior of staphylococcal nuclease [Keefe et al., Protein Sci 3: 391-401 (1994)]. Examples of deletional mutagenesis methods developed include the utilization of an exonuclease (such as exonuclease III or Bal31) or oligonucleotide-directed deletions incorporating point deletions [Ner et al., Nucleic Acids Res 17: 4015-23 (1989)]. Additionally, Lietz describes a method whereby oligonucleotides with random sequences may be combined with PCR to induce insertions and deletions. Enhancement of function by this technique has not been shown, and the capacity to overmutagenize (i.e. make too many insertions or deletions per polynucleotide) is substantial in this method [Lietz, U.S. Patent # 6,251,604 (2001)].

[22] The technique most often used to evolve proteins in vitro is known as "DNA Shuffling". In this method, a library of gene modifications is created by fragmenting homologous sequences of a gene, allowing the fragments to randomly anneal to one another, and filling in the overhangs with polymerase. A full length gene library is then reconstructed with polymerase chain reaction (PCR). The utility of this method occurs at the step of annealing, whereby homologous sequences may anneal to one another, producing sequences with attributes of both starting sequences. In effect, the method effects recombination between two or more genes that are homologous, but that contain significant differences at several positions. It has been shown that creation of the library using several homologous sequences allows a sampling of more sequence space than using a randomly mutated single starting sequence [Crameri et al., Nature 391: 288-91 (1998)]. This effect is likely due to the fact that years of evolution have already selected for different advantageous or neutral mutations amongst the homologs of the different species. Starting with homologs, then, appreciably limits the number of deleterious mutations in the creation of the library which is to be screened. Combinatorially rearranging the advantageous positions of the homologs can apparently allow for an optimized secondary protein structure for catalyzing a biochemical reaction. The resulting evolved protein appears to contain positive features contributed from each of the starting sequences, which results in drastically improved function following selection.
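The outcome of shuffling, chimeras whose segments derive from different homologs, can be illustrated with a toy model. The sketch below is not a simulation of fragmentation, annealing, or PCR; it assumes the parents are already aligned and of equal length, and simply copies each segment between random crossover points from a randomly chosen parent. The function `toy_shuffle` and the two parent sequences are hypothetical.

```python
import random


def toy_shuffle(parents, n_crossovers, rng):
    """Build one chimera from pre-aligned, equal-length parent sequences."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("toy model assumes equal-length, pre-aligned parents")
    # Pick distinct crossover points strictly inside the sequence.
    points = sorted(rng.sample(range(1, length), n_crossovers))
    bounds = [0, *points, length]
    # Each segment between crossover points is copied from a random parent.
    segments = [rng.choice(parents)[start:end]
                for start, end in zip(bounds, bounds[1:])]
    return "".join(segments)


# Hypothetical homologs differing at a few positions.
parents = ["ATGGCTAAAGGTCTGACC", "ATGGCAAAGGGACTTACC"]
rng = random.Random(0)  # fixed seed so the toy library is reproducible
library = [toy_shuffle(parents, n_crossovers=3, rng=rng) for _ in range(5)]
for chimera in library:
    print(chimera)
```

Every position of each chimera matches one of the parents at that position, which is why a library built this way carries only differences already present among the homologs.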

[23] Alterations to the DNA shuffling technique have been devised. One process is termed the "staggered extension" process, or StEP. Instead of reassembling the pool of fragments created by the extended primers, full-length genes are assembled directly in the presence of the template(s). The StEP consists of repeated cycles of denaturation followed by extremely abbreviated annealing/extension steps. In each cycle the extended fragments can anneal to different templates based on complementarity and extend a little further to create "recombinant cassettes." Due to this template switching, most of the polynucleotides contain sequences from different parental genes (i.e. are novel recombinants). This process is repeated until full-length genes form. It can be followed by an optional gene amplification step [Arnold et al., U.S. Patent # 6,177,263 (2001)].

[24] In another technique, fragmentation of the initial DNA can be accomplished by premature termination of the polymerase in an extension reaction by inducing adduct formation in the target gene [Short, U.S. Patent # 5,965,408 (1999)]. In a different technique, a library is created by inducing incremental truncations in each of two homologs to produce a library of fusion genes, each of which contains domains donated from each homolog [Ostermeier et al., Nat. Biotechnol 17: 1205-9 (1999)]. The advantage of this approach is that significant homology amongst the starting sequences is not required since the annealing step of previous methods is omitted. It is unclear, however, whether this modified technique actually will lead to generation of improved gene function after selection techniques are applied to the library.

[25] The previously described methods of gene shuffling using alleles of genes from different organisms allow combinatorial diversity to occur, but are limited by the homology found in the starting sequences. Additionally, these methods do not provide for a mechanism which would generate the junctional diversity formed through V(D)J joining of antibody gene segments. The present invention makes use of mechanisms analogous to junctional diversity by adding and deleting residues from protein or nucleic acid sequences in either a directed or a random fashion. The present invention also provides for "gene swapping" events analogous to the combinatorial diversity generated by combinatorial V(D)J recombination. This will greatly enhance the means by which genes are evolved in vitro.

SUMMARY OF THE INVENTION

[26] The present invention involves the directed molecular evolution of nucleic acid sequences by:

(a) adding or deleting nucleotide residues at random in a polynucleotide to produce a library of polynucleotides containing additions or deletions; and

(b) optionally subjecting the pool of polynucleotides in step (a) to a selection procedure capable of identifying polynucleotides encoding a desired function or feature. Steps (a) and (b) can optionally be repeated. Libraries produced by the methods of the invention are also described and contemplated.

[27] Uniquely, the present invention allows a sampling of sequence space which will include sequences that significantly affect secondary protein structure, thus increasing the probability of identifying altered or improved function in an evolved gene. Further, the present invention allows a sampling of sequence space which cannot be sampled by other current technologies. Moreover, libraries of polynucleotides created with the present invention cannot be obtained utilizing other current technologies.

[28] Several methods and compositions are described and contemplated below. One method of the invention generates a library of polynucleotide sequences having nucleotide deletions at differing positions in a sequence of a genetic element comprising the steps of:

(a) subjecting multiple copies of circular polynucleotides comprising the genetic element to random cleavage to obtain multiple linear polynucleotides each polynucleotide having at least one 3' and 5' end; and

(b) subjecting said polynucleotides from step (a) to a process which removes at least one nucleotide from one of said DNA ends of said polynucleotides producing a library of deletion polynucleotide sequences, said library comprising multiple deletion polynucleotide sequences with deletions at different random positions.

Further, if desired, polynucleotides from step (b) may be subjected to a process that covalently joins the 3' and 5' ends to one another, and the library of polynucleotides may be further subjected to a process that selects for a function of interest. The library of deletion polynucleotides may comprise two or more individual polynucleotides, for example at least 10, 20, or 30 or more, or even 50 to 100, each having a random deletion at a different position from the others. The number of deletions made will depend upon the starting material and the goal of the technician. In some embodiments, the library of deletion polynucleotides comprises very short deletions of at least 1, 2, 3, 4, or 5 individual nucleotides or more. In different embodiments, the library may comprise larger deletions of 50-100 or more nucleotides. In another embodiment, the composition of multiple copies of circular polynucleotides is free of naturally-occurring homologs to the genetic element. Further, steps (a) and (b) may optionally be repeated. Another optional method includes a process for inserting nucleotides at the position of deletion in step (b).
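Steps (a) and (b), followed by the optional joining of the ends, can be sketched in silico. The following is a minimal illustration under stated assumptions, not the laboratory procedure: the circular polynucleotide is modeled as a single-strand string, random cleavage as a random rotation of that string, and end deletion as trimming a few characters from either end. The helper names, the `max_del` parameter, and the example sequence are all hypothetical.

```python
import random


def random_cleave(circular: str, rng: random.Random) -> str:
    """Step (a): linearize a circular sequence by cutting at a random position."""
    cut = rng.randrange(len(circular))
    return circular[cut:] + circular[:cut]


def delete_from_ends(linear: str, max_del: int, rng: random.Random) -> str:
    """Step (b): remove 0..max_del nucleotides from each end, at least one in total."""
    left = rng.randint(0, max_del)
    right = rng.randint(0 if left else 1, max_del)
    return linear[left:len(linear) - right]


def deletion_library(circular: str, size: int, max_del: int = 3, seed: int = 0):
    """Return `size` molecules, each with a short deletion at a random site.

    Each returned string stands for the re-joined (recircularized) molecule,
    written starting from the junction where the two trimmed ends meet.
    """
    rng = random.Random(seed)
    return [delete_from_ends(random_cleave(circular, rng), max_del, rng)
            for _ in range(size)]


# Hypothetical genetic element; any sequence longer than 2 * max_del works.
plasmid = "ATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGG"
library = deletion_library(plasmid, size=10)
for member in library:
    print(len(plasmid) - len(member), member)
```

Because the cut position is drawn independently for each copy, the deletions fall at different positions around the circle, which is the property the library relies on.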

[29] Substantially pure compositions comprising a library of multiple (preferably more than two, more preferably more than 5, most preferably more than 10) linear polynucleotides each having a different 3' and a 5' end, but each linear polynucleotide being identical to the others if circularized are described and contemplated.

[30] Substantially pure compositions comprising a library of at least 2 (preferably more than 5, more preferably more than 10) deletion polynucleotides each differing from the other only by having a different random deletion are also described and contemplated. Optionally such deletion polynucleotides further comprise at least one nucleotide inserted at the position of deletion.

[31] Another method of the invention generates a library of polynucleotide sequences having nucleotide additions at random positions in a genetic element comprising the steps of:

(a) subjecting a composition of multiple copies of circular polynucleotides with the genetic element to random cleavage to obtain multiple linear polynucleotides each polynucleotide having at least one 3' and 5' end; and

(b) subjecting said polynucleotides from step (a) to a process which adds at least one nucleotide to one of said ends of said polynucleotides producing a library of addition polynucleotide sequences, said library comprising multiple addition sequences with additions at different random positions.

Further, if desired, the addition polynucleotides from step (b) may be subjected to a process that covalently joins said 3' and 5' DNA ends to one another. Optionally, the library of polynucleotides may be subjected to a process that selects for a function of interest.

In any of the methods described here, cleavage preferably occurs with the use of an endonuclease, preferably S1. This method permits the library of addition polynucleotides to comprise any number of different polynucleotides, for example, at least 5, 10, 20 or 30 individual polynucleotides each having a random addition of nucleotides at a different position from the others. In one embodiment of the claimed invention, the composition of multiple copies of circular polynucleotides is free of naturally-occurring homologs to the genetic element. Optionally, steps (a) and (b) of the method may be repeated. Another option includes a process for deleting nucleotides at the point of addition in step (b). Any number of nucleotides may be added in step (b) depending upon the starting molecule and the goal of the technician, for example, 1-3, 3-50, or 50-100 or more nucleotides may be added in step (b).

[32] Substantially pure compositions comprising a library of at least 2 (preferably at least 5, most preferably at least 10) addition polynucleotides each differing from the other only by having a different random addition are contemplated.
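The addition method can be sketched in the same way as a toy model. The sketch assumes a single-strand string representation, models random cleavage as a random rotation of the circular sequence, and models step (b) as appending one to `max_add` random nucleotides to one end before the ends are rejoined. The function name, the `max_add` parameter, and the example sequence are hypothetical.

```python
import random

BASES = "ACGT"


def addition_library(circular: str, size: int, max_add: int = 3, seed: int = 0):
    """Return `size` molecules, each with a short random insertion at a random site."""
    rng = random.Random(seed)
    library = []
    for _ in range(size):
        # Step (a): random cleavage linearizes the circle at a random position.
        cut = rng.randrange(len(circular))
        linear = circular[cut:] + circular[:cut]
        # Step (b): add 1..max_add random nucleotides to one end.
        n_added = rng.randint(1, max_add)
        added = "".join(rng.choice(BASES) for _ in range(n_added))
        # Rejoining the ends places `added` at the original cut site.
        library.append(linear + added)
    return library


# Hypothetical genetic element.
plasmid = "ATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGG"
library = addition_library(plasmid, size=10)
for member in library:
    print(member[len(plasmid):], member)
```

Each member contains the complete original sequence plus the added nucleotides, so members differ from one another only in where the addition sits and what was added.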

[33] Further, the present invention surprisingly provides a method to make short deletions at the end of a polynucleotide, producing a population of polynucleotides with short deletions (from 1 to 100, preferably from 1 to 35, most preferably from 1 to 10 nucleotides) at the end. A DNA end having such deletions can then be covalently joined with other DNA ends, producing a library of polynucleotides containing deletions at a specific internal position. Often the two ends to be ligated will be present on the same DNA molecule, such that the resulting ligation product comprises circular polynucleotides. Such methods and compositions are important in the areas of protein engineering and directed evolution.

BRIEF DESCRIPTION OF THE DRAWINGS

[34] FIG. 1 is a diagram of the process of DNA shuffling, an earlier method of choice for molecular evolution. Homologs of a gene of interest are fragmented, subjected to denaturation and reannealing such that the single-strand fragments from the homologs can prime one another in an extension reaction. Amplification of the full length gene then produces a library of hybrid genes. A genetic screen is then applied to select an altered or improved gene.

[35] FIG. 2 is a diagram of the immunoglobulin heavy chain locus illustrating the process of V(D)J recombination which produces combinatorial diversity, and DNA end-joining which produces junctional diversity.

[36] FIG. 3 is a diagram illustrating an example of a method which produces nucleotide deletions and insertions at random positions in a polynucleotide. A target gene is cleaved to produce a pool of genes, each of which is cleaved at a random position in the gene. Residues can be deleted (left), or inserted (right) at the DNA ends to produce libraries containing deletions, insertions, or deletions and insertions at random positions.

[37] FIG. 4 is a diagram illustrating the random cleavage of a polynucleotide. In panel A (Fig. 4A), the DNA plasmid pLacZi (Clontech, Palo Alto, CA) was either uncleaved (lane 2), cleaved with the single-cutting restriction enzyme Cla I (lane 3), or cleaved with increasing concentrations of S1 nuclease (lanes 4-7). Lanes 1 and 8 are lambda/Hind III DNA markers. In panel B (Fig. 4B), the pLacZi plasmid is uncut (lane 2), cleaved with Cla I (lane 3), or cleaved with S1 nuclease (lane 4). A sample of the S1-cleaved pLacZi was gel purified and run in lane 5, or further cleaved with Cla I and run in lane 6. Equal amounts of DNA were run in lanes 2-4 (1 μg), and lanes 5-6 (100 ng). The smear in lane 6 illustrates that cleavage by S1 was not site-specific. Lanes 1 and 7 contain lambda/Hind III DNA markers.

[38] FIG. 5 is a diagram illustrating an example of a method which produces short nucleotide deletions at a DNA end. Exonuclease III deletes nucleotides from the ends of a fluorescently labeled 232 bp DNA fragment in a salt-dependent reaction. As salt is increased the number of deletions decreases.

[39] FIG. 6 is a diagram illustrating the deletion of nucleotides in the LacZ gene. The plasmid pLacZi was cleaved with Cla I, treated with exonuclease III as described in FIG. 5, re-ligated, electroporated into E. coli, and plated on plates containing the colorimetric lactose analog X-Gal. Clones with either a blue or white color were picked, grown in LB, and DNA prepared. Plasmid was subjected to PCR with primers flanking the Cla I site, where one primer was fluorescently labeled. The PCR product was run on a 6% denaturing acrylamide gel in an ABI 373 DNA sequencer and analyzed with Genescan software (Perkin Elmer, Foster City, CA). The top panel shows PCR with the wild-type LacZ gene, producing a 312 bp fragment. Clones 1-6 had variable short deletions present. Clones 1 and 6 had blue phenotypes and 2-5 had white phenotypes.

[40] FIG. 7 is a 1.5% agarose gel showing 3 clones containing an insertion in pLacZi. CHO cell cDNA was fragmented with DNase I, ligated into the Cla I site of pLacZi, electroporated into E. coli, and plated on X-Gal plates. Three clones were analyzed by PCR of plasmid DNA using primers flanking the Cla I site. Lanes labeled 1-3 are clones containing different sized insertions, and lane 4 is pLacZi. The DNA in the first and last lanes are ΦX174/Hae III DNA markers with their sizes in basepairs indicated at the right.

DETAILED DESCRIPTION OF THE INVENTION

[41] Gene swapping events constitute a major driver in the evolution of macromolecules. Swapping events may include nucleotide insertions, deletions, or replacements. A swapping event may occur by means of homologous recombination, but may also occur by non-homologous means, as it does in V(D)J recombination and the DNA end-joining mechanism used by antibody gene segments [Smider & Chu, Semin. Immunol. 9: 189-97 (1997)]. Current technologies for molecular evolution do not provide a generally applicable non-homologous means for gene swapping.

[42] Applications of the current invention include producing novel genetic elements with improved or altered function. These genetic elements can have significant commercial value. For instance, the genetic element may enhance production of a protein pharmaceutical. The genetic element may encode a protein pharmaceutical such as a monoclonal antibody, or an enzyme used to treat a disease. Further, the genetic element may encode an enzyme important in industrial processes such as chemical manufacturing, or may be used in a product such as laundry detergent (e.g. proteases, lipases, or esterases). Further, the genetic element may have important uses in agriculture, such as to provide a means for pathogen resistance, or to allow production of novel nutrients by a plant species. Additionally, the genetic element may be used in microorganisms to produce novel products for human use, such as novel antibiotics, pigments or other small molecules. As can be seen, the modification of genetic elements in order to improve or alter their function has a myriad of applications in several diverse industries.

[43] For the purposes of describing this invention the following terms will be helpful and will have the following meanings:

Definitions

[44] The term "base" refers to a component of nucleic acid consisting of either adenine, guanine, thymine, cytosine, or uracil. Additionally, "purine" refers to either adenine or guanine, and "pyrimidine" refers to either thymine, cytosine, or uracil. [45] The term "nucleoside" refers to a molecule comprising the covalent linkage of a pyrimidine or purine to a pentose ring (such as ribose or deoxyribose).

[46] The term "nucleotide" refers to the phosphate ester of a nucleoside. [47] The term "polynucleotide(s)" refers to a molecule containing at least one 5' hydroxyl of one nucleotide covalently linked to one 3' hydroxyl of at least one other nucleotide through a bond such as a phosphodiester bond. A polynucleotide is necessarily composed of "positions" containing "residues" as defined below.

[48] The term "position" as it relates to a polynucleotide sequence or polypeptide sequence refers to the location of a given residue in the polynucleotide or polypeptide chain. For example, "position" in a polynucleotide sequence is defined as the location of a nucleotide in the polynucleotide chain in reference to at least one other nucleotide. For instance in the simple polynucleotide TG, the T is in position 1 (in reference to itself) and G is in position 2 (in reference to the T in position 1). Often it is convention to label the furthest 5' nucleotide as a reference and label it as position 1. In a double stranded polynucleotide encoding a gene, such as DNA, often the translation start site of a gene is labeled as position 1. This is often the adenine in the ATG translation start sequence. Positions located 5' from the ATG would be given a negative position (such as -11, -35, etc.) and positions located 3' to the ATG would be given positive positions. Those skilled in the art will recognize the nature of the term "position" as it relates to the numbering scheme in sequences of polynucleotides. A "sequence" refers to the string resulting from the composition of the residues occupying each position. For example the sequence ATG means that the base adenine occupies a position which immediately precedes thymine, and thymine occupies a position which immediately precedes guanine. A "specific position" refers to a position in a polynucleotide between at least two nucleotides whose sequence and composition is known.
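The numbering convention described above can be illustrated with a minimal Python sketch. The function name, the example sequence, and the choice to skip a position 0 (so that the residue immediately 5' of the A is labeled -1) are assumptions made for illustration; the text itself only states that the A of the ATG is position 1 and that positions 5' of it are negative.

```python
def position_label(index: int, start_index: int) -> int:
    """Map a 0-based string index to a position label relative to the ATG.

    The A of the start codon is labeled +1 and residues 3' of it count upward.
    Residues 5' of the A receive negative labels.
    Assumption: no position 0 exists, so the residue just 5' of the A is -1.
    """
    offset = index - start_index
    return offset + 1 if offset >= 0 else offset


# Hypothetical sequence with four residues 5' of the start codon.
seq = "GGCTATGAAA"
start = seq.find("ATG")  # index of the A in the first ATG
labels = [(base, position_label(i, start)) for i, base in enumerate(seq)]
print(labels)
```

Under these assumptions the leading G is labeled -4 and the G of the ATG is labeled 3.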

[49] The term "residue" as it relates to a polynucleotide or polypeptide refers to either a purine or pyrimidine nucleotide for polynucleotides, or an amino acid for a polypeptide.

[50] A "genetic element" means a sequence of polynucleotide encoding a function. For example, a "genetic element" may encode a polypeptide sequence, may encode a promoter function, an enhancer function, a transcription start or stop site, or RNA splice sites and the like. Genetic elements may be operatively linked to other genetic elements, for example a promoter may be operatively linked to a genetic element encoding a protein to allow expression of a protein in a given cell type. The terms "gene" and "gene of interest" refer to a polynucleotide capable of encoding a polypeptide.

[51] The term "swap" or "gene swapping" in reference to a polynucleotide means either: 1) the occurrence of a deletion of at least two residues occupying consecutive positions in a polynucleotide, or 2) the occurrence of an addition of at least two residues occupying consecutive positions into a polynucleotide, or 3) the replacement of at least two residues occupying consecutive positions in a polynucleotide with other residues.

[52] The term "nucleotide deletions" as applied to a polynucleotide means that a polynucleotide has had one or more specific residues removed from one or more positions in the polynucleotide chain when the resulting polynucleotide is compared to the parental, wild-type, or other reference sequence.

[53] The term "nucleotide insertions" or "nucleotide additions" means that a polynucleotide has had specific residues added to the polynucleotide chain, such that at least one of the original residues now occupies a new position in the polynucleotide when compared to the parental, wild-type, or other reference sequence. [54] The term "library of polynucleotide sequences" refers to a mixture of polynucleotides, wherein at least one of the sequences differs from at least one other sequence in the mixture by sequence composition or length, for example, where at least one position is occupied by a different nucleotide when the two sequences are compared or at least one nucleotide position is absent in one sequence when compared with the other sequence.
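As an illustration of the definitions of nucleotide deletions, nucleotide insertions, and a library, the following Python sketch compares a variant with a reference sequence and reports the single contiguous change between them. The function names and example sequences are hypothetical, and the sketch assumes linear sequences that differ at one contiguous site only.

```python
def describe_change(reference: str, variant: str):
    """Return (position, removed, added) for one contiguous change.

    `position` is the 1-based position in the reference where the change
    begins. When the flanking residues repeat, several placements are
    equivalent; this sketch reports the one found after matching the
    longest common prefix first.
    """
    shortest = min(len(reference), len(variant))
    prefix = 0
    while prefix < shortest and reference[prefix] == variant[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < shortest - prefix
           and reference[len(reference) - 1 - suffix]
           == variant[len(variant) - 1 - suffix]):
        suffix += 1
    removed = reference[prefix:len(reference) - suffix]
    added = variant[prefix:len(variant) - suffix]
    return prefix + 1, removed, added


def is_library(sequences) -> bool:
    """True if at least one sequence differs from at least one other."""
    return len(set(sequences)) > 1


deletion = describe_change("ATGCCGTA", "ATGGTA")   # two residues removed
insertion = describe_change("ATGGTA", "ATGCCGTA")  # two residues added
```

A change with a non-empty `removed` and an empty `added` is a deletion, the reverse is an insertion, and a change with both non-empty corresponds to a replacement.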

[55] The term "DNA" refers to deoxyribonucleic acid. It will be understood by those of skill in the art that where manipulations are described herein that relate to DNA they will also apply to RNA.

[56] The term "DNA ends" or "ends" refers to the position in a DNA strand wherein a phosphodiester bond is broken. In a single-stranded DNA end a nucleotide is only covalently linked with one other nucleotide. A "double-stranded DNA or RNA end" refers to the position in a double-stranded DNA or RNA molecule wherein the molecule is no longer double-stranded. Generally DNA ends are recognizable to those skilled in the art. Double-stranded DNA ends are characterized as blunt, having a 5' overhang, a 3' overhang, or a hairpin structure. A DNA end may or may not contain a 5' phosphate group.

[57] The term "cleavage" as used herein refers to the breakage of a bond between two nucleotides, such as a phosphodiester bond.

[58] The term "circular polynucleotide" refers to a polynucleotide wherein no double-stranded DNA ends are present. A circular polynucleotide may be single-stranded or double-stranded. A circular polynucleotide may, however, contain single-stranded DNA ends. A circular polynucleotide will be present if single-stranded DNA ends exist but hydrogen bonding keeps the two strands of the double-stranded molecule hybridized to one another such that a double-stranded DNA end is not created by the presence of two single-stranded ends in proximity to one another. Such a circular double-stranded polynucleotide is often referred to as "nicked".

[59] The term "linear polynucleotide" is a polynucleotide which contains at least one, but most often two DNA ends. A linear polynucleotide may be either single-stranded or double-stranded.

[60] The term "random" or "random position" as applied to a polynucleotide refers to a process by which any of the specific residue positions may be selected. Random as used here does not mean that all points of cleavage, nucleotides, or positions are selected or chosen with equal frequency. Rather, random focuses on the unpredictable nature of the process, i.e. the worker cannot predict a priori where an event will occur or what position any base will have. Finally, not all positions need be available for cleavage for the process to be random as to the available positions or bases. For example, a polynucleotide with a length of N may have any or all of its positions (i.e. 1, 2,...N) affected by a manipulation. In the addition (insertion) or deletion of residues, a polynucleotide necessarily must have covalent bonds (such as phosphodiester bonds) cleaved, after which residues are deleted or added (i.e. the total number of positions is decreased or increased, respectively). In describing "deletions at random positions" in a polynucleotide of length N, it is meant that any or all of the N (in a circular polynucleotide) or N-1 (in a linear polynucleotide) covalent linkages between nucleotides (i.e. phosphodiester bonds) are broken, and at least one nucleotide at the end is removed prior to re-ligation. Thus, in a process causing "deletions at random positions" the final length of the polynucleotide (N, or the number of positions) necessarily decreases. Similarly, in describing "insertions at random positions" in a polynucleotide of length N, it is meant that any or all of the N (in a circular polynucleotide) or N-1 (in a linear polynucleotide) covalent linkages between nucleotides (i.e. phosphodiester bonds) are broken, and at least one new nucleotide (i.e. a new position) is added at the end prior to re-ligation. Thus, in a process causing "insertions at random positions" the final length of the polynucleotide (N, or the number of positions) necessarily increases. It is recognized that a combination of processes involving "deletions at random positions" and "insertions at random positions" may allow the final length of the polynucleotide to remain unchanged (i.e. the additions cancel out the deletions and the final number of positions remains the same, however the nucleotides occupying the positions may be different). In describing "random cleavage" or a "single random break" in a polynucleotide of length N, it is meant that any one of the N (in a circular polynucleotide) or N-1 (in a linear polynucleotide) covalent linkages between residue positions in a single polynucleotide molecule is cleaved. Accordingly, in one vessel containing many copies of a polynucleotide, a single random break can occur at different positions in different molecules.
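The counting of cleavable linkages and the notion of a single random break can be sketched in Python as follows. This is a simplified single-strand model with hypothetical function names. It draws the break uniformly for convenience, although the definition above explicitly does not require equal frequency at every linkage.

```python
import random


def linkage_count(n: int, circular: bool) -> int:
    """Number of inter-residue linkages: N if circular, N-1 if linear."""
    return n if circular else n - 1


def single_random_break(seq: str, circular: bool, rng: random.Random):
    """Cleave exactly one randomly chosen linkage in one molecule.

    A circular molecule yields one linear molecule whose ends flank the
    break. A linear molecule yields two fragments.
    Assumption: every linkage is equally likely, which is a simplification.
    """
    k = rng.randrange(linkage_count(len(seq), circular))
    cut = k + 1  # the break falls between residue k and residue k + 1
    if circular:
        return [seq[cut:] + seq[:cut]]
    return [seq[:cut], seq[cut:]]


rng = random.Random(0)
molecule = "ACGTACGTAC"
# Many copies in one vessel: each copy is broken independently.
opened_circles = [single_random_break(molecule, True, rng)[0] for _ in range(5)]
linear_fragments = single_random_break(molecule, False, rng)
```

Each entry of `opened_circles` is a rotation of the original string, which reflects that different copies in the same vessel can be broken at different positions.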

[61] As used herein, "substantially pure" means an object species is the predominant species present (i.e., on a molar basis it is more abundant than any other individual macromolecular species in the composition), and preferably a substantially purified fraction is a composition wherein the object species comprises at least about 50 percent (on a molar basis) of all macromolecular species present. Generally, a substantially pure composition will comprise more than about 80 to 90 percent of all macromolecular species present in the composition. Most preferably, the object species is purified to essential homogeneity (contaminant species cannot be detected in the composition by conventional detection methods) wherein the composition consists essentially of a single macromolecular species. Solvent species, small molecules (<500 Daltons), and elemental ion species are not considered macromolecular species.
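The molar criteria in this definition can be expressed as a short Python sketch. The function name, dictionary keys, and example amounts are hypothetical, and the thresholds are read as exact cut-offs even though the text qualifies them with "about".

```python
def purity_summary(moles_by_species, target):
    """Summarize the molar purity of `target` among macromolecular species.

    Assumption: solvent, small molecules (<500 Daltons), and ions have
    already been excluded from `moles_by_species`.
    """
    total = sum(moles_by_species.values())
    target_moles = moles_by_species[target]
    fraction = target_moles / total
    predominant = all(
        target_moles > amount
        for name, amount in moles_by_species.items()
        if name != target
    )
    return {
        "fraction": fraction,
        "predominant": predominant,          # more abundant than any other species
        "at_least_half": fraction >= 0.5,    # "at least about 50 percent"
        "above_80_percent": fraction > 0.8,  # "more than about 80 to 90 percent"
    }


# Hypothetical preparation, amounts in picomoles.
sample = {"plasmid": 8.5, "genomic_fragment": 1.0, "rna": 0.5}
summary = purity_summary(sample, "plasmid")
```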

[62] The term "homologous" or "homeologous" means that one single-stranded nucleic acid sequence may hybridize to a complementary single-stranded nucleic acid sequence. The degree of hybridization may depend on a number of factors including the amount of identity between the sequences and the hybridization conditions such as temperature and salt concentration as discussed later. Preferably the region of identity is greater than about 5 bp, more preferably the region of identity is greater than 10 bp. Thus, "homologs" are nucleic acid molecules that are not identical but are capable of hybridizing to one another under physiological conditions. Double-stranded homologs are capable of hybridizing to one another following denaturation.

[63] The term "heterologous" means that one single-stranded nucleic acid sequence is unable to hybridize to another single-stranded nucleic acid sequence or its complement. Thus areas of heterology means that nucleic acid fragments or polynucleotides have areas or regions in the sequence which are unable to hybridize to another nucleic acid or polynucleotide. Such regions or areas are, for example, areas of mutations.

[64] The term "identical" or "identity" means that two nucleic acid sequences have the same sequence or a complementary sequence. Thus, "areas of identity" means that regions or areas of a nucleic acid fragment or polynucleotide are identical or complementary to another polynucleotide or nucleic acid fragment.

[65] The term "amplification" means that the number of copies of a nucleic acid fragment is increased.

[66] The term "wild-type" means that the nucleic acid fragment does not comprise any mutations. A "wild-type" protein means that the protein will be active at a comparable level of activity found in nature and typically will comprise the amino acid sequence found in nature. In an aspect of the invention, the term "wild type" or "parental sequence" can indicate a starting or reference sequence prior to a manipulation of the sequence.

[67] The term "related polynucleotides" means that regions or areas of the polynucleotides are identical and regions or areas of the polynucleotides are heterologous.

[68] The term "chimeric polynucleotide" means that the polynucleotide comprises nucleotide regions which are wild-type and regions which are mutated. It may also mean that the polynucleotide comprises wild-type regions from one polynucleotide and wild-type regions from another related polynucleotide.

[69] The term "population" as used herein means a collection of components such as polynucleotides, nucleic acid fragments or proteins. A "mixed population" means a collection of components which belong to the same family of nucleic acids or proteins (i.e. are related) but which differ in their sequence (i.e. are not identical) and hence in their biological activity. A "library" necessarily implies a population wherein at least two of the components are different in some aspect (chemical composition, length, etc.).

[70] The term "specific nucleic acid fragment" means a nucleic acid fragment having certain end points and having a certain nucleic acid sequence. Two nucleic acid fragments wherein one nucleic acid fragment has the identical sequence as a portion of the second nucleic acid fragment but different ends comprise two different specific nucleic acid fragments. Two nucleic acid fragments with identical sequences but different 5' or 3' ends comprise two different specific nucleic acid fragments.

[71] The term "mutations" means changes in the sequence of a wild-type nucleic acid sequence or changes in the sequence of a peptide. Such mutations may be point mutations such as transitions or transversions. The mutations may be deletions, insertions or duplications.

[72] In the polypeptide notation used herein, the left-hand direction is the amino terminal direction and the right-hand direction is the carboxy-terminal direction, in accordance with standard usage and convention. Similarly, unless specified otherwise, the left-hand end of single-stranded polynucleotide sequences is the 5' end; the left-hand direction of double-stranded polynucleotide sequences is referred to as the 5' direction. The direction of 5' to 3' addition of nascent RNA transcripts is referred to as the transcription direction; sequence regions on the DNA strand having the same sequence as the RNA and which are 5' to the 5' end of the RNA transcript are referred to as "upstream sequences"; sequence regions on the DNA strand having the same sequence as the RNA and which are 3' to the 3' end of the coding RNA transcript are referred to as "downstream sequences".

[73] The term "naturally-occurring" as used herein as applied to an object refers to the fact that an object can be found in nature. For example, a polypeptide or polynucleotide sequence that is present in an organism (including viruses) that can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory is naturally-occurring. Generally, the term naturally-occurring refers to an object as present in a non-pathological (undiseased) individual, such as would be typical for the species.

[74] As used herein the term "physiological conditions" refers to temperature, pH, ionic strength, viscosity, and like biochemical parameters which are compatible with a viable organism, and/or which typically exist intracellularly in a viable cultured yeast cell or mammalian cell. For example, the intracellular conditions in a yeast cell grown under typical laboratory culture conditions are physiological conditions. Suitable in vitro reaction conditions for in vitro transcription cocktails are generally physiological conditions. In general, in vitro physiological conditions comprise 50-200 mM NaCl or KCl, pH 6.5-8.5, 20-45°C, and 0.001-10 mM divalent cation (e.g., Mg⁺⁺, Ca⁺⁺); preferably about 150 mM NaCl or KCl, pH 7.2-7.6, 5 mM divalent cation, and often include 0.01-1.0 percent nonspecific protein (e.g., BSA). A non-ionic detergent (Tween, NP-40, Triton X-100) can often be present, usually at about 0.001 to 2%, typically 0.05-0.2% (v/v). Particular aqueous conditions may be selected by the practitioner according to conventional methods. For general guidance, the following buffered aqueous conditions may be applicable: 10-250 mM NaCl, 5-50 mM Tris HCl, pH 5-8, with optional addition of divalent cation(s) and/or metal chelators and/or nonionic detergents and/or membrane fractions and/or antifoam agents and/or scintillants.
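The general in vitro ranges quoted above can be captured in a small Python check. The class and field names are hypothetical, and only the four general ranges (salt, pH, temperature, divalent cation) are encoded; the optional protein and detergent components are left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class BufferConditions:
    salt_mM: float      # NaCl or KCl concentration
    pH: float
    temp_C: float
    divalent_mM: float  # e.g. Mg++ or Ca++ concentration


def within_general_ranges(c: BufferConditions) -> bool:
    """Check the general in vitro ranges: 50-200 mM salt, pH 6.5-8.5,
    20-45 degrees C, and 0.001-10 mM divalent cation."""
    return (
        50 <= c.salt_mM <= 200
        and 6.5 <= c.pH <= 8.5
        and 20 <= c.temp_C <= 45
        and 0.001 <= c.divalent_mM <= 10
    )


# The "preferably about" values from the text fall inside the general ranges.
preferred = BufferConditions(salt_mM=150, pH=7.4, temp_C=37, divalent_mM=5)
high_salt = BufferConditions(salt_mM=500, pH=7.4, temp_C=37, divalent_mM=5)
```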

[75] As used herein, "linker" or "spacer" refers to a molecule or group of molecules that connects two molecules, such as a DNA binding protein and a random peptide, and serves to place the two molecules in a preferred configuration, e.g., so that the random peptide can bind to a receptor with minimal steric hindrance from the DNA binding protein.

[76] As used herein, the term "operably linked" refers to a linkage of polynucleotide elements in a functional relationship. A nucleic acid is "operably linked" when it is placed into a functional relationship with another nucleic acid sequence. For instance, a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the coding sequence. Operably linked means that the DNA sequences being linked are typically contiguous and, where necessary to join two protein coding regions, contiguous and in reading frame.

Producing Libraries of Evolving Random Molecules

[77] The present invention provides a method to create libraries of polynucleotides containing either nucleotide deletions, insertions or combinations of deletions and insertions at random positions. In effect this invention provides a means to "swap" genetic elements without the need for homology or amplification techniques. The swapping of genetic elements is known to be a driving force in evolution of macromolecules, cells, and organisms [Ostermeier & Benkovic, Adv Protein Chem 55: 29-77 (2000)]. Current techniques, such as PCR based gene shuffling, do not allow significant swapping of genetic elements independent of homology.

Deletions

[78] In one embodiment, the invention provides a method to create a population of polynucleotides, with members of the population differing from one another by the presence of deletions at a single random position. One method of the invention, for example, comprises the steps of:

(a) cleavage of a composition of multiple copies of polynucleotides at random positions to create two ends;

(b) subjecting said polynucleotides from step (a) to a process which removes at least one nucleotide from one end of said ends of said polynucleotides; and

(c) optionally subjecting said polynucleotides from step (b) to a process which covalently joins said ends to one another, producing a library of polynucleotides which contains at least one polynucleotide that differs from the others by a deletion at one position.
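Steps (a) to (c) can be mimicked in silico with a toy Python model, shown here only to make the bookkeeping concrete. It treats a circular polynucleotide as a string, picks the break uniformly, and removes between 1 and `max_del` residues from one end; the function name, the `max_del` parameter, and the example plasmid sequence are assumptions, and no enzyme kinetics are modeled.

```python
import random


def delete_at_random_break(circle: str, max_del: int, rng: random.Random) -> str:
    """One pass of steps (a)-(c) on a circular molecule modeled as a string.

    Assumption: `max_del` is smaller than the length of the molecule.
    """
    cut = rng.randrange(len(circle))      # (a) cleave one of the N linkages
    linear = circle[cut:] + circle[:cut]  # linear molecule with two ends
    k = rng.randint(1, max_del)           # (b) remove k >= 1 residues from one end
    if rng.random() < 0.5:
        trimmed = linear[k:]              # residues removed from the left-hand end
    else:
        trimmed = linear[:-k]             # residues removed from the right-hand end
    return trimmed                        # (c) joining the ends recloses the circle


rng = random.Random(42)
plasmid = "ATGGCCATTGTAATGGGCCGC"  # hypothetical 21-residue circle
library = [delete_at_random_break(plasmid, 3, rng) for _ in range(200)]
```

Because the result is again a circle, two different strings in `library` may represent the same molecule written from a different starting point; a fuller model would normalize the rotation before counting distinct members.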

[79] Further, the invention provides a population of polynucleotides, with members of the population differing from one another by the presence of deletions at a single random position. It is contemplated that deletions will allow removal of detrimental or unwanted functions of a genetic element. These functions might include protease sites, ion binding domains, DNA binding sequences for inhibitory transcription factors, immunogenic domains of proteins and the like.

[80] In a further embodiment, the invention provides a method, for example, to generate polynucleotides wherein the polynucleotides contain deletions at more than one position. One method comprises the steps of:

(a) cleavage of a composition of multiple copies of polynucleotides at random positions to create two ends;

(b) subjecting said polynucleotides from step (a) to a process which removes at least one nucleotide from one end of said ends of said polynucleotides; and

(c) optionally, subjecting said polynucleotides from step (b) to a process which covalently joins said ends to one another, producing a library of polynucleotides which contains at least one polynucleotide that differs from the others by a deletion at one position. A function of interest may then be selected for if desired (step (d)). Further, if desired, steps (a) to (c) or steps (a) to (d) may be repeated from 1 to 50 times or more.

[81] Further, the invention provides a population of polynucleotides wherein the polynucleotides contain deletions at more than one position. It is contemplated that deletions at multiple positions will allow removal of multiple detrimental or unwanted functions of a genetic element. These functions might include any combination of protease sites, ion binding domains, DNA binding sequences for inhibitory transcription factors, immunogenic domains of proteins or other functions of interest as will be well appreciated by those of skill in the art.

Insertions

[82] In one embodiment, the invention provides a method to create a population of polynucleotides, with members of the population differing from one another by the presence of insertions at a single random position. One method comprises the steps of:

(a) cleavage of a composition of multiple copies of polynucleotides at random positions;

(b) subjecting said polynucleotides from step (a) to a process which inserts at least one nucleotide to at least one end of said polynucleotides; and

(c) optionally subjecting said polynucleotides from step (b) to a process which covalently joins said ends to one another, producing a library of polynucleotides which contains at least one polynucleotide that differs from the others by an insertion at one position.

[83] Further, the invention provides a population of polynucleotides, with members of the population differing from one another by the presence of insertions at a single random position. This embodiment of the invention will allow novel fusion of genetic elements to occur. For example, a toxin could be fused to a targeting molecule (like an antibody), enzyme modules in important metabolic pathways (such as polyketide synthetases) could be fused in new ways, or new functions like binding domains (i.e. nucleic acid binding domains, small molecule or ion binding domains, protease sites, or other post-translational modification modules) could be incorporated into existing genetic elements.

[84] Likewise in another embodiment, the invention provides a method to generate polynucleotides wherein the polynucleotides contain insertions at more than one position. One method comprises the steps of:

(a) cleavage of a composition of multiple copies of polynucleotides at random positions;

(b) subjecting said polynucleotides from step (a) to a process which inserts at least one nucleotide to at least one end of said DNA ends of said polynucleotides; and

(c) optionally, subjecting said polynucleotides from step (b) to a process which covalently joins said DNA ends to one another, producing a library of polynucleotides which contains at least one polynucleotide that differs from the others by an insertion at one position; and

(d) optionally selecting for a function of interest. Steps (a)-(b), (a)-(c) or (a)-(d) may be repeated from 1 to 50 times or more.
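The effect of repeating the insertion steps can be sketched with a toy Python model in which each round opens the circle at a new random linkage and adds one fragment before the ends are rejoined. The function names, the fragment sequences, and the uniform choice of break point are assumptions for illustration, and the optional selection step (d) is omitted here.

```python
import random


def insertion_round(circle: str, fragments, rng: random.Random) -> str:
    """One pass of steps (a)-(c): random break, add a fragment, rejoin."""
    cut = rng.randrange(len(circle))      # (a) cleave at a random linkage
    linear = circle[cut:] + circle[:cut]
    linear += rng.choice(fragments)       # (b) add residues at one end
    return linear                         # (c) joining the ends recloses the circle


def insertion_library(parent, fragments, rounds, copies, rng):
    """Apply `rounds` insertion rounds independently to `copies` molecules."""
    library = []
    for _copy in range(copies):
        molecule = parent
        for _round in range(rounds):
            molecule = insertion_round(molecule, fragments, rng)
        library.append(molecule)
    return library


rng = random.Random(7)
parent = "ATGGCCATTGTAATGGGCCGC"    # hypothetical 21-residue circle
fragments = ["GGATCC", "TTTAAA"]    # hypothetical 6-residue inserts
library = insertion_library(parent, fragments, rounds=2, copies=100, rng=rng)
```

After two rounds every member is 12 residues longer than the parent, and because the second break is chosen independently of the first, the two inserts generally sit at different positions.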

[85] Further, the invention provides a population of polynucleotides wherein the polynucleotides contain insertions at more than one position. It is contemplated that this embodiment of the invention will allow multiple novel fusions of genetic elements to occur. For example, the following could be fused to a gene of interest in a combinatorial fashion: a toxin could be fused to a targeting molecule (like an antibody), enzyme modules in important metabolic pathways (such as polyketide synthetases) could be fused in new ways, or new functions like multiple binding domains (i.e. nucleic acid binding domains, ion binding domains, protease sites, or other post-translational modification modules) could be incorporated into existing genetic elements.

Combinations of insertions and deletions

[86] In one embodiment, the invention provides a method to create a population of polynucleotides, with members of the population differing from one another by the presence of deletions and insertions at a single random position. This method comprises the steps of:

(a) cleavage of a composition of multiple copies of polynucleotides at random positions to create two ends;

(b) subjecting said polynucleotides from step (a) to a process which removes at least one nucleotide from one end of said ends of said polynucleotides;

(c) subjecting said polynucleotides from step (b) to a process which inserts at least one nucleotide to at least one end of said DNA ends of said polynucleotides from step (b);

(d) optionally subjecting said polynucleotides from step (c) to a process which covalently joins said DNA ends to one another, producing a library of polynucleotides which contains at least one polynucleotide that differs from the others by a deletion and insertion at one position.

[87] Further, the invention provides a population of polynucleotides, with members differing from one another by a combination of deletions and insertions at a single random position. It is contemplated that this embodiment will allow for new heterologous domains to replace domains in the gene of interest. In this regard, new functions, such as ligand binding or enzymatic catalysis could be conferred upon a genetic element. Also, native function could be enhanced utilizing this embodiment.

[88] In another embodiment, the invention provides a method to generate polynucleotides wherein the polynucleotides contain insertions and deletions at more than one position. In this regard deletions may occur at different positions than insertions, or deletions and insertions can occur at the same position. Further, deletions and/or insertions can occur at multiple positions. This method comprises the steps of:

(a) cleavage of a composition of multiple copies of polynucleotides at random positions to create two ends;

(b) subjecting said polynucleotides from step (a) to a process which removes at least one nucleotide from one end of said ends of said polynucleotides;

(c) optionally subjecting said polynucleotides from step (b) to a process which inserts at least one nucleotide to at least one end of said ends of said polynucleotides;

(d) optionally subjecting said polynucleotides from step (c) to a process which covalently joins said ends to one another, producing a library of polynucleotides which contains at least one polynucleotide that differs from the others by a deletion and insertion at one position;

(e) optionally selecting for a function of interest; and optionally repeating any of steps (a) to (d) from 1 to 50 times or more.

[89] Further, the invention provides a population of polynucleotides wherein the polynucleotides contain insertions and deletions at more than one position. It is contemplated that this embodiment of the invention will allow for classical directed evolution, wherein multiple rounds of insertions at random positions, deletions at random positions, and combinations of insertions and deletions, are produced with the genetic element being optionally subjected to selection between each round. This embodiment allows for the improvement or alteration of the function of a genetic element.
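The round structure of this embodiment (diversify, then optionally select, then repeat) can be outlined in Python. This is a toy sketch only: the scoring function `g_count` stands in for a real screen or selection, and all names, rates, and sequences are hypothetical. Deletion is applied in every pass and insertion in about half of them, reflecting that step (c) is optional.

```python
import random


def mutate(circle: str, rng: random.Random, max_del: int = 3,
           inserts=("A", "C", "G", "T")) -> str:
    """Steps (a)-(d) on one molecule: break, delete, optionally insert, rejoin.

    Assumption: the molecule stays longer than `max_del` residues.
    """
    cut = rng.randrange(len(circle))          # (a) random break
    linear = circle[cut:] + circle[:cut]
    linear = linear[rng.randint(1, max_del):] # (b) remove residues from one end
    if rng.random() < 0.5:
        linear += rng.choice(inserts)         # (c) optional insertion
    return linear                             # (d) ends rejoined


def evolve(parent, score, rounds, library_size, keep, rng):
    """Alternate library construction with selection of the top scorers."""
    pool = [parent]
    for _round in range(rounds):
        library = [mutate(rng.choice(pool), rng) for _ in range(library_size)]
        pool = sorted(library, key=score, reverse=True)[:keep]  # (e) selection
    return pool


def g_count(seq: str) -> int:
    """Hypothetical stand-in for a functional assay: count G residues."""
    return seq.count("G")


rng = random.Random(1)
survivors = evolve("ATGGCCATTGTAATGGGCCGCTGA", g_count,
                   rounds=3, library_size=50, keep=5, rng=rng)
```

In practice the selection step would be a genetic screen on expressed clones rather than a function of the sequence string.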

Starting material

[90] The present invention can be applied to any polynucleotide of interest to the researcher. The polynucleotide can be nucleic acid, i.e. RNA or DNA. Often the polynucleotide will be DNA consisting of genetic elements or one or more genes of interest. The starting material may be obtained through natural sources, or may be polynucleotides which have been synthesized in a laboratory (e.g. gene synthesis), or may be polynucleotides derived from natural sources which have been manipulated in a laboratory. Several sources of polynucleotides are available through publicly held databanks such as Genbank (http://www.ncbi.nlm.nih.gov:80/Genbank/index.html) or available commercially (Celera, Rockville, MD; Incyte, Palo Alto, CA; Clontech, Palo Alto, CA; Invitrogen, Carlsbad, CA).

[91] The nucleic acid may be obtained from any source, for example, from plasmids such as pBR322, from cloned DNA or RNA or from natural DNA or RNA from any source including bacteria, yeast, viruses and higher organisms such as plants or animals. DNA or RNA may be extracted from blood or tissue material. The template polynucleotide may be obtained by amplification using the polymerase chain reaction (PCR) [Mullis, U.S. Patent # 4,683,202 (1987); Mullis et al., U.S. Patent # 4,683,195 (1987)]. Alternatively, the polynucleotide may be present in a vector present in a cell and sufficient nucleic acid may be obtained by culturing the cell and extracting the nucleic acid from the cell by methods known in the art.

[92] The choice of vector depends on the size of the polynucleotide sequence and the host cell to be employed in the methods of this invention. The templates may be plasmids, phages, cosmids, phagemids, viruses (e.g., retroviruses, parainfluenzavirus, herpesviruses, reoviruses, paramyxoviruses, and the like), or selected portions thereof (e.g., coat protein, spike glycoprotein, capsid protein). For example, cosmids, phagemids, YACs, and BACs are preferred where the specific nucleic acid sequence to be mutated is larger because these vectors are able to stably propagate large nucleic acid fragments.

[93] If the specific nucleic acid sequence is cloned into a vector it can be clonally amplified by inserting each vector into a host cell and allowing the host cell to amplify the vector. This is referred to as clonal amplification because while the absolute number of nucleic acid sequences increases, the number of mutants does not increase.

[94] Starting material should be in substantially pure form. The polynucleotide may be double-stranded or single-stranded, but more preferably is double-stranded. Further, the polynucleotide may be linear or circular, but in a preferred embodiment the polynucleotide is circular. Polynucleotides in circular form may be prepared by preparation of plasmid DNA from organisms such as bacteria, yeast, plants, or mammalian cells by techniques well known to those skilled in the art [Maniatis et al., (1989)]. The number of different specific nucleic acid fragments in the reaction vessel will be at least about 100, preferably at least about 500, and more preferably at least about 1000.

[95] The starting material (i.e. the polynucleotide), while in substantially pure form, can also be present without homologs or related sequences. In other words, the polynucleotides in the initial vessel may all be identical, although they may also be related, unrelated or heterologous. In fact, performance of the present invention will be unaffected by the sequence of the starting material. Furthermore, the sequence of the starting material may be known or unknown. For directed evolution purposes, all that is required is a method to detect the function of the polynucleotide (such as a screening assay).

Cleaving the polynucleotide at a random position

[96] In general, a nucleic acid fragment may be cleaved by a number of different methods. The nucleic acid fragment may be digested with a nuclease such as DNase I, S1 nuclease, P1 or mung bean nuclease, or RNase, which are readily available. Other enzymes, such as RAG1 and RAG2, topoisomerases, and integrases are capable of cleaving polynucleotides. The nucleic acid may be randomly sheared by the method of sonication or by passage through a tube having a small orifice. The use of radiation, such as gamma radiation or ultraviolet radiation, is also capable of cleaving polynucleotides. Chemical agents, such as bleomycin or methyl methanesulfonate (MMS), can also cleave polynucleotides.

[97] Of substantial importance to the generation of functionally mutated genes containing insertions or deletions is to cleave the polynucleotide a small number of times, usually between 1 and 10, preferably between 1 and 5, and most preferably once. The present invention provides a means to cleave a polynucleotide such that cleavage occurs only at one position per polynucleotide in the reaction vessel. Of importance is that the present invention provides a means for a near random cleavage of a polynucleotide (i.e. cleavage at several different positions in different molecules). Cleavage can be double-stranded or single-stranded (i.e. produce single-stranded ends or double-stranded ends). Examples of enzymes which can cleave polynucleotides include DNase I, S1 nuclease, P1 nuclease, as well as topoisomerases, transposons, and integrases. Cleavage can occur transiently with enzymes such as topoisomerases, transposons, and integrases. These enzymes may cleave the polynucleotide once, or more than once. S1 nuclease can be used to cleave double or single-stranded polynucleotides in a generally random fashion. In a preferred embodiment, with circular double-stranded DNA, S1 nuclease will cleave the polynucleotide only once, producing two DNA ends (FIG. 4).

[98] It is also contemplated that the nucleic acid may also be partially digested with one or more restriction enzymes which cleave DNA at a high frequency (i.e. at several positions within a polynucleotide), such that certain polynucleotides are cleaved only once, and that the resulting population contains polynucleotides cleaved one time, but with different polynucleotides cleaved at different positions. The cleavage with a restriction enzyme may not be entirely random, but if the genetic element of interest has enough specific restriction sites at different positions, the cleavage pattern may be useful enough to generate substantial diversity.

[99] It is contemplated that single cleavage of a polynucleotide can be accomplished through other alternative mechanisms which normally cleave polynucleotides several times. A polynucleotide can be randomly sheared by the method of sonication or by passage through a tube having a small orifice. The use of radiation, such as gamma radiation or ultraviolet radiation, is also capable of cleaving polynucleotides. If any of these modalities is carefully titrated and a means of purification is utilized, the singly cleaved molecules can be obtained in substantially pure form (i.e. singly cleaved molecules can be purified away from uncleaved or multiply cleaved molecules).

[100] Furthermore, enzymes which act to cleave and rejoin DNA, such as topoisomerases, transposons, and integrases can be utilized to effectively cleave a polynucleotide [Singh et al., Proc Natl Acad Sci 94: 1304-9 (1997)]. In these cases the cleavage and rejoining steps may be coupled. Preferably the DNA ends are linked, or are in physical proximity to one another, following cleavage. This is in order to prevent the re-ligation of the wrong ends to one another following deletional or insertional events. One mechanism to keep the ends linked is through the use of a circular polynucleotide as a starting material. In this case, the ends are linked by the intervening polynucleotide chain. Thus, the re-ligation will be an intramolecular event as opposed to intermolecular, and will proceed with greater efficiency. Another mechanism to keep the ends in proximity is a protein bridge, such as chromatin (i.e. histones, or other DNA binding proteins), or enzymes which couple cleavage with rejoining, such as transposons, integrases, or topoisomerases. Alternatively, ends could conceivably be left in proximity to one another through the linkage of opposite ends (the non-cleaved ends) to solid supports.

[101] Cleavage of a circular polynucleotide consisting of supercoiled plasmid DNA can be accomplished by incubating from 0.1 to 100 μg, preferably from 1 to 10 μg, of the DNA with a nuclease such as S1 nuclease. The nuclease can be present in amounts from 0.1 to 1000 units, but preferably from 1 to 100 units, in a reaction of 10 μl. The temperature of the reaction can be between 0 and 100°C, but is preferably between 4 and 50°C. The reaction time can vary from 30 seconds to 1 hour, but preferably is between about 1 and 30 minutes. The degree of linearization can be measured by analyzing the plasmid DNA on an agarose gel as in FIG. 4. The linear DNA should preferably be purified from the uncut DNA by any of a number of methods well known to those skilled in the art. Such methods include utilization of agarose gel purification kits (Qiagen, Valencia, CA), HPLC, column chromatography and the like.

Deletion of nucleotides

[102] Nucleotide deletions can be generated at a DNA end by a variety of means. For instance, an exonuclease, such as exonuclease III, can be used to remove nucleotides in a 3' to 5' direction from a DNA end. The resulting DNA end then contains a 5' overhang which can be removed by digestion of the DNA with a single-stranded endonuclease such as P1 nuclease, S1 nuclease, or mung bean nuclease. Bal 31 nuclease is an enzyme which possesses 5' to 3' as well as 3' to 5' nucleolytic activity and can be used to delete nucleotides from a DNA end. Furthermore, several polymerases, like DNA polymerase I from E. coli, Klenow fragment, and Taq polymerase contain exonuclease activity and could conceivably be used to make deletions from a DNA end. Cell extracts from all organisms contain DNA repair enzymes which can act to delete nucleotides, thus impure cell extract could conceivably be used as a source for exonuclease activity. Other nucleases, which may not have exonuclease activity under certain conditions, may be capable of producing deletions at a DNA end under other conditions. For example, S1 nuclease can produce short deletions when used at high enzyme concentrations. Furthermore, it is contemplated that mild denaturation of a DNA molecule, such that the DNA ends become "frayed", will allow deletions to occur upon application of a single-stranded endonuclease, such as S1, P1, or mung bean nuclease.

[103] In a preferred embodiment the conditions of the deletion reaction are set such that the number of individual deletions occurring at each DNA end may be well controlled. For example, altering the salt concentration, altering the pH, altering the temperature, or altering any of the other biochemical parameters of the reaction can change the activity of the nuclease enzyme such that more or fewer deletions will occur depending on the intent of the investigator (for instance decreasing temperature or increasing salt may lower the processivity of the exonuclease and cause fewer deletions). Figure 5 shows how altering the conditions allows differing numbers of deletions to occur on a DNA end. In some cases large deletions might be warranted (e.g. to completely remove a large domain in a genetic element), in other cases small deletions might be preferable (e.g. to remove a single amino acid, or a few amino acids such as those that comprise a protease site). Generally deletions could be obtained numbering from 1 to 1000, more preferably they would be from 1 to 100. In certain instances, as described, the deletions may number from 1 to 10.

[104] Due to cleavage at a random position in the polynucleotide, the deletions in the resulting polynucleotide will also be located at a random position. Also, since residues are deleted from either end of the molecule, the total number of deletions will equal the sum of the deletions occurring on the 5' end and the 3' end.
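The bookkeeping in the preceding paragraph can be illustrated in silico. The following is a minimal sketch only, assuming the circular polynucleotide is represented as a Python string; the function name `delete_at_random_cut` and its parameters are hypothetical and are not part of the described method. It cuts the circle once at a random position, trims a chosen number of bases from each of the two new ends, and reports the total number deleted as the sum of the two.

```python
import random


def delete_at_random_cut(circular_seq, del_end_a, del_end_b, rng):
    """Model one cleavage of a circular sequence followed by end trimming.

    circular_seq: string standing in for a circular polynucleotide.
    del_end_a / del_end_b: bases removed from each of the two new ends
    (illustrative parameters, not values taken from the specification).
    Returns (cut position, trimmed linear product, total bases deleted).
    """
    n = len(circular_seq)
    total = del_end_a + del_end_b
    if del_end_a < 0 or del_end_b < 0 or total >= n:
        raise ValueError("deletion sizes must be non-negative and smaller than the sequence")
    # A single cleavage at a uniformly random position linearizes the circle.
    cut = rng.randrange(n)
    linear = circular_seq[cut:] + circular_seq[:cut]
    # Trim del_end_b bases from the start and del_end_a bases from the finish.
    trimmed = linear[del_end_b:n - del_end_a]
    return cut, trimmed, total


if __name__ == "__main__":
    rng = random.Random(1)
    parent = "ATGGCTAGCAAGGAGCTGTTCACC"
    for _ in range(3):
        cut, product, total = delete_at_random_cut(parent, 2, 3, rng)
        print(cut, total, product)
```

Rejoining the two trimmed ends would place the combined deletion at the original cut site, which is why the deletion position tracks the random cleavage position.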

Adding nucleotides

[105] In order to make additions to a polynucleotide in random positions, the polynucleotide is necessarily cleaved at a random position, as described above. Prior to insertion, nucleotides may be deleted from the DNA ends produced during the cleavage event. Alternatively, the DNA ends formed by the cleavage reaction can be used as substrates to which new nucleotides or polynucleotides are added.

[106] Several different mechanisms exist to add nucleotides to the ends of a polynucleotide. For example, nucleotides can be added by chemical coupling. A polymerase, such as terminal deoxynucleotidyl transferase, can be utilized to add nucleotides in a semirandom fashion to a DNA end [Gauss & Lieber, Mol Cell Biol 16: 258-69 (1996)]. Alternatively, the cleavage step may be coupled to the insertion event, as can be the case when employing transposons or integrases for the insertion event.

[107] A ligase such as E. coli ligase or phage T4 ligase can be utilized to covalently couple a new polynucleotide to the parent polynucleotide. In a preferred embodiment the polynucleotide is a genetic element or a fragment of a genetic element. A genetic element predisposes the resulting polynucleotide to have function since genetic elements are functional in some way by definition. The genetic element may be a gene, the regulatory element of a gene, or a genetic element encoding a useful domain. The genetic element may be a library of genetic elements such as a cDNA library or genomic DNA library. Fragments of a genetic element can be produced by digesting the polynucleotide with a nuclease, such as DNase I, S1 nuclease, P1 or mung bean nuclease, or RNase. Other enzymes, such as restriction enzymes and topoisomerases, can also cleave polynucleotides into fragments. The polynucleotides may be randomly sheared by the method of sonication or by passage through a tube having a small orifice. The use of radiation, such as gamma radiation or ultraviolet radiation, is also capable of cleaving polynucleotides into fragments. Chemical agents, such as bleomycin or MMS, can also cleave polynucleotides into fragments.

[108] It is contemplated that the mixture of a parent polynucleotide cleaved at a random position, with a population of genetic elements or fragments of genetic elements, and a ligase such as T4 DNA ligase, under the appropriate salt, buffer, and temperature conditions, will allow covalent coupling of the genetic elements with the parent polynucleotide at the position of the original cleavage event. Thus, a mixture of polynucleotides is produced comprising an insertion at a random position within the parent polynucleotide. The content (i.e. the sequence) of each insertion may be identical if the genetic elements or fragments of genetic elements are identical, or different if the fragments of genetic elements were non-identical.
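As a rough in silico analogy for the ligation mixture described above, the sketch below builds a small insertion library from a parent string and a pool of fragment strings. It is an illustration under stated assumptions, not the laboratory procedure: sequences are plain Python strings, the cut position and the fragment are each drawn uniformly at random, and the names `insert_at_random_cut` and `build_insertion_library` are hypothetical.

```python
import random


def insert_at_random_cut(parent, fragments, rng):
    """Insert one randomly chosen fragment at one random position of parent.

    Returns (cut position, inserted fragment, resulting sequence).
    """
    # Positions 0..len(parent) cover every junction, including both termini.
    cut = rng.randrange(len(parent) + 1)
    fragment = rng.choice(fragments)
    return cut, fragment, parent[:cut] + fragment + parent[cut:]


def build_insertion_library(parent, fragments, size, seed=0):
    """Return `size` independent insertion variants of the same parent."""
    rng = random.Random(seed)
    return [insert_at_random_cut(parent, fragments, rng) for _ in range(size)]


if __name__ == "__main__":
    library = build_insertion_library("ATGGCTAGCAAG", ["TTT", "GGGCCC"], size=5)
    for cut, fragment, product in library:
        print(cut, fragment, product)
```

Passing a single-element fragment list models the case where every insertion has identical content, while a mixed list models non-identical fragments.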

Rejoining the DNA ends

[109] DNA ends may be rejoined covalently by incubating the DNA ends with an enzyme like a DNA ligase which will form phosphodiester bonds between nucleotides at the DNA end. Examples of ligases include E. coli DNA ligase, phage T4 DNA ligase, or human DNA ligases. These enzymes can be used under conditions well known to those skilled in the art to ligate DNA. Other enzymes are also capable of creating covalent linkages (like phosphodiester bonds) between nucleotides at DNA ends. Such enzymes are topoisomerases, transposons, integrases, and other recombination enzymes. Other mechanisms can be used to join DNA ends such as the utilization of an oligonucleotide whose sequence can hybridize to sequences on either end (i.e. both the 5' and 3' ends) to "bridge" the ends with hydrogen bonds. The intervening sequence on the opposite strand could be filled in with a polymerase, such as E. coli polymerase, Klenow fragment, phage T4 polymerase, or Taq polymerase. Nicks could then be repaired by a DNA ligase as described above. Cellular extracts also contain ligase activities and cell or nuclear extracts could be used to rejoin DNA ends. Alternatively, DNA molecules could be introduced into intact cells and the cell's machinery could rejoin DNA ends by homologous or non-homologous means.

Library compositions

[110] The present invention provides for novel libraries of which the following compositions are examples:

Deletions

[111] The invention provides a population of polynucleotides, with members of the population differing from one another by the presence of deletions at a single random position. Such single deletion libraries can contain at least 2 molecules, but preferably 100 molecules, and most preferably at least about 1000 molecules. Deletion libraries should contain at least one molecule that differs from at least one other molecule by the deletion of at least one nucleotide at one random position. The number of deletions at each position could be from 1 to 1000, but should be at least one. It is contemplated that deletions will allow removal of detrimental or unwanted functions of a genetic element. These functions might include protease sites, ion binding domains, DNA binding sequences for inhibitory transcription factors, immunogenic domains of proteins and the like.

[112] Further, the invention provides a population of polynucleotides wherein the polynucleotides contain deletions at more than one position. Such a library should contain at least 2 molecules, but preferably 100 molecules, and most preferably at least about 1000 molecules. These multiple deletion libraries should contain at least one molecule that differs from at least one other molecule by the deletion of at least one nucleotide at more than one random position. It is contemplated that deletions at multiple positions will allow removal of multiple detrimental or unwanted functions of a genetic element. These functions might include any combination of multiple protease sites, ion binding domains, DNA binding sequences for inhibitory transcription factors, immunogenic domains of proteins and the like.

Insertions

[113] The invention provides a population of polynucleotides, with members of the population differing from one another by the presence of insertions at a single random position. Insertion libraries can contain at least 2 molecules, but preferably 100 molecules, and most preferably at least about 1000 molecules. Insertion libraries should contain at least one molecule that differs from at least one other molecule by the insertion of at least one nucleotide at one random position. The number of insertions at each position could be from 1 to 10,000, but preferably will be at least one. For example, a toxin could be fused to a targeting molecule (like an antibody), enzyme modules in important metabolic pathways (such as polyketide synthetases) could be fused in new ways, or a new function like binding domains (i.e. nucleic acid binding domains, ion binding domains, protease sites, or other post-translational modification modules) could be incorporated into existing genetic elements.

[114] Further, the invention provides a population of polynucleotides wherein the polynucleotides contain insertions at more than one position. Such a library should contain at least 2 molecules, but preferably 100 molecules, and most preferably at least about 1000 molecules. These multiple insertion libraries should contain at least one molecule that differs from at least one other molecule by the insertion of at least one nucleotide at more than one random position. It is contemplated that this embodiment of the invention will allow multiple novel fusions of genetic elements to occur. For example, the following could be fused to a gene of interest in a combinatorial fashion: a toxin could be fused to a targeting molecule (like an antibody), enzyme modules in important metabolic pathways (such as polyketide synthetases) could be fused in new ways, or a new function like binding domains (i.e. nucleic acid binding domains, ion binding domains, protease sites, or other post-translational modification modules) could be incorporated into existing genetic elements.

Combinations of insertions and deletions

[115] The invention provides a population of polynucleotides, with members differing from one another by a combination of deletions and insertions at a single random position. Such a library should contain at least 2 molecules, but preferably 100 molecules, and most preferably at least about 1000 molecules. These combination libraries should contain at least one molecule that differs from at least one other molecule by the insertion of one nucleotide and the deletion of at least one nucleotide at one random position. It is contemplated that this embodiment will allow for heterologous domains to replace domains in the gene of interest. In this regard, new functions, such as ligand binding or enzymatic catalysis could be conferred upon a genetic element. Also, native function could be enhanced utilizing this embodiment.

[116] Further, the invention provides a population of polynucleotides wherein the polynucleotides contain insertions and deletions at more than one position. Such a library should contain at least 2 molecules, but preferably 100 molecules, and most preferably at least about 1000 molecules. These combination libraries should contain at least one molecule that differs from at least one other molecule by the insertion of at least one nucleotide at one random position and the deletion of at least one nucleotide at one random position. This embodiment of the invention will allow for classical directed evolution, wherein multiple rounds of insertions at random positions, deletions at random positions, and combinations of insertions and deletions, are produced with the gene of interest being optionally subjected to selection between each round. This embodiment allows for the improvement or alteration of function of a genetic element.

Analyzing the composition

[117] The composition of such libraries can be determined by mechanisms well known to those in the art. In order to determine whether a library contains insertions or deletions, the library can be analyzed by agarose or acrylamide gel electrophoresis and its size can be compared to that of the parental sequence. Other methods, like HPLC, mass spectrometry, and column chromatography, can be used to identify size differences between polynucleotides.

Because the present invention relates to random positions of insertions and deletions, the most definitive method of determining the composition of a library is to subject representative polynucleotides within the composition to sequencing, a method well known to those skilled in the art. Comparison of sequences of representative clones would allow one to determine if deletions or insertions occurred at random positions in different molecules in the library.
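The sequence comparison described above can be sketched computationally. The example below is a simplified illustration, assuming each clone differs from the parent at a single site and that sequences are error-free strings; the function name `locate_indel` is hypothetical, and a real analysis would more likely rely on an established alignment tool. It strips the longest shared prefix and suffix and reports what remains as the deleted and inserted bases. Where the sequence around the site is repetitive, the reported position is only one of several equivalent placements.

```python
def locate_indel(parent, variant):
    """Locate a single-site change between parent and variant.

    Returns (position, deleted_bases, inserted_bases), where position is the
    0-based index in parent at which the two sequences first differ.
    """
    shortest = min(len(parent), len(variant))
    # Length of the longest common prefix.
    prefix = 0
    while prefix < shortest and parent[prefix] == variant[prefix]:
        prefix += 1
    # Length of the longest common suffix that does not overlap the prefix.
    suffix = 0
    while suffix < shortest - prefix and parent[-1 - suffix] == variant[-1 - suffix]:
        suffix += 1
    deleted = parent[prefix:len(parent) - suffix]
    inserted = variant[prefix:len(variant) - suffix]
    return prefix, deleted, inserted


if __name__ == "__main__":
    parent = "ATGGCCAAGTTT"
    clones = ["ATGGAGTTT", "ATGGCCTTTAAGTTT", "ATGTTTAAGTTT"]
    for clone in clones:
        print(locate_indel(parent, clone))
```

Tabulating the reported positions across many clones gives a direct view of whether the changes are spread across the parent sequence or clustered at a few sites.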

[118] The resulting library could be ligated into an expression vector for use as a vehicle to express the resulting variants contained within the library. The nature of the expression vector is described below in the "screening" section.

Screening for a function of interest

[119] In testing a library of polynucleotides for a function of interest, the library should be inserted in an appropriate expression vector. Alternatively, the library can be constructed in an expression vector (i.e. the library comprises an expression vector). The vector used for cloning is not critical provided that it will accept a DNA fragment of the desired size. If expression of the DNA fragment is desired, the cloning vehicle should further comprise transcription and translation signals next to the site of insertion of the DNA fragment to allow expression of the DNA fragment in the host cell. For screening in bacterial cells, preferred vectors include the pUC series and the pBR series of plasmids.

[120] The resulting bacterial population will include a number of recombinant DNA fragments having random mutations. This mixed population may be tested to identify the desired recombinant nucleic acid fragment. The method of selection will depend on the DNA fragment desired.

[121] The choice of vector depends on the size of the polynucleotide sequence and the host cell to be employed in the methods of this invention. The templates may be plasmids, phages, cosmids, phagemids, viruses (e.g., retroviruses, parainfluenzavirus, herpesviruses, reoviruses, paramyxoviruses, and the like), or selected portions thereof (e.g., coat protein, spike glycoprotein, capsid protein). For example, cosmids, phagemids, YACs, and BACs are preferred where the specific nucleic acid sequence is larger because these vectors are able to stably propagate large nucleic acid fragments.

[122] If a DNA fragment which encodes for a protein with increased binding efficiency to a ligand is desired, the proteins expressed by each of the DNA fragments in the population or library may be tested for their ability to bind to the ligand by methods known in the art (e.g. panning, affinity chromatography). If a DNA fragment which encodes for a protein with increased drug resistance is desired, the proteins expressed by each of the DNA fragments in the population or library may be tested for their ability to confer drug resistance to the host organism. One skilled in the art, given knowledge of the desired protein, could readily test the population to identify DNA fragments which confer the desired properties onto the protein.

[123] In the context of the present invention the term "positive polypeptide variants" means resulting polypeptide variants possessing functional properties which have been improved in comparison to the polypeptides producible from the corresponding input DNA sequences. Examples of such improved properties can be as different as, for example, enhanced or lowered biological activity, increased wash performance, thermostability, oxidation stability, substrate specificity, antibiotic resistance or others that may be of interest.

[124] Consequently, the screening method to be used for identifying positive variants depends on which property of the polypeptide in question it is desired to change, and in what direction the change is desired.

[125] A number of suitable screening or selection systems to screen or select for a desired biological activity are described in the art. For example, Strausberg et al. [Strausberg et al., Biotechnology (N Y) 13: 669-73 (1995)] describe a screening system for subtilisin variants having calcium-independent stability. Bryan et al. [Bryan et al., Proteins 1: 326-34 (1986)] describe a screening assay for proteases having an enhanced thermal stability.

[126] It is contemplated that one skilled in the art could use a phage display system in which fragments of the protein are expressed as fusion proteins on the phage surface (Pharmacia, Milwaukee Wis.). The recombinant DNA molecules are cloned into the phage DNA at a site which results in the transcription of a fusion protein, a portion of which is encoded by the recombinant DNA molecule. The phage containing the recombinant nucleic acid molecule undergoes replication and transcription in the cell. The leader sequence of the fusion protein directs the transport of the fusion protein to the tip of the phage particle. Thus the fusion protein which is partially encoded by the recombinant DNA molecule is displayed on the phage particle for detection and selection by the methods described above.

Methods of Effecting Targeted Short Deletions in Nucleic Acids

[127] The ability to make short deletions in a polynucleotide is generally hampered by the high activity and processivity of exonucleases that act at a DNA end. Several methods exist to make large (i.e. more than 100 base) deletions at DNA ends [Sambrook et al., (1989)]. However, methods to create short deletions, such as from 1 to 100 bases or very short deletions like from 1 to 10 bases in a controlled fashion have not been possible. The ability to make such deletions at specific sites is important in the field of protein engineering [Altamirano et al., Nature 403: 617-22 (2000)] and is highlighted in the end-joining mechanism of V(D)J recombination, the method which produces the substantial diversity in antibody genes [Smider & Chu, Sem. Immun. 9: 189-97 (1997)].

Starting material

[128] The deletion generating mechanism can be applied to any polynucleotide of interest to the researcher. The polynucleotide can be nucleic acid, i.e. RNA or DNA. Often the polynucleotide will be DNA consisting of genetic elements or one or more genes of interest. The starting material may be obtained through natural sources, or may be polynucleotides which have been synthesized in a laboratory (e.g. gene synthesis), or may be polynucleotides derived from natural sources which have been manipulated in a laboratory. Several sources of polynucleotides are available through publicly held databanks such as Genbank (http://www.ncbi.nlm.nih.gov:80/Genbank/index.html) or available commercially (Celera, Rockville, MD; Incyte, Palo Alto, CA; Clontech, Palo Alto, CA; Invitrogen, Carlsbad, CA).

[129] The nucleic acid may be obtained from any source, for example, from plasmids such as pBR322, from cloned DNA or RNA or from natural DNA or RNA from any source including bacteria, yeast, viruses and higher organisms such as plants or animals. DNA or RNA may be extracted from blood or tissue material. The template polynucleotide may be obtained by amplification using the polymerase chain reaction (PCR) [Mullis, U.S. Patent # 4,683,202 (1987); Mullis et al., U.S. Patent # 4,683,195 (1987)]. Alternatively, the polynucleotide may be present in a vector present in a cell and sufficient nucleic acid may be obtained by culturing the cell and extracting the nucleic acid from the cell by methods known in the art.

Deletion of nucleotides

[130] Nucleotide deletions can be generated at a DNA end by a variety of means. For instance, an exonuclease, such as exonuclease III, can be used to remove nucleotides in a 3' to 5' direction from a DNA end. Often the resulting DNA end contains a 5' overhang which can be removed by digestion of the DNA with a single-stranded endonuclease such as P1 nuclease, S1 nuclease, or mung bean nuclease. Other exonucleases could also be used in the present invention. Bal 31 nuclease is an enzyme which possesses 5' to 3' as well as 3' to 5' nucleolytic activity and can be used to delete nucleotides from a DNA end. Exonuclease T can remove nucleotides in a 3' to 5' direction. Exonuclease 7 can remove nucleotides in a 5' to 3' direction, and can act at single-stranded ends such as nicks or gaps. Exonuclease I catalyzes the removal of nucleotides from single-stranded DNA in the 3' to 5' direction. Lambda exonuclease is a highly processive enzyme that acts in the 5' to 3' direction, catalyzing the removal of 5' mononucleotides from duplex DNA. RecJ is a single-stranded DNA specific exonuclease that catalyzes the removal of deoxynucleotide monophosphates from DNA in the 5' to 3' direction. Furthermore, several polymerases, like DNA polymerase I from E. coli, Klenow fragment, and Taq polymerase contain exonuclease activity and could conceivably be used to make deletions from a DNA end. Cell extracts from all organisms contain DNA repair enzymes which can act to delete nucleotides, thus impure cell extract could conceivably be used as a source for exonuclease activity. Other nucleases, which may not have exonuclease activity under certain conditions, may be capable of producing deletions at a DNA end under other conditions. For example, S1 nuclease can produce short deletions when used at high enzyme concentrations. Furthermore, it is contemplated that mild denaturation of a DNA molecule, such that the DNA ends become "frayed", will allow deletions to occur upon application of a single-stranded endonuclease, such as S1, P1, or mung bean nuclease.

[131] In a preferred embodiment, the conditions of the deletion reaction are set such that the number of individual deletions occurring at each DNA end may be well controlled. For example, altering the salt concentration and the temperature, altering the pH, or altering any of the other biochemical parameters of the reaction can change the activity of the nuclease enzyme such that more or fewer deletions will occur depending on the intent of the investigator. Most particularly and surprisingly we have found that decreasing temperature and/or increasing salt lowers the processivity of the exonuclease and results in more controlled small deletions. Salts used in the reaction may be any salt. Examples of salts include sodium chloride, sodium acetate, potassium chloride, or potassium acetate. Preferably the salt is either sodium chloride or potassium chloride. The salt concentration can range from 10 mM to 1.0 M, but preferably is between 50 mM and 500 mM. Temperature of the reaction can also vary in the present invention. The temperature can range from 0°C to 30°C, but preferably is between 0°C and 24°C. Figure 5 shows how altering the conditions allows differing numbers of deletions to occur on a DNA end. In some cases large deletions might be warranted (e.g. to completely remove a large domain in a genetic element), in other cases small deletions might be preferable (e.g. to remove a single amino acid, or a few amino acids such as those that comprise a protease site). The resulting population of polynucleotides contains variable numbers of deletions at the ends of the starting sequence. Generally deletions could be obtained numbering from 1 to 1000, more preferably they would be from 1 to 100. In a preferred embodiment, the deletions may number from 1 to 30 or even 1 to 10.
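For planning purposes, the salt and temperature ranges stated above can be captured in a small helper. The sketch below is illustrative only: the numeric ranges are taken from the preceding paragraph, while the function names and the three labels ("preferred", "allowed", "outside") are hypothetical conveniences and say nothing about the deletion sizes a given condition will actually produce.

```python
# Ranges quoted in the paragraph above (salt in mM, temperature in degrees C).
SALT_MM_BROAD = (10.0, 1000.0)
SALT_MM_PREFERRED = (50.0, 500.0)
TEMP_C_BROAD = (0.0, 30.0)
TEMP_C_PREFERRED = (0.0, 24.0)


def classify(value, broad, preferred):
    """Label a value against a broad range and a narrower preferred range."""
    if preferred[0] <= value <= preferred[1]:
        return "preferred"
    if broad[0] <= value <= broad[1]:
        return "allowed"
    return "outside"


def classify_deletion_conditions(salt_mm, temp_c):
    """Return (salt label, temperature label) for a planned deletion reaction."""
    return (
        classify(salt_mm, SALT_MM_BROAD, SALT_MM_PREFERRED),
        classify(temp_c, TEMP_C_BROAD, TEMP_C_PREFERRED),
    )


if __name__ == "__main__":
    for salt_mm, temp_c in [(150, 4), (800, 28), (5, 37)]:
        print(salt_mm, temp_c, classify_deletion_conditions(salt_mm, temp_c))
```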

Rejoining the DNA ends

[132] In some cases it might be useful to join the DNA ends of a molecule containing a deletion with a second DNA end, such that the deletion now occurs at an internal position. Often the two ends to be ligated will be present on the same DNA molecule, such that the resulting ligation product is a circular polynucleotide. DNA ends may be rejoined by incubating the DNA ends with an enzyme like a DNA ligase which will form phosphodiester bonds between nucleotides at the DNA end. Examples of ligases include E. coli DNA ligase, phage T4 DNA ligase, or human DNA ligases. These enzymes can be used under conditions well known to those skilled in the art to ligate DNA. Other enzymes are also capable of creating covalent linkages (like phosphodiester bonds) between nucleotides at DNA ends. Such enzymes are topoisomerases, transposons, integrases, and other recombination enzymes. Other mechanisms can be used to join DNA ends such as the utilization of an oligonucleotide whose sequence can hybridize to sequences on either end (i.e. both the 5' and 3' ends) to "bridge" the ends with hydrogen bonds. The intervening sequence on the opposite strand could be filled in with a polymerase, such as E. coli polymerase, Klenow fragment, phage T4 polymerase, or Taq polymerase. Nicks could then be repaired by a DNA ligase as described above. Cellular extracts also contain ligase activities and cell or nuclear extracts could be used to rejoin DNA ends. Alternatively, DNA molecules could be introduced into intact cells and the cell's machinery could rejoin DNA ends by homologous or non-homologous means.

Deletion compositions

[133] In one embodiment the current invention provides for a composition of polynucleotides, wherein members of the population differ from one another by the presence of deletions at one or both ends of the polynucleotide. The number of deletions may range from 1 to 100 at each end, but more preferably is from 1 to 30.

[134] Additionally, the current invention provides for a composition of polynucleotides differing from one another by short deletions at a specific internal position (i.e. not at an end). This composition is obtained by joining the composition of polynucleotides with deletions at the ends to other DNA ends, such that the deletion now occurs internally. Often the two ends to be ligated will be present on the same DNA molecule, such that the resulting ligation product is a circular polynucleotide. The number of deletions may range from 1 to 100 at each end, but more preferably is from 1 to 30.
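As a non-limiting illustration of how deletions at the two ends of a linearized molecule become a single internal deletion once the ends are rejoined, the following sketch treats a circular polynucleotide as a string. The function names and the toy 12-base sequence are hypothetical and are not taken from the examples.

```python
def linearize(circular, cut):
    """Open a circular sequence (written as a string) at index `cut`."""
    cut %= len(circular)
    return circular[cut:] + circular[:cut]


def delete_ends(linear, left, right):
    """Remove `left` bases from one end and `right` bases from the other."""
    if left < 0 or right < 0 or left + right > len(linear):
        raise ValueError("deletion exceeds sequence length")
    return linear[left:len(linear) - right]


def same_circle(a, b):
    """True if strings a and b are rotations of each other (same circle)."""
    return len(a) == len(b) and b in (a + a)


# Toy circle, opened between C (index 5) and G (index 6).
circle = "AAACCCGGGTTT"
linear = linearize(circle, 6)           # "GGGTTTAAACCC"
trimmed = delete_ends(linear, 1, 2)     # loses G6 from one end, C4 and C5 from the other

# Rejoining the trimmed ends gives the original circle minus the
# contiguous stretch "CCG" that spanned the cleavage position.
expected = circle[:4] + circle[7:]      # "AAACGGTTT"
```

Here `same_circle(trimmed, expected)` holds: the bases removed from the two ends are adjacent in the original circle, so after rejoining they appear as one internal deletion at the position of cleavage.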

[135] All references and patent publications referred to herein are hereby incorporated by reference herein.

[136] As can be appreciated from the disclosure provided above, the present invention has a wide variety of applications. Accordingly, the following examples are offered for illustration purposes and are not intended to be construed as a limitation on the invention in any way.

EXAMPLES

Example 1: Random cleavage of a plasmid

[137] Molecular evolution techniques utilizing insertions or deletions require a gene to be cleaved, at least transiently, a small number of times. Optimally, each molecule within a mix is cleaved once, at different random positions. There is significant difficulty in preparing singly cleaved DNA, wherein cleavage occurs at random positions. Biondi et al. described a cumbersome method using DNase I and DNA polymerase to induce nicks, followed by further cleavage of these nicks to produce a double stranded break [Biondi et al., Nucleic Acids Res 26: 4946-52 (1998)]. This process required tedious and time-consuming cesium chloride gradient purification and linker ligation steps, and is not generally applicable to high throughput molecular biology techniques like molecular evolution.

[138] The strategy of utilizing a single-stranded endonuclease to induce double-stranded breaks at random positions in DNA has heretofore not been utilized. It was reasoned that a single-stranded nuclease, like S1, P1, or mung bean nuclease, would specifically cleave single-stranded regions in tightly supercoiled DNA, thus producing a nick. A nick is the natural substrate for these enzymes, so cleavage to produce a double-stranded break may then occur in the same reaction. Following cleavage, the single-stranded regions are no longer present since the plasmid is no longer supercoiled, so the DNA is no longer a substrate for the enzyme. Thus, cleavage would occur once and only once. This example illustrates the utility of this hypothesis.

[139] The plasmid pLacZi (Clontech, Palo Alto, CA) was used to illustrate the mechanism by which a polynucleotide can be cleaved at random positions. The plasmid was propagated in DH10B E. coli cells (Invitrogen, Carlsbad, CA) and plasmid was prepared by Qiagen maxiprep columns (Qiagen, Valencia, CA). Plasmid DNA at 200 ng/μl was incubated with 0.4, 2.0, 10, or 50 units of S1 nuclease (Promega, Madison, WI) in 1X S1 buffer (50 mM sodium acetate pH 4.5, 280 mM NaCl, 4.5 mM ZnSO₄) for 10 minutes at room temperature. The reaction was stopped by the addition of EDTA to 0.025 M and heated to 70°C for 10 minutes. Protein was removed by twice extracting with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, precipitated with sodium acetate and resuspended in water.

[140] Cleaved pLacZi was analyzed by 1.5% agarose gel electrophoresis (Figure 4, panel A). S1 nuclease cleaved plasmid was seen to co-migrate with pLacZi cleaved with Cla I, which cuts pLacZi once. Thus, S1 nuclease can linearize a circular DNA molecule. Although S1 nuclease is not known to cut DNA in a sequence specific manner, it was important to determine that the cleavage of plasmid by S1 was not site specific. To this end, linear plasmid produced by S1 cleavage was gel purified (Figure 4, panel B, lane 5), or purified and further cleaved with Cla I (lane 6). Controls included supercoiled plasmid (lane 2), plasmid linearized with Cla I (lane 3), or plasmid linearized with S1 nuclease and unpurified (lane 4). The S1/Cla I cleaved plasmid is seen as a smear, showing that S1 is cleaving in several different positions in the plasmid. If S1 cleaved at only one position, then the S1/Cla I cleaved plasmid would migrate as two bands; if S1 cleaved at two positions, then the S1/Cla I plasmid would migrate as three bands, and so on. The importance of this example is that a polynucleotide is able to be cleaved once (i.e. linearization of a circle), and only once, at different positions.
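As a non-limiting illustration of why a single fixed cleavage position gives discrete bands whereas random cleavage positions give a smear after the second digest, the following sketch computes fragment lengths for a circle opened once and then cut at a fixed site. The plasmid length and the site position are placeholder values chosen for illustration only; they are not the actual length of pLacZi or the actual position of its Cla I site.

```python
import random

PLASMID_LEN = 7000   # assumed placeholder length, in basepairs
CLA1_SITE = 1000     # assumed placeholder position of the single Cla I site


def second_digest_fragments(plasmid_len, first_cut, fixed_site):
    """Fragment lengths when a circle opened at `first_cut` is cut again at `fixed_site`."""
    a = (fixed_site - first_cut) % plasmid_len
    return sorted((a, plasmid_len - a))


def distinct_sizes(plasmid_len, fixed_site, first_cuts):
    """Set of non-zero fragment lengths over a population of singly cleaved molecules."""
    sizes = set()
    for cut in first_cuts:
        for fragment in second_digest_fragments(plasmid_len, cut, fixed_site):
            if fragment > 0:
                sizes.add(fragment)
    return sizes


# Every molecule opened at the same position: two fragment sizes.
site_specific = distinct_sizes(PLASMID_LEN, CLA1_SITE, [2500] * 200)

# Each molecule opened at its own random position: many fragment sizes.
rng = random.Random(0)
random_cuts = [rng.randrange(PLASMID_LEN) for _ in range(200)]
smear = distinct_sizes(PLASMID_LEN, CLA1_SITE, random_cuts)
```

Under these assumptions `site_specific` contains only two lengths, while `smear` contains a large number of different lengths, which on a gel would not resolve into discrete bands.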

Example 2: Deletions at a site in LacZ

[141] Nucleotide deletions have been made for structural analysis of genes, and for nucleotide sequence analysis. Generally these deletions are large, in the range of well over 100 nucleotides. Under normal conditions, for example, exonuclease III removes over 100 bases per minute [Sambrook et al., (1989)]. The ability to create small deletions, however, would be useful to alter small domains in proteins or remove deleterious functions. In order to make small deletions at the end of a polynucleotide, exonuclease III was utilized under various conditions of salt (Figure 5) and temperature. A fluorescently labeled 232 base pair PCR product from pLacZi was exposed to 100 mM, 150 mM, and 200 mM NaCl in the presence of 10 U exonuclease III (New England Biolabs, Beverly, MA) in 10 μl of 66 mM Tris-Cl (pH 7.4), 0.66 mM MgCl₂ at 15°C in a 5 minute reaction. The reaction was stopped by the addition of EDTA to 0.025 M, and extracted once with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, and precipitated with sodium acetate. DNA was resuspended in 20 μl deionized formamide, and 0.5 μl was run on a 6% polyacrylamide denaturing gel in an ABI 373 sequencer (Perkin-Elmer, Foster City, CA) set to the genescan setting according to the manufacturer's recommendation.

[142] Nearly 25 nucleotides can be removed under conditions of 100 mM NaCl (Figure 5, second panel), up to 15 nucleotides with 150 mM NaCl, and a few nucleotides with 200 mM NaCl (bottom panel).

[143] The Cla I site in pLacZi exists in the coding region of the LacZ gene. This site was utilized to make short deletions within the gene itself, which could then be analyzed further by PCR to determine the extent to which deletions were made. Additionally, plasmids containing deletions were selected on LB agar plates containing 40 μg/ml X-Gal to determine the functionality of the LacZ gene. The pLacZi plasmid (10 μg) was linearized with Cla I in 200 μl, then incubated with 20 U of S1 nuclease in 400 μl to remove the 2 bp 5' overhangs. Further, the linearized plasmid was concentrated and filtered through an ultrafree MC membrane (30 kD cutoff, Millipore, Bedford, MA), then brought to a volume of 400 μl in 1X calf intestinal phosphatase buffer containing 100 U of calf intestinal phosphatase (New England Biolabs, Beverly, MA) and incubated for 45 minutes at room temperature. Plasmid was extracted with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, precipitated with sodium acetate, and resuspended in water. The plasmid was then incubated with exonuclease III as described in Example 1, in the presence of either 100 mM, 150 mM or 200 mM NaCl for 5 minutes at 15°C in a 10 μl reaction. In a control arm, plasmid was not incubated with exonuclease III, to test for the frequency of religation of the dephosphorylated plasmid in the absence of deletions. After 5 minutes of exonuclease III reaction, a mix containing 50 U of S1 nuclease in 1X S1 buffer was added. This mix was further incubated at room temperature for 15 minutes. The reaction was stopped by the addition of EDTA to 0.025 M and heated to 70°C for 10 minutes. The DNA was then extracted once with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, precipitated with sodium acetate and resuspended in 10 μl of 1X ligase buffer containing 1.0 U of T4 DNA ligase (Invitrogen, Carlsbad, CA). Ligation reactions were incubated at 15°C for 12 hours. Electroporation of E. coli strain DH10B (Invitrogen, Carlsbad, CA) was accomplished with 1.0 μl of ligation mix. Cells were plated on LB agar plates containing 40 μg/ml X-Gal and 100 μg/ml ampicillin and incubated overnight at 30°C. Table 1 illustrates the results of the plating experiment.

Table 1. Colony characteristics after site directed deletions.

Treatment               Blue Colonies   White Colonies   Blue/White
No Exo III              0               0                -
Exo III, 100 mM NaCl    177             66               0.37
Exo III, 150 mM NaCl    340             140              0.41
Exo III, 200 mM NaCl    77              34               0.44

[144] Notably, no background is realized when dephosphorylated plasmid is not exposed to exonuclease III (first row, Table 1). Several blue and white colonies are evident with exonuclease III treatment under different salt concentrations. Interestingly, the theoretical maximum of the blue/white ratio is 0.33, since at least 2/3 of religations should be out of frame. However, the blue/white ratio in this experiment is slightly more than 0.33, and appears to increase as salt concentration increases. This bias may be due to the fact that a one basepair deletion from one end would allow in-frame religation to occur, and fewer deletions are favored as salt is increased. The statistical significance of this result has not been analyzed, so the true frequency may actually be nearer to 0.33.
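As a non-limiting illustration of the arithmetic behind Table 1 and paragraph [144], the following sketch recomputes the ratio column from the printed counts and enumerates reading frames for pairs of end deletions. In the table as printed, each ratio equals the second count divided by the first. The frame calculation assumes, purely for illustration, that every combination of deletion lengths at the two ends is equally likely; the function names are hypothetical.

```python
from fractions import Fraction

# Colony counts copied from Table 1, in the column order printed there.
TABLE_1 = {
    "100 mM NaCl": (177, 66),
    "150 mM NaCl": (340, 140),
    "200 mM NaCl": (77, 34),
}


def printed_ratio(first_count, second_count):
    """Second count divided by the first, rounded to two decimals as in Table 1."""
    return round(second_count / first_count, 2)


def in_frame_fraction(max_left, max_right):
    """Fraction of (left, right) deletion pairs whose total length is a multiple of 3.

    Assumes every pair in 0..max_left x 0..max_right is equally likely.
    """
    pairs = [(l, r) for l in range(max_left + 1) for r in range(max_right + 1)]
    in_frame = sum(1 for l, r in pairs if (l + r) % 3 == 0)
    return Fraction(in_frame, len(pairs))


ratios = {salt: printed_ratio(*counts) for salt, counts in TABLE_1.items()}
```

With deletions of 0 to 8 basepairs at each end, `in_frame_fraction(8, 8)` is exactly 1/3, which corresponds to the statement that at least 2/3 of religations should be out of frame under a uniform distribution of deletion lengths.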

[145] Six of the colonies were analyzed by PCR with primers flanking the Cla I site. FIG 6 shows these results. In the upper panel the wild-type 312 basepair fragment from pLacZi is shown. Clone 1 contains a 21 basepair in frame deletion (PCR product of 291 bases) and retains a blue phenotype. Clone 2 contains a 4 basepair out of frame deletion (PCR product of 308 bases) and has a white phenotype. Clone 3 contains a 9 basepair in frame deletion (PCR product of 303 bases) and has a white phenotype. Clone 4 contains a 6 basepair in frame deletion (PCR product of 306 bases) and has a white phenotype. Clone 5 contains a 7 basepair out of frame deletion (PCR product of 305 bases) and has a white phenotype. Clone 6 has a 3 basepair in frame deletion (PCR product of 309 bases) and has a blue phenotype. Although it may be thought that shorter deletions would lead to a less severe phenotype, this experiment illustrates that this is not necessarily the case. Clone 1 contains a deletion encompassing 7 amino acids but retains function, whereas clones 3 and 4 contain shorter in frame deletions but do not retain function. Furthermore, this example illustrates the ability of deletional technology to search functional sequence space.
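As a non-limiting illustration, the deletion size and reading frame of each clone follow directly from the PCR product lengths reported above and the 312 basepair wild-type fragment. The dictionary layout and function name below are hypothetical.

```python
WILD_TYPE_BP = 312  # wild-type PCR fragment length from paragraph [145]

# Clone number -> (PCR product length in bases, colony colour), from paragraph [145].
CLONES = {
    1: (291, "blue"),
    2: (308, "white"),
    3: (303, "white"),
    4: (306, "white"),
    5: (305, "white"),
    6: (309, "blue"),
}


def describe(product_bp):
    """Return (deleted basepairs, in_frame, whole codons removed or None)."""
    deleted = WILD_TYPE_BP - product_bp
    in_frame = deleted % 3 == 0
    codons = deleted // 3 if in_frame else None
    return deleted, in_frame, codons


summary = {clone: describe(bp) for clone, (bp, _colour) in CLONES.items()}
```

For clone 1 this gives a 21 basepair in frame deletion, i.e. 7 codons, consistent with the deletion of 7 amino acids noted in the text.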

Example 3: Insertions in LacZ

[146] Insertions of random DNA in the LacZ gene were accomplished by employing DNase I to fragment cDNA derived from CHO cells, followed by ligation of these fragments into linearized pLacZi. Since cDNA is by definition functional, it is contemplated that the use of cDNA will optimize the likelihood of obtaining functional proteins. CHO cell cDNA (5 μg) was fragmented with 0.001 units of DNase I in a buffer containing 40 mM Tris-Cl pH 7.4 and 10.0 mM MgCl₂ for 5 minutes at room temperature. The reaction was stopped by the addition of EDTA to 0.025 M and heated to 70°C in the presence of 10 μg of proteinase K. DNA was extracted with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, and precipitated with sodium acetate. Plasmid linearized with Cla I or S1 nuclease was dephosphorylated as described above, then again extracted with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, and precipitated with sodium acetate. To insert random cDNA fragments into plasmid DNA, 0.2 mg of linearized, dephosphorylated plasmid was incubated with 1 ng of cDNA fragments in the presence of T4 DNA ligase (1.0 U) in a reaction volume of 10 ml at 15°C for 12 hours. As controls, linearized plasmid was incubated with ligase in the absence of cDNA fragments, and cDNA fragments were incubated with ligase in the absence of linearized vector. DH10B E. coli were then electroporated with 1.0 μl of each ligation mix.

[147] Several E. coli colonies were identified in the vector plus insert arms of the experiment which exhibited either white, intermediate, or blue phenotype on X-Gal plates. PCR across the Cla I site in the colonies which arose from vector linearized with Cla I ligated to cDNA fragments revealed several clones containing inserts of sizes from 100-300 basepairs. Three of these are illustrated in FIG 7. Thus, the insertion of fragments of cDNA into a genetic element can be accomplished with the present invention.

Example 4: Functional changes at random positions

[148] The lac operon is a model system by which genetic elements are easily studied. The enzyme β-galactosidase is encoded by the LacZ gene, but is normally only produced when lactose is present in the environment. Control of enzyme levels is accomplished at the level of transcription. The lac repressor protein binds to the operator sequence upstream from the ATG start site of LacZ, and inhibits transcription by RNA polymerase. In the presence of lactose, however, the repressor is removed from the operator and transcription can proceed. The mechanism of promoter activation is the binding of lactose, the inducer, to the lac repressor, which causes an allosteric change that dramatically decreases the repressor's affinity for the operator. In the laboratory setting, LacZ transcription can be assessed by plating E. coli on the colorimetric substrate X-Gal, which causes colonies to turn blue when hydrolyzed by β-galactosidase. The operator can be derepressed by utilizing the lactose analog IPTG, which is non-hydrolyzable, and strongly induces LacZ transcription by binding the lac repressor.
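As a non-limiting illustration, the regulatory logic described above can be reduced to a simple truth function. This is a deliberately simplified sketch: the function and argument names are hypothetical, and real colonies can show intermediate colour that a yes/no model does not capture.

```python
def colony_is_blue(lacz_functional, repressor_binds_operator, inducer_present):
    """Simplified model of colony colour on X-Gal plates.

    lacz_functional: the plasmid still encodes and expresses active beta-galactosidase.
    repressor_binds_operator: the operator is intact and lac repressor is available.
    inducer_present: IPTG (or lactose) is present to release the repressor.
    """
    repressed = repressor_binds_operator and not inducer_present
    return lacz_functional and not repressed


# Unmodified plasmid in a repressor-producing strain.
with_inducer = colony_is_blue(True, True, True)         # blue
without_inducer = colony_is_blue(True, True, False)     # white

# Hypothetical deletion that prevents repressor binding at the operator.
operator_deleted = colony_is_blue(True, False, False)   # blue without inducer

# Hypothetical deletion that abolishes beta-galactosidase production.
coding_deleted = colony_is_blue(False, True, True)      # white even with inducer
```

In this model a deletion can change colony colour in either direction: loss of repression gives blue colonies without inducer, and loss of functional LacZ gives white colonies with inducer.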

[149] In order to test the ability of random deletions to affect gene function, the pBluescript II KS+ plasmid was linearized with S1 nuclease, gel purified, dephosphorylated, and subjected to exonuclease III digestion as described in Examples 1 and 2. Linearized plasmid at 20 ng/μl was incubated with 10 U exonuclease III in 66 mM Tris-Cl pH 7.4, 0.66 mM MgCl₂ buffer at 15°C for 5 minutes, followed by addition of 1X S1 solution containing 50 mM sodium acetate pH 4.5, 280 mM NaCl, 4.5 mM ZnSO₄ and 10 U S1 nuclease, and incubation for 15 minutes at room temperature. The reaction was stopped by adding EDTA to 0.025 M, and the DNA was extracted with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), once with an equal volume of ether, and precipitated with sodium acetate. DNA was resuspended in 1X T4 DNA ligase buffer containing 1.0 U T4 DNA ligase and incubated at 15°C for 12 hours. The ligation reaction (1 μl) was then used to electroporate E. coli strain TOP10F', which produces the lac repressor protein (Invitrogen, Carlsbad, CA). The E. coli were incubated on LB plates either with or without IPTG as inducer, and in the presence of X-Gal to measure β-galactosidase activity. Additionally, pBluescript plasmid was plated in the presence or absence of IPTG on X-Gal containing plates. Table 2 illustrates the results of the experiment.

Table 2. Functional changes in transcription of β-galactosidase

                          + IPTG            - IPTG
                          Blue    White     Blue    White
pBluescript               100%    0         0       100%
pBluescript/deletions     66%     34%       2%      98%

Several colonies gained the ability to transcribe LacZ in the absence of the inducer IPTG in the arm of the experiment where deletions were made at random positions. Additionally, several colonies lost their ability to produce functional β-galactosidase in the presence of IPTG. One white colony in the presence of IPTG from the pBluescript/deletions arm was sequenced and found to have an eight basepair deletion at the translation start site. This sequence is illustrated below, with the translation start site (ATG) encoding methionine codon underlined.

CACACAGGAAA--------ACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAA
CACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAA

(SEQ ID NO: 1 and SEQ ID NO: 2, respectively)

Thus, random cleavage of a plasmid, followed by short deletions made by exonuclease III can cause functional changes in regulatory and protein coding regions of genetic elements. These changes can then be detected with a functional assay.

SEQUENCE LISTING

SEQ ID NO: 1

Mutation in 5' end of gene encoding β-galactosidase

CACACAGGAAAACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAA

SEQ ID NO: 2

5' end of wild type gene encoding β-galactosidase

CACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAA
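As a non-limiting illustration, the eight basepair deletion reported in Example 4 can be located by comparing SEQ ID NO: 1 against SEQ ID NO: 2. The sketch below simply tries every possible position for a single contiguous deletion; the function name is hypothetical and the approach is not intended for large sequences or for multiple mutations.

```python
WILD_TYPE = "CACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAA"  # SEQ ID NO: 2
MUTANT = "CACACAGGAAAACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAA"            # SEQ ID NO: 1


def locate_deletion(wild_type, mutant):
    """Return (start_index, deleted_bases) for a single contiguous deletion.

    start_index is 0-based; the leftmost consistent position is returned.
    """
    size = len(wild_type) - len(mutant)
    if size <= 0:
        raise ValueError("mutant must be shorter than wild type")
    for start in range(len(wild_type) - size + 1):
        if wild_type[:start] + wild_type[start + size:] == mutant:
            return start, wild_type[start:start + size]
    raise ValueError("sequences do not differ by a single contiguous deletion")


start, deleted = locate_deletion(WILD_TYPE, MUTANT)
```

Applied to the two listed sequences, the deleted stretch is the eight bases CAGCTATG beginning after the eleventh base of SEQ ID NO: 2, and it contains an ATG.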

Claims

WHAT IS CLAIMED IS:

1. A method for generating a library of polynucleotide sequences having nucleotide deletions at differing positions in a sequence of a genetic element comprising the steps of: (a) subjecting multiple copies of circular polynucleotides comprising the genetic element to random cleavage to obtain multiple linear polynucleotides each polynucleotide having at least one 3' and 5' end; and (b) subjecting said polynucleotides from step (a) to a process which removes at least one nucleotide from one of said ends of said polynucleotides producing a library of deletion polynucleotide sequences, said library comprising multiple deletion polynucleotide sequences with deletions at different random positions.

2. The method of claim 1, further wherein said polynucleotides from step (b) are subjected to a process that covalently joins said 3' and 5' ends to one another.

3. The method of claim 1, wherein said library of polynucleotides is further subjected to a process that selects for a function of interest.

4. The method of claim 1, wherein the cleavage occurs with an endonuclease.

5. The method of claim 4, wherein the endonuclease is S1.

6. The method of claim 1, wherein the library of deletion polynucleotides comprises at least 5 individual polynucleotides each having a random deletion at a different position from the others.

7. The method of claim 1, wherein the library of deletion polynucleotides comprises at least 10 individual polynucleotides each having a random deletion at a different position from the others.

8. The method of claim 1, wherein the library of deletion polynucleotides comprises at least 30 individual polynucleotides each having a random deletion at a different position from the others.

9. The method of claim 1, wherein the composition of multiple copies of circular polynucleotides is free of naturally-occurring homologs to the genetic element.

10. The method of claim 1, wherein steps (a) and (b) are repeated.

11. The method of claim 1, wherein step (b) further includes a process for inserting nucleotides at the position of deletion.

12. The method of claim 1, wherein 1-3 nucleotides are deleted in step (b).

13. The method of claim 1, wherein 50-100 nucleotides are deleted in step (b).

14. A substantially pure composition comprising a library of multiple linear polynucleotides each having a different 3' and a 5' end, but each linear polynucleotide being identical to the others if circularized.

15. The composition of claim 14, wherein said library comprises at least 5 polynucleotides having a different 3' and a 5' end.

16. A substantially pure composition comprising a library of at least 2 deletion polynucleotides each differing from the other only by having a different random deletion.

17. The substantially pure composition of claim 16, wherein said deletion polynucleotides further comprise at least one nucleotide inserted at the position of deletion.

18. The composition of claim 16, wherein the library has at least 5 polynucleotides each differing from the other only by having a different random deletion.

19. A method for generating a library of polynucleotide sequences having nucleotide additions at random positions in a genetic element comprising the steps of: (a) subjecting a composition of multiple copies of circular polynucleotides with the genetic element to random cleavage to obtain multiple linear polynucleotides each polynucleotide having at least one 3' and 5' end; and (b) subjecting said polynucleotides from step (a) to a process which adds at least one nucleotide to one of said ends of said polynucleotides producing a library of addition polynucleotide sequences, said library comprising multiple addition sequences with additions at different random positions.

20. The method of claim 19, further wherein said addition polynucleotides from step (b) are subjected to a process that covalently joins said 3' and 5' ends to one another.

21. The method of claim 19, further subjecting said library of polynucleotides to a process that selects for a function of interest.

22. The method of claim 19, wherein the cleavage occurs with an endonuclease.

23. The method of claim 22, wherein the endonuclease is S1.

24. The method of claim 19, wherein the library of addition polynucleotides comprises at least 5 individual polynucleotides each having a random addition of nucleotides at a different position from the others.

25. The method of claim 19, wherein the library of addition polynucleotides comprises at least 10 individual polynucleotides each having a random addition at a different position from the others.

26. The method of claim 19, wherein the library of addition polynucleotides comprises at least 30 individual polynucleotides each having a random addition at a different position from the others.

27. The method of claim 19, wherein the composition of multiple copies of circular polynucleotides is free of naturally-occurring homologs to the genetic element.

28. The method of claim 19, wherein steps (a) and (b) are repeated.

29. The method of claim 19, wherein step (b) includes a process for deleting nucleotides at the point of addition.

30. The method of claim 19, wherein 1-3 nucleotides are added in step (b).

31. The method of claim 19, wherein 3-50 nucleotides are added in step (b).

32. The method of claim 19, wherein 50-100 nucleotides are added in step (b).

33. A substantially pure composition comprising a library of at least 2 addition polynucleotides each differing from the other only by having a different random addition.

34. A substantially pure composition comprising a library of at least 5 addition polynucleotides each differing from the other only by having a different random addition.

35. A method for producing short deletions from the end of a polynucleotide by incubating a population of polynucleotides with an exonuclease at a temperature from 0°C to 24°C in the presence of 10 to 500 mM salt, thereby producing a population of polynucleotides containing deletions of 1-100 residues from at least one end of the polynucleotide.

36. The method of claim 35, wherein the polynucleotide is double-stranded.

37. The method of claim 35, wherein the exonuclease is exonuclease III.

38. The method of claim 36, wherein the double-stranded nucleic acid is incubated with a single-stranded endonuclease to produce a blunt end.

39. The method of claim 35, further wherein the resulting population of polynucleotides containing deletions at the ends are covalently joined to at least a second end, producing a population of polynucleotides containing a deletion at an internal position.

40. The method of claim 38, wherein the single-stranded endonuclease is S1 nuclease.

41. The method of claim 39, wherein the polynucleotides resulting from covalent joining are circular polynucleotides.

42. The method of claim 35 wherein the population of polynucleotides contains deletions of 1-50 residues from at least one end of the polynucleotide.

43. The method of claim 35, wherein the population of polynucleotides contains deletions of 1-30 residues from at least one end of the polynucleotide.

44. A substantially pure composition of at least two polynucleotides each having two ends and each differing from one another only by having different deletions of 1 to 100 residues at one or both ends.

45. The composition of claim 44, wherein the composition of polynucleotides differs from one another by deletions of 1 to 50 residues at one or both ends.

46. The composition of claim 44, wherein the composition of polynucleotides differs from one another by deletions of 1 to 30 residues at one or both ends.

47. The composition of claim 44, wherein the composition of polynucleotides differs from one another by deletions of 1 to 10 residues at one or both ends.

48. A substantially pure composition of at least two polynucleotides each differing from one another only by deletions of 1 to 100 residues at a specific internal position within the polynucleotides.

49. The substantially pure composition of claim 48, wherein the polynucleotides differ from one another by deletions of 1 to 50 residues at the specific internal position.

50. The substantially pure composition of claim 48, wherein the polynucleotides differ from one another by deletions of 1 to 30 residues at the specific internal position.

51. The substantially pure composition of claim 48, wherein the polynucleotides differ from one another by deletions of 1 to 10 residues at the specific internal position.