WO2008127213A2 - Methods, systems, and software for regulated oligonucleotide-mediated recombination - Google Patents

Publication number
WO2008127213A2
Authority
WO
WIPO (PCT)
Prior art keywords
character strings
population
nucleic acids
amino acid
recombination
Prior art date
Application number
PCT/US2002/014866
Other languages
French (fr)
Other versions
WO2008127213A3 (en)
Inventor
Jeremy Minshull
Claes Gustafsson
Sridar Govindarajan
Ajoy Roy
Original Assignee
Maxygen, Inc.
Priority date
Filing date
Publication date
Application filed by Maxygen, Inc.
Priority to AU2002368549A
Publication of WO2008127213A2
Publication of WO2008127213A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1089 Design, preparation, screening or analysis of libraries using computer algorithms
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B01 PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01J CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
    • B01J2219/00 Chemical, physical or physico-chemical processes in general; Their relevant apparatus
    • B01J2219/00274 Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
    • B01J2219/0068 Means for controlling the apparatus of the process
    • B01J2219/007 Simulation or virtual synthesis
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B01 PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01J CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
    • B01J2219/00 Chemical, physical or physico-chemical processes in general; Their relevant apparatus
    • B01J2219/00274 Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
    • B01J2219/00718 Type of compounds synthesised
    • B01J2219/0072 Organic compounds
    • B01J2219/00722 Nucleotides

Abstract

The present invention provides novel in silico recombination techniques in which all or part of a nucleic acid recombination procedure is performed or modeled in a digital system. In particular, this invention relates to methods of designing oligonucleotides for regulated recombination that approximates linkage characteristics obtained from fragmentation-based recombination techniques, such as family-based recombination. The methods of this invention include adjusting overlap regions in pairs of overlapping oligonucleotide character strings to bias recombination towards a desired genetic linkage. This invention also provides systems, computer program products, and kits for practicing the methods of the invention.

Description

METHODS, SYSTEMS, AND SOFTWARE FOR REGULATED OLIGONUCLEOTIDE-MEDIATED RECOMBINATION
REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 60/289,947, filed May 9, 2001 by Minshull et al., the disclosure of which is incorporated by reference.
COPYRIGHT NOTIFICATION
[0002] Pursuant to 37 C.F.R. § 1.71(e), a portion of this disclosure contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
[0003] Directed evolutionary techniques, such as recursive sequence recombination, provide for the rapid evolution of nucleic acids and encoded polypeptides whether performed in vitro or in vivo. Genetic algorithm guided approaches to recombination have similarly been utilized to evolve character string representations of such biological molecules "in silico" (in a computer system). One significant technical advance relating to these technologies came with the realization that recombining sequences from multiple related nucleic acids could dramatically accelerate evolution. More particularly, these multi-parental (e.g., FamilyShuffling™) fragmentation-based recombination technologies generally include recombining oligonucleotides from two or more phylogenetically-related parent molecules to create libraries that include millions of different chimeric sequences, which can be selected or screened for new and/or improved traits or properties. An illustration of the attributes of family-based directed evolution involved the recombination of four cephalosporinase genes, which yielded a 270- to 540-fold improvement in moxalactamase activity in a single round of recombination, whereas single gene recombination of the four genes individually resulted in only eightfold improvements. Crameri et al. (1998) "DNA Shuffling of a Family of Genes from Diverse Species Accelerates Directed Evolution," Nature 391:288-291.
[0004] A number of publications by the inventors and co-workers, as well as by other investigators in the art, further describe techniques that facilitate family-based recombination. These include, e.g., Christians et al. (1999) "Directed Evolution of Thymidine Kinase for AZT Phosphorylation Using DNA Family Shuffling," Nat. Biotechnol. 17:259-264, Ness et al. (1999) "DNA Shuffling of Subgenomic Sequences of Subtilisin," Nat. Biotechnol. 17:893-896, Chang et al. (1999) "Evolution of a Cytokine Using DNA Family Shuffling," Nat. Biotechnol. 17:793-797, Hansson et al. (1999) "Evolution of Differential Substrate Specificities in Mu Class Glutathione Transferase Probed by DNA Shuffling," J. Mol. Biol. 287:265-276, Kikuchi et al. (1999) "Novel Family Shuffling Methods for the In Vitro Evolution of Enzymes," Gene 236:159-167, Kikuchi et al. (2000) "An Effective Family Shuffling Method Using Single-Stranded DNA," Gene 243:133-137, Ostermeier et al. (1999) "Incremental Truncation as a Strategy in the Engineering of Novel Biocatalysts," Bioorg. Med. Chem. 7:2139-2144, Stemmer (1994) "DNA Shuffling by Random Fragmentation and Reassembly: In Vitro Recombination for Molecular Evolution," Proc. Natl. Acad. Sci. USA 91:10747-10751, and Ostermeier et al. (1999) "A Combinatorial Approach to Hybrid Enzymes Independent of DNA Homology," Nat. Biotechnol. 17:1205-1209.
[0005] In addition to the above noted publications, aspects relating to family-based recombination are also described in various U.S. Patents including, e.g., Stemmer, U.S. Patent No. 5,603,793, entitled "METHODS FOR IN VITRO RECOMBINATION," Stemmer et al., U.S. Pat. No. 5,830,721, entitled "DNA MUTAGENESIS BY RANDOM FRAGMENTATION AND REASSEMBLY," Stemmer et al., U.S. Pat. No. 5,811,238, entitled "METHODS FOR GENERATING POLYNUCLEOTIDES HAVING DESIRED CHARACTERISTICS BY ITERATIVE SELECTION AND RECOMBINATION," and Stemmer et al., U.S. Pat. No. 5,834,252, entitled "END COMPLEMENTARY POLYMERASE REACTION."
[0006] Synthetic oligonucleotide-based recombination techniques generally involve building libraries of genes using degenerate oligonucleotides in gene assembly reactions. These techniques are described further in, e.g., Published International Application Nos. WO 00/42561, entitled "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION," by Crameri et al. and WO 01/23401, entitled "USE OF CODON-VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SEQUENCE RECOMBINATION," by Welch et al.
[0007] One limitation of certain approaches to synthetic oligonucleotide-based recombination is that oligonucleotides in a population are typically designed to anneal with one another to allow extension without regard to degeneracies within oligonucleotides. As a consequence, over distances greater than one oligonucleotide (e.g., 40 or 50 nucleotides), genetic information within a family of sequences may become completely unlinked. In a functional protein, amino acids close in space in the tertiary structure may be a considerable distance apart in the primary structure of the polypeptide. However, it is proximity in space that generally determines properties, including whether the protein folds properly, the shape of an active site, and protein stability. Although a detailed analysis of large numbers of genes would reveal patterns of amino acid covariance reflecting the importance of compatibility in the folded protein, in practice there is rarely sufficient sequence information to determine which amino acids need to be linked. In contrast to synthetic oligonucleotide-based recombination, during family-based recombination, genetic linkages are generally maintained within a sequence family, because fragments derived from a given family of genes possess greater homology to one another than to gene fragments from other families and hence are more likely to recombine with one another than with the non-familial fragments.
[0008] Accordingly, improved synthetic oligonucleotide-mediated recombination methods that more closely approximate linkage attributes of family-based recombination (e.g., capturing amino acid covariation, etc.) would be desirable. The present invention is directed to these and other features by providing methods of designing oligonucleotides for regulated recombination, which approximate the linkage characteristics of family-based recombination to yield high quality libraries. The invention also relates to systems and software for performing the methods described herein. These and many other features will be apparent upon complete review of the following disclosure.
SUMMARY OF THE INVENTION
[0009] The present invention generally relates to the directed evolution of nucleic acids and encoded polypeptides. More specifically, the invention provides new in silico recombination techniques, in which all or part of a nucleic acid recombination procedure is performed or modeled in a digital system. In particular, the invention includes methods that provide for the maintenance of genetic linkages beyond individual synthetic oligonucleotide lengths in populations of such sequences used for recombination. Significantly, the regulated oligonucleotide-mediated recombination methods of the present invention yield results that approximate, e.g., the linkage characteristics of family-based recombination.
[0010] In one aspect, the invention relates to methods of designing oligonucleotides for regulated recombination. The methods include adjusting an overlap region of a pair of overlapping oligonucleotide character strings in a population of overlapping oligonucleotide character strings to provide a selected population that includes adjusted overlapping oligonucleotide character strings such that recombination of two or more members of the selected population is biased towards a desired genetic linkage. The method typically includes adjusting the overlap region of a plurality of pairs of overlapping oligonucleotide character strings. In preferred embodiments, the adjusting step is performed in a digital system (e.g., under the control of a genetic algorithm or the like) and typically includes changing at least one nucleotide in one or more portions of the overlap region such that oligonucleotide overlap is increased or decreased. As a result, a probability of hybridization for the overlap region is increased or decreased at a selected temperature. For example, the adjusting step optionally includes determining an annealing frequency for the overlap region and changing at least one nucleotide in one or more portions of the overlap region such that the annealing frequency of the overlap region is substantially proportional to the desired genetic linkage.
[0011] In certain embodiments, the methods disclosed herein include applying genetic operators to character string representations of biological molecules (e.g., nucleic acids, polypeptides, or the like). For example, the methods optionally also include determining sequences of recombinant nucleic acids resulting from in silico recombination of the selected population of adjusted overlapping oligonucleotide character strings. These embodiments also typically include performing in silico simulations of activity for the recombinant nucleic acids or expression products therefrom.
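By way of illustration only, the effect of changing one nucleotide in an overlap region can be sketched as follows. The scoring function is an assumption made for this sketch and is not taken from the disclosure: it borrows the weights of the Wallace rule (2 per A/T pair, 4 per G/C pair) and counts matched positions only, as a crude proxy for the probability of hybridization. The names overlap_score and change_nucleotide and the 15-nucleotide sequence are hypothetical.

```python
def overlap_score(first: str, second: str) -> int:
    """Crude stability proxy for an overlap region (illustrative assumption).

    Both strings are written on the same strand. Each matching position
    contributes 4 (G or C) or 2 (A or T); mismatches contribute nothing.
    """
    if len(first) != len(second):
        raise ValueError("overlap strings must have equal length")
    score = 0
    for x, y in zip(first.upper(), second.upper()):
        if x == y:
            score += 4 if x in "GC" else 2
    return score


def change_nucleotide(seq: str, pos: int, base: str) -> str:
    """Return a copy of seq with the character at index pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


overlap = "ATGGCCAAGCTGACC"                    # hypothetical 15-nt overlap
weakened = change_nucleotide(overlap, 2, "A")  # introduce a single mismatch

print(overlap_score(overlap, overlap))   # score of the perfectly matched overlap
print(overlap_score(overlap, weakened))  # lower score once the overlap is decreased
```

A lower score stands in for a lower probability of hybridization at the selected temperature; a nearest-neighbor thermodynamic model would be a more realistic choice in practice.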
[0012] In some embodiments, the methods additionally include providing (e.g., synthesizing, etc.) a population of nucleic acids that correspond to the selected population that includes the adjusted overlapping oligonucleotide character strings. These embodiments also typically include recombining the population of nucleic acids with a polymerase and/or a ligase (e.g., by performing assembly PCR) to provide a population of recombined nucleic acids. For example, the providing step typically includes synthesizing a set of single-stranded oligonucleotides that corresponds to the selected population that includes the adjusted overlapping oligonucleotide character strings. The methods typically further include selecting or screening an encoded polypeptide of a member of the population of recombined nucleic acids for a desired trait or property. In other embodiments, the methods also include selecting (e.g., enriching for) a member of the population of the recombined nucleic acids, which member includes the desired genetic linkage. Optionally, the selecting step includes amplifying the member using nucleic acid primers that correspond to a portion of the desired genetic linkage. As another option, the selecting step includes affinity purifying the member using nucleic acid sequences that correspond to a portion of the desired genetic linkage.
[0013] In another aspect, the invention provides additional methods of designing oligonucleotides for regulated recombination that are also typically performed in a digital system, such as a computer. The methods include providing at least two parental polypeptide character strings, which character strings, when aligned for maximum identity, include at least one amino acid difference, and providing a desired amino acid linkage. The methods also include reverse-translating the parental polypeptide character strings into parental polynucleotide character strings, and segmenting each of the parental polynucleotide character strings into overlapping oligonucleotide character strings. The methods additionally include adjusting an overlap region of a pair of overlapping oligonucleotide character strings such that recombination is biased towards the desired amino acid linkage to provide a selected population that includes adjusted overlapping oligonucleotide character strings. The methods generally include adjusting the overlap region of a plurality of pairs of overlapping oligonucleotide character strings.
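The reverse-translating and segmenting steps described above can be sketched as follows, under stated assumptions. The codon table lists one standard codon per amino acid and was chosen only for illustration; it does not represent the codon bias of any particular expression host. The function names, the default lengths, and the example peptide are hypothetical, and all oligonucleotide character strings are kept on one strand for simplicity, whereas an assembly reaction would typically alternate strands.

```python
# One valid codon per amino acid (illustrative choice, not a host codon-bias table).
CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def reverse_translate(protein: str) -> str:
    """Map a polypeptide character string to a polynucleotide character string."""
    return "".join(CODON[aa] for aa in protein.upper())


def segment(seq: str, oligo_len: int = 40, overlap: int = 20) -> list[str]:
    """Cut seq into same-strand windows in which neighbors share `overlap` characters."""
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than oligo_len")
    step = oligo_len - overlap
    oligos, start = [], 0
    while True:
        oligos.append(seq[start:start + oligo_len])
        if start + oligo_len >= len(seq):
            break
        start += step
    return oligos


dna = reverse_translate("MKTAYIAKQR")   # hypothetical parental peptide
oligos = segment(dna, oligo_len=12, overlap=6)
for oligo in oligos:
    print(oligo)
```

Each consecutive pair in the resulting list shares an overlap region, which is the unit that the adjusting step subsequently operates on.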
[0014] Typically, the at least two parental polypeptide character strings, when aligned for maximum identity, include at least one region of amino acid sequence similarity. The methods also typically include deriving the desired amino acid linkage from a portion of at least one of the at least two parental polypeptide character strings. For example, at least two of the at least two parental polypeptide character strings are optionally orthologs or paralogs. Additionally, at least two of the at least two parental polypeptide character strings are optionally members of an identical or a different phylogenetic family. Further options include defining the phylogenetic family computationally or manually. Furthermore, the methods optionally include reverse-translating at least one of the at least two parental polypeptide character strings according to a species codon-bias of a selected expression host.
[0015] In preferred embodiments, the methods further include performing each step in a digital system, e.g., under the direction of a genetic algorithm. For example, the first providing step typically includes inputting the at least two parental polypeptide character strings into the digital system, while the second providing step typically includes inputting the at least one desired linkage into the digital system. Optionally, the second providing step includes calculating the at least one desired amino acid linkage using a probabilistic technique (e.g., a Markov chain modeling method or the like) or manually selecting the at least one desired amino acid linkage. In preferred embodiments, the desired amino acid linkage limits a size of the selected population that includes adjusted overlapping oligonucleotide character strings. In addition, one or more members of the selected population that includes adjusted overlapping oligonucleotide character strings optionally capture amino acid covariation. The methods optionally further include graphically displaying at least one member of the selected population that includes adjusted overlapping oligonucleotide character strings, e.g., using an output device, such as a monitor.
[0016] The methods described herein are typically utilized to capture amino acid covariation to produce high quality synthetic libraries. For example, the second providing step optionally includes (a) aligning the at least two parental polypeptide character strings for maximum identity to produce a parental polypeptide character string profile and (b) identifying allowed sequence paths through the parental polypeptide character string profile. Thereafter, the method includes (c) selecting the at least one desired amino acid linkage from the allowed sequence paths identified in (b). In certain embodiments, (b) includes quantifying a site-entropy for one or more amino acid sites in the parental polypeptide character string profile to identify the allowed sequence paths. Optionally, (b) includes quantifying a mutual information content between pairs of amino acid sites in the parental polypeptide character string profile to identify the allowed sequence paths. In one class of preferred embodiments, the allowed sequence paths include Markov chains. One advantage of these methods is that the allowed sequence paths identified in (b) limit a size of the selected population that includes adjusted overlapping oligonucleotide character strings. Furthermore, according to these methods, members of the selected population that includes adjusted overlapping oligonucleotide character strings typically capture amino acid covariation. The amino acid covariation generally corresponds, e.g., to a structural or functional domain, a phylogenetic motif, or the like. Optionally, at least some of the amino acid covariation is artificially defined (e.g., by the user, etc.). The amino acid covariation captured by a member of the selected population that includes adjusted overlapping oligonucleotide character strings typically corresponds to from about two to about 20 amino acids.
[0017] The adjusting steps of the methods described herein typically include changing at least one nucleotide in one or more portions of the overlap region such that oligonucleotide overlap is increased or decreased, which increases or decreases a probability of hybridization for the overlap region at a selected temperature. Optionally, the one or more portions are disposed proximal to an end of an overlapping oligonucleotide character string in the at least one overlap region. This generally provides an additional level of control to the methods of regulating oligonucleotide-mediated recombination that are described herein. In certain embodiments, the adjusting steps include determining an annealing frequency for the overlap region and changing at least one nucleotide in one or more portions of the overlap region such that the annealing frequency of the overlap region is substantially proportional to the at least one desired amino acid linkage. The annealing frequency typically includes, e.g., a percentage of the at least one pair of overlapping oligonucleotide character strings that anneals at a selected temperature.
[0018] In preferred embodiments, the methods also include providing a population of nucleic acids that corresponds to the selected population that includes the adjusted overlapping oligonucleotide character strings. These embodiments also typically include recombining the population of nucleic acids, e.g., to provide a population of recombined nucleic acids. At least one member of the population of recombined nucleic acids typically encodes a full-length protein. The providing step typically includes synthesizing a set of single-stranded oligonucleotides that corresponds to the selected population that includes the adjusted overlapping oligonucleotide character strings.
[0019] The recombining step optionally includes annealing one or more members of the population of nucleic acids to one or more other members of the population of nucleic acids to provide an annealed nucleic acid, and elongating and/or ligating the annealed nucleic acid to provide the population of recombined nucleic acids. Optionally, the population of recombined nucleic acids is reiteratively recombined. For example, the methods optionally further include fragmenting (e.g., chemically, enzymatically, or the like) the population of recombined nucleic acids to provide fragmented nucleic acids, denaturing the fragmented nucleic acids to provide denatured nucleic acids, and hybridizing the denatured nucleic acids to provide hybridized nucleic acids. Thereafter, the methods typically include elongating or ligating, or both elongating and ligating, the hybridized nucleic acids to provide a population of further recombined nucleic acids.
[0020] Following recombination, the invention typically includes performing various downstream operations. For example, the method optionally further includes deconvoluting, sequencing, or cloning one or more members of the population of recombined nucleic acids, or expressing the population of recombined nucleic acids to provide a recombined polypeptide product. Other options include introducing members of the population of recombined nucleic acids into at least one cell in which the introduced members are expressed to provide a recombined polypeptide product to the at least one cell. The methods also typically include selecting or screening the recombined polypeptide product for a desired trait or property. For example, the desired trait or property is optionally selected or screened for in an assay selected from, e.g., an in vivo selection assay, a parallel solid phase assay, an in vitro selection assay, or the like.
[0021] In yet another aspect, the invention provides computer implemented methods of maintaining genetic linkages within sequence families over distances greater than a length of a single oligonucleotide during synthetic recombination. The methods include inputting at least one amino acid sequence character string into the computer, and calculating a desired amino acid linkage using a probabilistic technique. The method further includes reverse-translating the at least one amino acid sequence character string into at least one corresponding nucleic acid sequence character string, and segmenting the at least one corresponding nucleic acid sequence character string into at least two overlapping oligonucleotide character strings. In addition, the methods include inputting an annealing temperature for an assembly reaction, and adjusting overlap between the at least two overlapping oligonucleotide character strings such that recombination is biased towards the desired amino acid linkage at the at least one annealing temperature, thereby providing at least two adjusted overlapping oligonucleotide character strings.
[0022] The adjusting step generally includes changing at least one nucleotide in at least one overlap region such that oligonucleotide character string overlap is increased or decreased, which increases or decreases a probability of annealing for the overlap region at the annealing temperature. The method typically further includes providing a population of overlapping oligonucleotides that corresponds to the at least two adjusted overlapping oligonucleotide character strings, and recombining the population of overlapping oligonucleotides with a polymerase and/or a ligase to provide a population of recombined nucleic acids. For example, the providing step typically includes synthesizing a set of single-stranded oligonucleotides that corresponds to the at least two adjusted overlapping oligonucleotide character strings.
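As a hedged illustration of taking an input annealing temperature into account, the following sketch lengthens an overlap window until a rough melting-temperature estimate reaches that temperature. The estimate uses the Wallace rule (2 degrees per A/T, 4 degrees per G/C), which is a coarse approximation for short oligonucleotides and is an assumption of this sketch rather than a method stated in the disclosure. The function names, the 15 to 25 nucleotide bounds used as defaults, and the example sequence are hypothetical.

```python
from typing import Optional


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def shortest_overlap(seq: str, start: int, anneal_temp: float,
                     min_len: int = 15, max_len: int = 25) -> Optional[str]:
    """Return the shortest window beginning at `start` whose estimated Tm
    reaches anneal_temp, or None if no window within the bounds qualifies."""
    for n in range(min_len, max_len + 1):
        window = seq[start:start + n]
        if len(window) < n:          # ran off the end of the sequence
            break
        if wallace_tm(window) >= anneal_temp:
            return window
    return None


seq = "AT" * 10 + "GC" * 5           # hypothetical 30-nt region
favored = shortest_overlap(seq, 0, anneal_temp=50)
print(favored, None if favored is None else len(favored))
```

Lengthening an overlap in this way would be expected to favor annealing at that junction; a junction that should recombine less often could instead be given a window whose estimate falls below the annealing temperature.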
[0023] In one aspect, the invention relates to methods of designing oligonucleotides for regulated recombination that include providing a population of overlapping oligonucleotide character strings and selecting at least one pair of overlapping oligonucleotide character strings from the population of overlapping oligonucleotide character strings. Each overlapping oligonucleotide character string typically includes, e.g., between about 20 and about 60 nucleotides, while the at least one overlap region of the selected pair of overlapping oligonucleotide character strings generally includes, e.g., between about 15 and about 25 nucleotides. The methods also include changing at least one character in at least one overlap region of the at least one selected pair (e.g., a plurality, etc.) of overlapping oligonucleotide character strings to adjust a probability of hybridization of oligonucleotides corresponding in sequence to the selected pair of overlapping oligonucleotide character strings, thereby designing oligonucleotides for regulated recombination. For example, the selected pair of overlapping oligonucleotide character strings optionally correspond in sequence to subsequences from at least two different phylogenetic families of polynucleotides. The methods regulate recombination by maintaining genetic linkages within sequence families over distances greater than a length of any individual member of the population of nucleic acids.
[0024] The adjusting step typically includes changing at least one nucleotide in one or more portions of the at least one overlap region such that oligonucleotide overlap is increased or decreased, thereby increasing or decreasing a probability of hybridization for the at least one overlap region at a selected temperature. The methods optionally include, e.g., performing the changing step in a logic device or performing the changing step manually. In certain embodiments, the methods include graphically displaying at least one member of the population of overlapping oligonucleotide character strings.
[0025] In preferred embodiments, the methods include providing a population of nucleic acids that include one or more pairs of designed oligonucleotides, and recombining the population of nucleic acids with a polymerase or a ligase, or both a polymerase and a ligase, to provide a population of recombined nucleic acids. The providing step typically includes synthesizing a set of single-stranded oligonucleotides. These embodiments also generally include selecting or screening an encoded polypeptide of at least one member of the population of recombined nucleic acids for at least one desired trait or property. Optionally, the methods include selecting at least one member of the population of the recombined nucleic acids, which member comprises the at least one desired genetic linkage. For example, the selecting step optionally includes amplifying the at least one member using one or more nucleic acid primers that correspond to at least a portion of the at least one desired genetic linkage, or affinity purifying the at least one member using one or more nucleic acid sequences that correspond to at least a portion of the at least one desired genetic linkage.
[0026] In further aspects, the present invention provides a system that includes a logic device and a computer readable medium operably connected to the logic device that stores at least one computer program for designing oligonucleotides for regulated recombination. The system typically includes various additional components for performing assorted operations beyond oligonucleotide design, such as oligonucleotide synthesis, nucleic acid amplification, or the like. In addition, the invention relates to a computer program product that includes a computer readable medium having a computer program for designing oligonucleotides for regulated recombination. The invention also provides kits that include various system components, such as the computer program product.
[0027] In one aspect, the present invention provides methods and associated systems that differentiate between ancestral and functional covariation of monomer subunits in biological molecules (e.g., amino acid residues that covary in proteins or nucleotides that covary in nucleic acids). For example, in one class of methods, covariation is characterized in a population of homologous polypeptides. In the methods, covarying amino acid residues are identified in a character string population that represents homologous parental polypeptides to produce a first covariation data set. Unlinked nucleic acids comprising the covarying amino acid residues are recombined to produce a set of recombinants that encode variants of the parental polypeptides. At least a subset of the recombinants is selected or screened for an encoded activity, producing a set of screened recombinants. Covarying residues are identified in the set of screened recombinants to produce a second covariation data set. Differences between the first and second covariation data sets are identified, thereby characterizing covariation in the population of homologous polypeptides. The differences between the first and second covariation data sets provide a measure of whether covariation in either set is a result of functional constraints of screened molecules during selection, or whether the covariation is simply a result of the ancestry of the molecules that are recombined.
[0028] The methods are generally applicable to any number of covarying residues. For example, 3, 4, 5, 6, 7, 8, 9, 10, or more amino acid residues in the character string population or the screened recombinants can be identified that covary with one another.
[0029] The methods can be applied to homologous nucleic acids or polypeptides, whether artificial or naturally occurring. For example, the homologous polypeptides can represent a phylogenetic family of molecules, whether artificially or naturally derived. Any biomolecule set used or made during the methods can provide systematically varied data, e.g., systematically varied amino acid sequences. [0030] Most typically, the unlinked nucleic acids comprise overlapping synthetic oligonucleotides. By selecting the overlap sites, the nucleic acids can be unlinked from their usual nearest neighbor genetic linkage relationships.
[0031] In one embodiment, the first and/or second covariation data sets are produced by analysis of mutual information. Optionally, the methods can include normalizing the first covariation data set prior to identifying differences between covarying data sets, e.g., to reduce background noise in the data sets.
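As an illustration only, the comparison of a first (parental) and a second (screened) covariation data set by mutual information can be sketched as follows. The sketch assumes pre-aligned polypeptide character strings of equal length. The function names, the toy sequences, and the 0.5-bit threshold are hypothetical choices made for this example and are not taken from the specification.

```python
from collections import Counter
from math import log2


def mutual_information(strings, i, j):
    """Mutual information, in bits, between alignment positions i and j."""
    n = len(strings)
    count_i = Counter(s[i] for s in strings)
    count_j = Counter(s[j] for s in strings)
    count_ij = Counter((s[i], s[j]) for s in strings)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def covariation_data_set(strings, threshold=0.5):
    """Position pairs whose mutual information meets the assumed threshold."""
    length = len(strings[0])
    pairs = set()
    for i in range(length):
        for j in range(i + 1, length):
            if mutual_information(strings, i, j) >= threshold:
                pairs.add((i, j))
    return pairs


# Toy data (hypothetical): aligned parental strings and screened recombinants.
parental = ["AKLE", "AKLE", "GRLD", "GRLD"]
screened = ["AKLE", "ARLE", "GKLD", "GRLD"]

first_set = covariation_data_set(parental)
second_set = covariation_data_set(screened)

# Covariation retained after screening: candidate functional covariation.
retained = first_set & second_set
# Covariation present only in the parents: more likely ancestral.
lost = first_set - second_set
print(sorted(retained), sorted(lost))
```

In this toy case, positions 0 and 3 covary in both sets, while the pairs involving position 1 covary only in the parental set. A real analysis would need far more sequences and, as noted above, may benefit from normalization to reduce background noise.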
[0032] As mentioned, the covariation present in both the first and second covariation data sets provides a measure of functional covariation present in the population of homologous polypeptides. Covariation present in the first data set that is not present in the second data set has a higher probability of being a result of the ancestry of the molecules that are produced, rather than as a result of a functional constraint of the selection that is applied. This information can be used for selecting residues to be altered during mutation, to increase the probability that mutation will have a functional effect. Similarly, additional rounds of mutagenesis (e.g., recursive recombination, cassette mutagenesis, site-directed mutagenesis, or the like) can be performed such that oligonucleotides for recombination and/or mutagenesis comprise covarying residues, thereby preserving covariation and making resulting mutants more likely to encode functional molecules. [0033] The methods can include generating a statistical model based on any covariation characterized in the population of homologous polypeptides. For example, regression-based algorithms such as partial least squares regression, multiple linear regression, inverse least squares regression, principal component regression, and variable importance for projection can be applied. Similarly, the statistical model can be produced using at least one probabilistic technique, such as Markov chain modeling.
[0034] Thus, in one embodiment, the invention provides a method of characterizing covariation in a population of homologous polypeptides, the method comprising identifying varying amino acid residues in a character string population that represents homologous parental polypeptides; identifying amino acid residues in the character string population that covary with one another to produce a parental covariation data set; providing a set of overlapping synthetic oligonucleotides comprising members that encode one or more varying amino acids identified in the character string population; recombining the overlapping synthetic oligonucleotides to produce a set of recombined polynucleotides that encode progeny of the homologous parental polypeptides; expressing at least a subset of the set of recombined polynucleotides to produce a set of progeny polypeptides; selecting or screening at least a subset of the progeny polypeptides for a desired property; sequencing one or more progeny polypeptides, or one or more recombined polynucleotides that encode the one or more progeny polypeptides, that comprise the desired property to produce a progeny sequence data set; identifying at least pairs of amino acid residues in the progeny sequence data set that covary with one another to produce a progeny covariation data set; and, identifying differences between the parental and progeny covariation data sets, thereby characterizing the covariation in the population of homologous polypeptides.
[0035] Systems, e.g., including computer readable instructions for performing any of the above methods, are also a feature of the invention.
DEFINITIONS
[0036] Before describing the present invention in detail, it is to be understood that this invention is not limited to particular compositions or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and appended claims, the singular forms "a," "an," and "the" include plural referents unless the content and context clearly dictate otherwise. Thus, for example, reference to "a parental polypeptide character string" includes a combination of two or more such character strings, and the like. Unless indicated otherwise, an "or" conjunction is intended to be used in its correct sense as a Boolean logical operator, encompassing both the selection of features in the alternative (A or B, where the selection of A is mutually exclusive from B) and the selection of features in conjunction (A or B, where both A and B are selected).
[0037] The following definitions supplement those known to persons of skill in the art.
[0038] The terms "family-based recombination," "multi-parental fragmentation-based recombination," and "multi-parental recombination" refer to the recombination of nucleic acid sequences, or character string representations thereof, derived from or based upon two or more parental nucleic acid sequences. For example, in certain embodiments, nucleic acid sequences, or character string representations thereof, derived from or based upon sequences from 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, or more parental sequences are recombined. The parental nucleic acid sequences are typically derived from a selected set of homologous nucleic acids, e.g., from a defined phylogenetic family. Parental sequences are derived from a selected set of homologous nucleic acids when they (individually or collectively) have regions of sequence identity (and, optionally, regions of sequence diversity) with more than one of the homologous nucleic acids. Most commonly, the parental nucleic acid sequences include multiple member types, each having regions of sequence identity to at least one member of the selected set of homologous nucleic acids.
[0039] A "character string" represents any entity capable of storing sequence information (e.g., the subunit structure of a biological molecule such as the nucleotide sequence of a nucleic acid, the amino acid sequence of a protein, the sugar sequence of a polysaccharide, etc.). In one embodiment, the character string can be a simple sequence of characters (letters, numbers, or other symbols) or it can be a numeric representation of such information in tangible or intangible (e.g., electronic, magnetic, etc.) form. The character string need not be "linear," but can also exist in a number of other forms, e.g., a linked list or other non-linear array (e.g., used as a code to generate a linear array of characters), or the like.
Character strings are preferably those which encode polynucleotide or polypeptide strings, directly or indirectly, including any encrypted strings, or images, or arrangements of objects which can be transformed unambiguously to character strings representing sequences of monomers or multimers in polynucleotides, polypeptides, or the like (whether made of natural or artificial monomers).
[0040] The term "substring" refers to a character string that is found within another character string. The substring represents a portion of the full-length string.
[0041] An "oligonucleotide character string" refers to a character string representation of an oligonucleotide. For example, an oligonucleotide character string may be a substring of another string, such as a parental character string representing a parental nucleic acid sequence (e.g., a gene or the like).
[0042] A "character" when used in reference to a character of a character string refers to a subunit of the string. In a preferred embodiment, the character of a character string encodes one subunit of an encoded biological molecule. Thus, for example, in preferred embodiments, where the encoded biological molecule is a protein, a character of the string encodes a single amino acid, or where the encoded biological molecule is a polynucleotide or oligonucleotide, a character of the string encodes a single nucleotide. [0043] A "parental polypeptide character string profile" refers to an alignment of polypeptide character strings. According to the methods of the present invention, parental polypeptide character string profiles are optionally utilized to identify allowed sequence paths (e.g., 'threaded') through the parental polypeptide character strings. [0044] An "allowed sequence path" refers to sequence of subunits or sites (e.g., characters, amino acids, nucleotides, etc.) that has a probability of occurrence that is greater than zero in a population of character strings or other sequences.
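One simple way to picture a profile and an allowed sequence path is sketched below. It assumes gap-free, pre-aligned parental polypeptide character strings and treats a path as allowed when every residue has been observed at its position in at least one parent (i.e., a nonzero positional frequency). This positional-independence model, the function names, and the example strings are assumptions made for illustration only.

```python
def build_profile(aligned_strings):
    """Return, for each alignment position, the set of residues observed there."""
    return [set(column) for column in zip(*aligned_strings)]


def is_allowed_path(candidate, profile):
    """True if every residue of the candidate occurs at that position in the profile."""
    if len(candidate) != len(profile):
        return False
    return all(residue in allowed for residue, allowed in zip(candidate, profile))


# Hypothetical parental polypeptide character strings.
parents = ["ACCY", "ACDY", "GCCY"]
profile = build_profile(parents)

print(is_allowed_path("GCDY", profile))  # chimeric path threaded through the parents
print(is_allowed_path("ACCW", profile))  # W never occurs at the last position
```

Here "GCDY" is not itself a parent, yet each of its residues occurs at the corresponding position in some parent, so it is threaded through the profile as an allowed path under this simplified model.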
[0045] The term "covariation" refers to the correlated variation of two or more variables (e.g., amino acids in a polypeptide, etc.). "Ancestral covariation" refers to the correlated variation of two or more residues that is due to a common ancestral origin. "Functional covariation" refers to the correlated variation of two or more residues that preserves protein structure and/or function. Functional links between covarying residues in a polypeptide can be due, e.g., to structural contact, overall charge distribution, any indirect effect, such as interactions with substrate, or the like. Residues which covary (based upon either functional or ancestral linkage) can be proximal to one another or distal to one another on a biopolymer of interest (e.g., a protein, nucleic acid, etc.). That is, the residues that covary can be closely grouped in a biopolymer of interest, or can be distantly spaced on the biopolymer.
[0046] A "covariation data set" refers to one or more sets of covarying amino acid residues that are identified in a population of polypeptides. For example, each set of covarying amino acids can include 2, 3, 4, 5, 6, 7, 8, 9, 10, or more residues.
[0047] A "biological molecule" refers to a molecule typically found in a biological organism. Preferred biological molecules include biological macromolecules that are typically polymeric in nature, being composed of multiple subunits. Typical biological molecules include, but are not limited to, molecules that share some structural features with naturally occurring polymers such as RNAs (formed from nucleotide subunits), DNAs (formed from nucleotide subunits), and polypeptides (formed from amino acid subunits), including, e.g., RNAs, RNA analogues, DNAs, DNA analogues, polypeptides, polypeptide analogues, peptide nucleic acids (PNAs), combinations of RNA and DNA (e.g., chimeraplasts), or the like.
[0048] The term "subunit" when used in reference to a biological molecule refers to the characteristic "monomer" of which a biological molecule is composed. Thus, for example, the subunit of a nucleic acid is a nucleotide, the subunit of a polypeptide is an amino acid, etc.
[0049] "Genetic algorithms" are processes which mimic evolutionary processes. Genetic algorithms (GAs) are used in a wide variety of fields to solve problems which are not fully characterized or too complex to allow full characterization, but for which some analytical evaluation is available. That is, GAs are used to solve problems which can be evaluated by some quantifiable measure for the relative value of a solution (or at least the relative value of one potential solution in comparison to another). In the context of the present invention, a genetic algorithm is typically a process for selecting or manipulating character strings in a computer, typically where the character string corresponds to one or more biological molecules (e.g., nucleic acids, proteins, PNAs, or the like).
[0050] "Directed evolution of character strings or objects" refers to a process of artificially changing a character string by artificial selection, recombination, or other manipulation, i.e., which occurs in a reproductive population in which there are (1) varieties of individuals, with some varieties being (2) heritable, of which some varieties (3) differ in fitness (reproductive success determined by outcome of selection for a predetermined property (desired characteristic)). The reproductive population can be, e.g., a physical population or a virtual population in a computer system.
[0051] "Genetic operators" are user-defined operations, or sets of operations, each comprising a set of logical instructions for manipulating character strings. Genetic operators are applied to cause changes in populations of individuals in order to find interesting (useful) regions of the search space (populations of individuals with predetermined desired properties) by predetermined means of selection. Predetermined (or partially predetermined) means of selection include computational tools (operators comprising logical steps guided by analysis of information describing libraries of character strings), and physical tools for analysis of physical properties of physical objects, which can be built (synthesized) from matter with the purpose of physically creating a representation of information describing libraries of character strings. In a preferred embodiment, some or all of the logical operations are performed in a computer.
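By way of a minimal sketch, genetic operators acting on character strings in a computer might look like the following. The three operators shown (single-point recombination, point mutation, and fitness-ranked selection), their names, and the toy fitness function are hypothetical illustrations. In practice, fitness would come from a physical or computational screen rather than from comparison with a known target string.

```python
def crossover(parent_a, parent_b, point):
    """Recombination operator: prefix of parent_a joined to suffix of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    return parent_a[:point] + parent_b[point:]


def point_mutation(string, position, new_char):
    """Mutation operator: replace the character at one position."""
    return string[:position] + new_char + string[position + 1:]


def select(population, fitness, keep):
    """Selection operator: keep the highest-scoring members of the population."""
    return sorted(population, key=fitness, reverse=True)[:keep]


# Stand-in fitness (assumption): number of positions matching a target string.
TARGET = "GKLD"


def toy_fitness(string):
    return sum(a == b for a, b in zip(string, TARGET))


parents = ["AKLE", "GRLD"]
child = crossover(parents[0], parents[1], 2)  # "AK" + "LD"
mutant = point_mutation(child, 0, "G")        # first character changed to G
population = parents + [child, mutant]
print(select(population, toy_fitness, keep=1))
```

Applying such operators recursively, with selection between rounds, gives the reproductive population described above: variation is generated, inherited by progeny strings, and ranked by fitness.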
[0052] When referring to operations on strings (e.g., recombinations, hybridizations, elongations, fragmentations, segmentations, insertions, deletions, transformations, etc.) it will be appreciated that the operation can be performed on the encoded representation of a biological molecule or on the "molecule" prior to encoding so that the encoded representation captures the operation.
[0053] "Similarity" can refer to a similarity measurement between the encoded representations of molecules (e.g., the initial character strings) and/or between the molecules represented by the encoded character strings.
[0054] "Genetic linkage" or "linkage" refers to co-assortment following recombination of two or more heritable elements (in this sense, "heritability" can be measured in an in vitro reaction, or as an in vivo phenomenon). For example, in classical genetics, genes or portions thereof, e.g., particular nucleotides or nucleotide sequences, co-assort (display co-variance) during descent. For example, two or more heritable elements do not display independent assortment, e.g., in gametes of an organism, if they are genetically linked. Genetic linkage is the result of a physical linkage of the heritable elements, e.g., on a chromosome of an organism, or, in an in vitro context, on a first nucleic acid that is to be recombined, e.g., with a second nucleic acid (the nucleic acids can be single or double stranded). By convention, the amount of recombination observed between heritable elements is a measure of the distance by which the heritable elements are separated (a 1% recombination rate is equivalent to a "map unit" in classical genetics). The present invention provides mechanisms for the controllable co-assortment of heritable elements (whether in vitro or in vivo), e.g., during synthetic recombination of one or more nucleic acids. That is, two or more heritable elements can occur in a physically linked arrangement, e.g., two nucleotides can occur in a given relative arrangement on a given nucleic acid. If the nucleic acid is fragmented during an enzymatic cleavage reaction and then rejoined via a polymerase or ligase reaction, genetic linkages are generally preserved because fragmentation of the nucleic acid occurs as a function of the distance between the two nucleotides, and the resulting fragments are rejoined at the sites of the cleavage reaction. 
If a recombination reaction is performed by synthesizing fragments of one or more nucleic acids to be recombined, the physical linkage of heritable elements can be lost, as the synthesis can be conducted in a manner that is independent of the distance between heritable elements. The present invention provides a number of mechanisms for maintaining co-assortment (or any co-variance phenomena), e.g., in the context of recombination of synthesized nucleic acids or nucleic acid fragments. In contrast, "unlinked" nucleic acids do not co-assort with, or are otherwise separate from, one another. For example, prior to performing a recombination reaction, as described herein, nucleic acids (e.g., synthetic oligonucleotides, etc.) are not linked to each other.
[0055] The term "nucleic acid" refers to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides which have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al. (1991) Nucleic Acid Res. 19:5081; Ohtsuka et al. (1985) J. Biol. Chem. 260:2605-2608; Rossolini et al. (1994) Mol. Cell. Probes 8:91-98). The term nucleic acid is used interchangeably with, e.g., gene, cDNA, and mRNA encoded by a gene.
[0056] A "nucleic acid sequence" refers to the order and identity of the nucleotides comprising a nucleic acid. [0057] A "polynucleotide" is a polymer of nucleotides (A, C, T, U, G, etc. or naturally occurring or artificial nucleotide analogues) or a character string representing a polymer of nucleotides, depending on context. Either the given nucleic acid or the complementary nucleic acid can be determined from any specified polynucleotide sequence. [0058] Two nucleic acids "correspond" when they have the same sequence, or when one nucleic acid is complementary to the other, or when one nucleic acid is a subsequence of the other, or when one sequence is derived, by natural or artificial manipulation from the other. [0059] An "annealing frequency" refers to a percentage of overlapping oligonucleotides that anneal at a selected temperature.
[0060] Nucleic acids are "elongated" when additional nucleotides (or other analogous molecules) are incorporated into the nucleic acids. The reaction is typically catalyzed by a polymerase, e.g., a DNA polymerase, an RNA polymerase, or the like, which typically adds sequences at the 3' terminus of the nucleic acid.
[0061] Nucleic acids are "ligated" or joined together in a reaction typically catalyzed by, e.g., a ligase or by an enzyme having ligase activity (e.g., which catalyzes formation of phosphodiester linkages between 3' and 5' positions of nucleic acids and nucleic acid analogs).
[0062] A "chimeric" nucleic acid sequence can include a sequence composed of nucleic acid subsequences derived from different sources, e.g., nucleic acid fragments from different genes, different organisms, and the like.
[0063] Two nucleic acids are "recombined" when sequences from each of the two nucleic acids are combined in a progeny nucleic acid. Two sequences are "directly" recombined when both of the nucleic acids are substrates for recombination.
[0064] Nucleic acid sequences or character strings "overlap" when they possess at least one substantially complementary subsequence or substring. An "overlap region" refers to subsequences or substrings of a pair of nucleic acid character strings (e.g., oligonucleotide character strings) that are substantially complementary to one another. Two nucleic acid sequences or character strings are "substantially complementary" to one another when at least 80%, preferably 90%, and more preferably 95% or more nucleotides or characters in the overlap region of the two sequences or character strings are complementary to one another when aligned for maximum correspondence.
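The 80% criterion for an overlap region can be illustrated with a short sketch. It assumes DNA character strings limited to A, C, G, and T, both written 5' to 3', and an overlap region that has already been aligned without gaps. The function names and example sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA character string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def percent_complementary(region_a, region_b):
    """Percent of positions at which two antiparallel overlap regions base-pair."""
    if len(region_a) != len(region_b):
        raise ValueError("overlap regions must have equal length")
    partner = reverse_complement(region_b)
    matches = sum(x == y for x, y in zip(region_a, partner))
    return 100.0 * matches / len(region_a)


def substantially_complementary(region_a, region_b, threshold=80.0):
    """Apply the 80% criterion (90% or 95% can be passed as stricter thresholds)."""
    return percent_complementary(region_a, region_b) >= threshold


# Hypothetical overlap region and variants of its partner strand.
top = "ATGCCGTAAC"
perfect = reverse_complement(top)   # fully complementary partner
one_mismatch = "A" + perfect[1:]    # first base altered
three_mismatches = "AAA" + perfect[3:]

print(percent_complementary(top, perfect))
print(percent_complementary(top, one_mismatch))
print(substantially_complementary(top, three_mismatches))
```

Changing bases in the partner strand lowers the score, which is the same lever used when overlap regions are adjusted to raise or lower the probability of annealing between a pair of oligonucleotides.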
[0065] "Adjusted overlapping oligonucleotide character strings" include pairs of oligonucleotide character strings having adjusted overlap regions, e.g., increased or decreased by changing one or more nucleotides therein, such that the probability of hybridization or annealing between two oligonucleotide character strings in a particular pair is increased or decreased (e.g., at a given temperature, such as an assembly reaction temperature, etc., in in silico simulations or otherwise).
[0066] The "probability of hybridization" is the likelihood that a pair of nucleic acid sequences or character string representations (e.g., in in silico simulations or the like) will hybridize to one another, e.g., at a selected temperature (e.g., an assembly reaction temperature, etc.). See, e.g., Garzon et al. "Virtual test tubes: A new methodology for computing," in SPIRE-2000 4th Int. Meeting on String Processing and Information Retrieval, Sep 2000. A Coruna, Spain, pp. 116-121, IEEE Computer Society Press, 2000.
[0067] The term "gene" is used broadly to refer to any segment of DNA associated with a biological function. Thus, genes include coding sequences and/or the regulatory sequences required for their expression. Genes also include nonexpressed DNA segments that, for example, form recognition sequences for other proteins. Genes can be obtained from a variety of sources, including cloning from a source of interest or synthesizing from known or predicted sequence information, and may include sequences designed to have desired parameters.
[0068] The word "degenerate" refers to more than one codon coding or representing an amino acid. [0069] Nucleic acid sequences or character strings representing such sequences include "degeneracies" when they include regions that are substantially the same or similar, or regions that are complementary to such regions.
[0070] Nucleic acids are "homologous" when they share sequence similarity that is derived, naturally or artificially, from a common ancestral sequence. This occurs naturally as two or more descendent sequences deviate from a common ancestral sequence over time as the result of mutation and natural selection. Artificially homologous sequences may be generated in various ways. For example, a nucleic acid sequence can be synthesized de novo to yield a nucleic acid that differs in sequence from a selected parental nucleic acid sequence. Artificial homology can also be created by artificially recombining one nucleic acid sequence with another, as occurs, e.g., during cloning or chemical mutagenesis, to produce a homologous descendent nucleic acid. Artificial homology may also be created using the redundancy of the genetic code to synthetically adjust some or all of the coding sequences between otherwise dissimilar nucleic acids in such a way as to increase the frequency and length of highly similar stretches of nucleic acids while minimizing resulting changes in amino acid sequences to the encoded gene products. Preferably, such artificial homology is directed to increasing the frequency of identical stretches of sequence of at least three base pairs in length. More preferably, it is directed to increasing the frequency of identical stretches of sequence of at least four base pairs in length.
[0071] It is generally assumed that two nucleic acids have common ancestry when they demonstrate sequence similarity. However, the exact level of sequence similarity necessary to establish homology varies in the art. In general, for purposes of this disclosure, two nucleic acid sequences are deemed to be homologous when they share enough sequence identity to permit direct recombination to occur between the two sequences.
[0072] The terms "identical" or percent "identity," in the context of two or more nucleic acid or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same, when compared and aligned for maximum correspondence, as measured using a sequence comparison algorithm or by visual inspection.
[0073] A region of "high sequence similarity" refers to a region that is 90% or more identical to a second selected region when aligned for maximal correspondence (e.g., manually or using the common program BLAST set to default parameters). A region of "low sequence similarity" is 50% or less identical, preferably 40% or less identical, e.g., 25% or less identical to a second selected region, when aligned for maximal correspondence (e.g., manually or using BLAST set with default parameters). A "region of similarity" or "region of sequence similarity" refers to a region that includes at least 25% or greater identity to a second selected region when aligned for maximal correspondence. For example, a "region of amino acid sequence similarity" refers to a region in an amino acid sequence that includes at least 25% or greater sequence identity to a second selected region when aligned for maximal correspondence.
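The identity thresholds above can be made concrete with a small sketch. It assumes the two regions have already been aligned to equal length (for example by an external alignment program) and that gap characters, written here as "-", never count as matches. The function names and the "intermediate" label for values between the two thresholds are assumptions of this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned positions carrying the same non-gap character."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned regions must have equal length")
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


def similarity_class(aligned_a, aligned_b):
    """Classify a region using the 90% (high) and 50% (low) thresholds."""
    identity = percent_identity(aligned_a, aligned_b)
    if identity >= 90.0:
        return "high"
    if identity <= 50.0:
        return "low"
    return "intermediate"


# Hypothetical aligned amino acid regions.
reference = "ACDEFGHIKL"
print(similarity_class(reference, "ACDEFGHIKV"))  # 9 of 10 positions identical
print(similarity_class(reference, "ACDEFGHWWW"))  # 7 of 10 positions identical
print(similarity_class(reference, "ACDEFWWWWW"))  # 5 of 10 positions identical
```

The denominator used here is the full alignment length. Other conventions (for example, excluding gap columns) give different percentages, so the convention should be stated alongside any reported value.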
[0074] The phrase "substantially identical," in the context of two nucleic acids or polypeptides, refers to two or more sequences or subsequences that have at least 60%, preferably 80%, most preferably 90-95% nucleotide or amino acid residue identity, when compared and aligned for maximum correspondence, as measured using, e.g., a sequence comparison algorithm or by visual inspection. Preferably, the substantial identity exists over a region of the sequences that is at least about 50 subunits in length, more preferably over a region of at least about 100 subunits, and most preferably the sequences are substantially identical over at least about 150 subunits. In some embodiments, the sequences are substantially identical over the entire length of, e.g., the coding regions.
[0075] Nucleic acids "hybridize" or "anneal" when they associate, typically in solution (or with one component fixed to a solid support). Nucleic acids hybridize due to a variety of well-characterized physico-chemical forces, such as hydrogen bonding, solvent exclusion, base stacking and the like. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology: Hybridization with Nucleic Acid Probes part I chapter 2 "Overview of principles of hybridization and the strategy of nucleic acid probe assays," (Elsevier, New York), as well as Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (2000 Supplement). Hames and Higgins (1995) Gene Probes 1 IRL Press at Oxford University Press, Oxford, England, and Hames and Higgins (1995) Gene Probes 2 IRL Press at Oxford University Press, Oxford, England provide details on the synthesis, labeling, detection and quantification of DNA and RNA, including oligonucleotides.
[0076] The terms "polypeptide," "peptide," and "protein" are used interchangeably herein to refer to a polymer of amino acid residues, or a character string representing an amino acid polymer, depending on context. The terms apply to amino acid polymers in which one or more amino acid residues are artificial chemical analogs of corresponding naturally occurring amino acids, as well as to naturally occurring amino acid polymers.
[0077] A "polypeptide character string" refers to a character string representation of a polypeptide.
[0078] A "polypeptide sequence" refers to the order and identity of the amino acids comprising a polypeptide.
[0079] A polypeptide is "reverse-translated" in a process that determines at least one sequence of nucleotides that encodes the polypeptide from the sequence of amino acids in the polypeptide.
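A minimal sketch of reverse translation follows. Because the genetic code is degenerate, many nucleotide sequences encode the same polypeptide. This example simply picks one codon per amino acid from the standard genetic code. The particular codon chosen for each residue and the function name are arbitrary assumptions for illustration; a practical design would typically choose codons according to host usage or desired overlap properties.

```python
# One standard-code codon per amino acid (the choice among synonymous
# codons is arbitrary here and is an assumption of this sketch).
ONE_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAC", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "TCT", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTT",
}


def reverse_translate(polypeptide):
    """Return one DNA character string that encodes the polypeptide string."""
    try:
        return "".join(ONE_CODON[residue] for residue in polypeptide)
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None


# Hypothetical polypeptide character string.
print(reverse_translate("ACCY"))
```

The output is only one of the many nucleotide sequences that encode the same polypeptide, which is the freedom exploited when coding sequences are adjusted without changing the encoded amino acids.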
[0080] A "full-length protein" is a protein with substantially the same sequence domains as a corresponding protein encoded by a natural gene. Such a protein can have altered sequences relative to the corresponding naturally encoded gene, e.g., due to recombination and selection, but unless specified to the contrary, is typically at least about 95% the length of a corresponding naturally encoded protein. The protein can include additional sequences such as purification tags not found in the corresponding naturally encoded protein. [0081] An "ortholog" is a polypeptide or nucleic acid sequence that evolved by vertical descent from a common ancestor. A "paralog" arises from duplication and domain shuffling within a genome.
[0082] A "phylogenetic family" refers to organisms, nucleic acid sequences, polypeptide sequences, or the like that share a common evolutionary relationship or lineage pattern.
[0083] A "subsequence" or "fragment" is a portion of an entire sequence of nucleic acids or amino acids.
[0084] A "library" or "population" refers to a collection of at least two different molecules and/or character strings, such as nucleic acid sequences (e.g., genes, oligonucleotides, etc.) or expression products (e.g., enzymes) therefrom. A library or population generally includes large numbers of different molecules. For example, a library or population typically includes at least about 100 different molecules, more typically at least about 1000 different molecules, and often at least about 10000 or more different molecules. [0085] A "set" refers to a collection of at least two molecule or sequence types, e.g., 2, 3, 4, 5, 10, 20, 50, 100, 1,000 or more molecule or sequence types.
[0086] "Screening" refers to the process in which one or more properties of one or more bio-molecules is determined. For example, typical screening processes include those in which one or more properties of one or more members of one or more libraries is/are determined.
[0087] "Selection" refers to the process in which one or more bio-molecules is identified as having one or more properties of interest. Thus, for example, one can screen a library to determine one or more properties of one or more library members. If one or more of the library members is/are identified as possessing a property of interest, it is selected. Selection can include the isolation of a library member, but this is not necessary. Further, selection and screening can be, and often are, simultaneous.
[0088] "Heuristically-derived" modeling or analytical techniques are approaches to scoring the relative importance of one or more variables towards one or more parameters (e.g., functional parameters or the like). To illustrate, heuristically-derived analytical techniques include regression-based algorithms, motif- or pattern-based algorithms, etc. Examples of regression-based algorithms include partial least squares (PLS), variable importance for projection (VIP), multiple linear regression (MLR), inverse least squares (ILS), principal components regression (PCR), and the like. Examples of motif-based algorithms include neural networks, classification and regression trees (CART), multivariate adaptive regression splines (MARS), and the like.
[0089] An "encoded activity" refers to functional activity or property (e.g., a catalytic property, etc.) of a polynucleotide or a polypeptide expressed by a polynucleotide.
[0090] "Normalized" data refers to data that has been conformed to or reduced to a norm or standard, e.g., by removing biasing effects, such as imperfectly distributed datapoints from a data set. As described herein, for example, covariation found among a set of screened proteins can be normalized to an inherent distribution of covariance in a library of proteins by characterizing and utilizing the sequence distribution in a pre-screened library to remove, e.g., artifactual covariation attributable to, e.g., oligonucleotide degeneracy biases produced during synthesis.
[0091] "Systematically varied data" refers to data in a data set in which more than one parameter is changed simultaneously to produce the data set. Thus, for example, recursively recombined sequence information can provide a systematically varied data set (e.g., a set of nucleic acid or polypeptide sequences) where more than one nucleic acid or amino acid residue is simultaneously varied to produce the data set.
BRIEF DESCRIPTION OF THE DRAWINGS
[0092] Figure 1 schematically illustrates overlap regions of oligonucleotide character strings.
[0093] Figure 2 schematically shows an adjusted overlap region of a pair of oligonucleotide character strings that would increase the probability of hybridization between the two oligonucleotides, which are from different sequence families.
[0094] Figure 3 schematically illustrates an adjusted overlap region of a pair of oligonucleotide character strings that would decrease the probability of hybridization between the two oligonucleotides, which are from different sequence families.
[0095] Figure 4 schematically depicts an overlap region of a pair of oligonucleotide character strings adjusted in a segment encoding identical amino acids so as to decrease the probability of hybridization between the two oligonucleotides, which are from different sequence families.
[0096] Figure 5 schematically shows steps performed in one embodiment of a method of designing oligonucleotides according to the present invention. [0097] Figure 6 schematically provides a possible hidden Markov model for the peptide ACCY.
[0098] Figure 7 schematically shows a multiple sequence alignment and corresponding Markov chains.
[0099] Figure 8 is a chart that schematically shows certain steps performed in an embodiment of a method of identifying amino acids in polypeptides that are important for a polypeptide sequence-activity relationship.
[0100] Figure 9 schematically illustrates steps performed under the control of system software in one embodiment of the invention.
[0101] Figure 10 schematically depicts a representational digital device according to the present invention.
[0102] Figure 11 schematically illustrates a recombined library that incorporates diversity from each position uniformly and independently of context and abundance among the parental genes. The resulting variants are systematically varied for each independent residue position.
[0103] Figures 12A-C show unrooted tree representations of relationships between subtilisin sequences with evolutionary distances indicated in point accepted mutations (PAM) units that were generated using the DARWIN software package (cbrg.inf.ethz.ch). The phylogenetic trees illustrate the uniform distribution of the variants in the library as compared to the highly clustered distribution of the natural subtilisin genes. The graphs also illustrate the similar uniform distribution of active and non-active sequences. In particular, Figure 12A shows a parental set of 15 subtilisin orthologs, Figure 12B shows 89 variants isolated and sequenced prior to functional screening, and Figure 12C shows 96 variants characterized as positive hits.
[0104] Figures 13A-C show normalized mutual information (MI) content between positions in a protein sequence indicating the extent of covariation between amino acid substitutions. The matrix displays the degree of mutual information for all-against-all residue pairs (black is high MI content and grey is low MI content). Only residues that varied in the sequence alignment were retained in the matrix. More specifically, Figure 13A shows 15 parental sequences, Figure 13B shows 89 variants isolated prior to functional screening, and Figure 13C shows 96 functionally active variants selected from a synthetically recombined library.
[0105] Figure 14 schematically depicts covarying sites mapped onto the Savinase® crystal structure (1SVN).
DETAILED DISCUSSION OF THE INVENTION
I. INTRODUCTION
[0106] Computer programs exist which analyze families of polynucleotide or polypeptide sequences probabilistically rather than by constructing consensus sequences based on averages. For example, systems based upon Markov chain models generally define the probability of each amino acid occurring at a given position n based on the previous n-1 amino acids by comparison with an aligned family of sequences. The present invention provides similar probabilistic linkages of sequence information via appropriate design and construction of oligonucleotides, or character string representations of such sequences, upon subsequent physical or in silico recombination. The results closely approximate linkage characteristics typically achieved with recombination techniques, such as fragmentation-based multi-parental recombination.
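As a minimal sketch of the Markov chain idea described above, the following Python estimates, for each alignment column, the probability of a residue given the residue in the preceding column. This first-order, position-specific form is an assumption chosen for brevity; the toy alignment and the function name `position_transitions` are hypothetical and not taken from the specification.

```python
from collections import Counter, defaultdict

def position_transitions(alignment):
    """Estimate P(residue at column n | residue at column n-1) for each column n >= 1."""
    length = len(alignment[0])
    model = []
    for n in range(1, length):
        pair_counts = defaultdict(Counter)
        for seq in alignment:
            pair_counts[seq[n - 1]][seq[n]] += 1
        # Convert raw counts into conditional probabilities.
        model.append({
            prev: {cur: c / sum(counts.values()) for cur, c in counts.items()}
            for prev, counts in pair_counts.items()
        })
    return model

# Hypothetical toy alignment containing two sequence "families".
alignment = ["ACDE", "ACDE", "AGHE", "AGHE", "ACDE"]
model = position_transitions(alignment)
print(model[0]["A"])  # distribution over column 1 given 'A' in column 0
print(model[1])       # C is always followed by D, G always by H
```

In this toy case the second transition table shows complete linkage (C with D, G with H), which is the kind of within-family association that the oligonucleotide designs described here aim to preserve.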
[0107] In particular, family-based recombination typically preserves genetic linkages within a given family of sequences beyond the length of individual fragments. To illustrate, consider a set of eight protein sequences with an overall identity of 75%, that can be further divided into two "families" (e.g., phylogenetic families) of sequences A and B, where the amino acid identity within each family is 90%. Fragmentation-based recombination of the genes encoding these proteins would tend to favor family A genes remaining together and family B genes remaining together. However, there would be some lower probability of crossovers between the two families.
[0108] In contrast, synthetic oligonucleotide-based recombination as it is generally practiced involves oligonucleotides designed to overlap along the length of the gene to be assembled. Figure 1 schematically illustrates such an embodiment. As shown, each synthetic oligonucleotide (oligo) includes, e.g., two 20 base pair (bp) overlap regions, one at either end of the molecule, and some selected length (e.g., another 20 bp sequence) disposed between the overlap regions. The degeneracies within the oligonucleotides which allow them to encode multiple amino acids are typically designed to minimize effects on hybridization, that is, any oligo 1 can hybridize with any oligo 2 which can hybridize with any oligo 3.
[0109] As described herein, the invention provides physical ways of achieving bias in hybridization to more closely mimic fragmentation-based recombination, such as by making oligonucleotide overlap regions non-identical. For example, as shown in Figures 2 and 3, if the overlap region between oligos 2 and 3 occurs in a region where the specified amino acid sequences are different, the nucleic acid sequence in this region is optionally chosen to give a hybridization bias, such that A oligos can be used to encode variations within the A family, while B oligos are used to encode variations within the B family. As shown in the examples schematically depicted in Figures 2 and 3, Family A sequences (oligo 2A) all have a glutamine (CAA or CAG) and Family B (oligo 2B) a glutamate (GAA or GAG) at a certain position. Oligonucleotides can be designed that vary the degree or probability of hybridization between Family A and Family B oligos (e.g., oligos 2 and 3) by the choice of codons used to represent these amino acids. As shown in Figure 2, to maximize homology between oligos 2A and 3B, and thereby maximize the chances of crossovers occurring between the two families in this region of overlap and minimize the linkage probability of oligos 2A and 3A, the closest Glu codon to CAA is selected (i.e., GAA). In contrast, as shown in Figure 3, to minimize homology between oligos 2A and 3B, and thereby minimize the chances of crossovers occurring between the two families in this region of overlap and maximize the linkage probability of oligos 2A and 3A, the furthest Glu codon from CAA is selected (i.e., GAG).
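The codon choice described for Figures 2 and 3 can be illustrated with a short Python sketch. It assumes that "closest" and "furthest" are measured by the number of mismatched bases (Hamming distance); the helper names `hamming` and `pick_codon` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

# Glutamate codons named in the text for Family B.
GLU_CODONS = ["GAA", "GAG"]

def pick_codon(reference, candidates, maximize_homology=True):
    """Pick the candidate codon closest to (or furthest from) the reference codon."""
    chooser = min if maximize_homology else max
    return chooser(candidates, key=lambda codon: hamming(reference, codon))

# Family A uses the glutamine codon CAA at this position.
closest = pick_codon("CAA", GLU_CODONS, maximize_homology=True)
furthest = pick_codon("CAA", GLU_CODONS, maximize_homology=False)
print(closest, furthest)
```

Choosing the closest codon favors cross-family annealing in the overlap, while choosing the furthest codon favors within-family linkage.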
[0110] In a similar approach, linkages can be introduced by using the degeneracies in the genetic code. Figure 4 schematically illustrates that using A oligos to encode the variations within the A family, and B oligos to encode the variations within the B family, oligos can be designed which encode the same amino acids, but by varying the nucleotide sequence, can be made to anneal preferentially within the same family to thereby maintain genetic linkage in that family. As shown, both families encode aspartic acid and cysteine at the same positions in the overlap region of oligos 2 and 3. Thus, to minimize homology between the A and B families in this region, different codons can be selected to encode these amino acids, namely, GAC for aspartic acid and UGC for cysteine in oligo 2A, and GAU for aspartic acid and UGU for cysteine in oligo 2B. As a consequence and similar to fragmentation-based recombination, family A genes tend to remain together and family B genes tend to remain together with some lower probability of crossovers between the two families.
[0111] Using these strategies, it is possible to maintain genetic linkages within families over large distances, even over regions in which sequences from different families may be identical. In particular, the degree of crossover between different families may be regulated both by the degree of mismatch between the oligonucleotide pairs and by the proximity of the mismatches to the ends of the sequences, which typically results in a calculable melting temperature for each oligonucleotide pair.
[0112] In overview, the following discussion provides details pertaining to regulated synthetic oligonucleotide-mediated recombination which is optionally performed entirely or partially in silico. In particular, the discussion describes methods of designing oligonucleotides for regulated recombination and computer implemented methods of maintaining genetic linkages within families of sequences over distances greater than individual oligonucleotide lengths during recombination. The discussion additionally relates to systems and software for performing the methods described herein.
II. OLIGONUCLEOTIDE DESIGN
[0113] The invention provides methods of designing oligonucleotides for regulated recombination that approximate certain attributes of family-based recombination, including maintaining desired linkages within families of sequences beyond individual oligonucleotide lengths. In general, the methods include adjusting overlap regions of pairs of overlapping oligonucleotide character strings in a population of overlapping oligonucleotide character strings such that subsequent recombination is biased towards a desired amino acid linkage. For example, the level of complementarity in an overlap region of an oligonucleotide pair is optionally varied to either increase or decrease the probability that the two oligonucleotides will hybridize to one another in a given assembly reaction mixture.
[0114] During oligonucleotide design, overlap adjustments typically entail changing (e.g., substituting, adding, deleting, etc.) nucleotides (e.g., character representations of nucleotides, etc.) in overlap regions to effect increases or decreases in overlap regions (regions of hybridization). As described above, such adjustments to regulate hybridization bias between overlapping oligonucleotide pairs are optionally made in regions where the amino acids encoded at given positions are the same or different. Additional levels of control over hybridization probabilities for pairs of overlapping oligonucleotides are achieved by varying the particular segments within overlap regions selected for modification (e.g., proximity to an end or the center of an overlap region). Optionally, multiple segments (e.g., multiple nucleotides, multiple codons, etc.) within an overlap region are adjusted as described herein, whether adjoining one another or in different segments of the overlap region.
For example, the adjustments optionally include determining annealing frequencies for overlap regions and accordingly changing nucleotides in selected portions of overlap regions such that the annealing frequencies of the overlap regions are substantially proportional to the degree of desired genetic or amino acid linkage. In addition, multiple overlap regions in a population of overlapping oligonucleotides are typically adjusted such that the probability of hybridization for multiple pairs of overlapping oligonucleotides is increased or decreased as desired. Furthermore, although overlap adjustments are optionally performed manually (e.g., using a physical alignment of oligonucleotide character strings, etc.), in preferred embodiments, they are performed using a digital system, such as a computer or other logic device. Logic devices, systems, and software useful in performing the methods of the present invention are described further below.
[0115] Oligonucleotide character strings of essentially any length are optionally utilized in the methods described herein. In preferred embodiments, each overlapping oligonucleotide character string typically includes between about 10 and about 100 nucleotides (e.g., character representations of nucleotides, etc.), more typically between about 15 and about 75 nucleotides, and usually between about 20 and about 60 nucleotides (e.g., about 30, about 40, or about 50 nucleotides). Optionally, each overlapping oligonucleotide character string includes an identical or a different number of nucleotides. As additional options, a maximum or a minimum length of at least one of the overlapping oligonucleotide character strings is automatically or manually set. Furthermore, the overlap region of the pair of overlapping oligonucleotide character strings typically includes between about 5 and about 35 nucleotides, more typically between about 10 and about 30 nucleotides, and usually between about 15 and 30 nucleotides.
[0116] As noted above, one aspect of the present invention provides for selectively adjusting regions of overlap between nucleic acids that are to be hybridized (the region of overlap is the portion of the relevant nucleic acids that hybridize). The hybridization characteristics of an overlap region can be adjusted by modifying the number of complementary bases in an overlap region (in general, more complementary residues provide for more stable hybridization) and/or by adjusting the types of residues in the overlapping regions (in general, C-G bases form more stable hybrids than A-T base pairs, e.g., due to the number of hydrogen bonds between hybridized base residues, as well as base-stacking and solvent exclusion phenomena). Thus, in general, one adjusts the number and/or type of residues in an overlap region to achieve a Tm that provides for hybridization under selected recombination conditions, or alternately, one can perform an adjustment to reduce hybridization stability (e.g., where unlinking, rather than linking of oligos is desired).
[0117] The basic parameters for adjusting hybridization are well known, and include, e.g., a variety of well-characterized physico-chemical forces, such as hydrogen bonding, solvent exclusion, base stacking and the like. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology— Hybridization with Nucleic Acid Probes, e.g., part I, Chapter 2, "Overview of principles of hybridization and the strategy of nucleic acid probe assays," (Elsevier, New York), as well as in Current Protocols in Molecular Biology. F.M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (supplemented through 2002) ("Ausubel"); Hames and Higgins (1995) Gene Probes 1. IRL Press at Oxford
University Press, Oxford, England (Hames and Higgins 1) and Hames and Higgins (1995) Gene Probes 2. IRL Press at Oxford University Press, Oxford, England (Hames and Higgins 2); PCR Protocols A Guide to Methods and Applications (Innis et al, eds.) Academic Press Inc. San Diego, CA (1990) and Rapley, R. and Walker, J.M. eds., Molecular Biomethods Handbook (Humana Press, Inc. 1998). The Tm of a nucleic acid duplex indicates the temperature at which the duplex is 50% denatured under a given set of conditions (temperature, solvent, presence of salts, presence of denaturants, or the like) and it represents a measure of the stability of the nucleic acid hybrid. Thus, the Tm depends on length of the region of overlap, nucleotide composition of the overlap, and the solvent that oligos are hybridized together in. Solvents provide lower hybridization stringency (and, thus, increased Tm for a given region of overlap) at higher salt concentrations. Similarly, higher stringency (and thus, a lower Tm for a given region of overlap) is provided by reducing salt concentration.
[0118] In one simple embodiment, the Tm for a region of overlap can be approximated as follows: Tm (°C) = 4(G + C) + 2(A + T), where A (adenine), C (cytosine), T (thymine), and G (guanine) are the numbers of the corresponding nucleotides and the salt conditions are roughly those used in a typical polymerase- or ligase-mediated reaction. The hybridization properties of the region of overlap are adjusted to change the overall number or type of complementary residues.
[0119] As noted above, the overlap regions can be adjusted to increase or decrease the likelihood that two or more nucleic acids will hybridize under a given set of solvent and temperature conditions. One aspect of the present invention utilizes the ability to adjust regions of overlap to control linkage between biological subunits of interest. That is, controlling the probability of hybridization between two oligonucleotides directly affects whether sequences contained within or encoded by the two oligonucleotides will be linked. The higher the Tm for a region of overlap between two nucleic acids, the higher the likelihood that the regions will recombine and, thus, the tighter the linkage will be between nucleotides (or encoded amino acids) in the two oligonucleotides. Conversely, the lower the Tm, the lower the linkage. Actual linkage can be calculated according to standard methods, e.g., by measuring how frequently two nucleotides or encoded amino acids vary together in progeny populations of nucleic acids and encoded polypeptides. This empirical measure of linkage is optionally correlated to the Tm which was used to provide hybridization between different oligonucleotides that were hybridized to produce the progeny populations.
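The simple approximation given in paragraph [0118], 4 °C per G or C and 2 °C per A or T, can be sketched in Python as follows. The function name `approx_tm` is hypothetical, and the example segments are the aspartic acid and cysteine codon pairs discussed for Figure 4, written here as DNA (T in place of U).

```python
def approx_tm(overlap):
    """Approximate Tm (deg C) of an overlap region as 4*(G + C) + 2*(A + T)."""
    seq = overlap.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at

# Synonymous Asp-Cys codon pairs: GAC-TGC versus GAT-TGT.
tm_gc_rich = approx_tm("GACTGC")
tm_at_rich = approx_tm("GATTGT")
print(tm_gc_rich, tm_at_rich)

# A hypothetical 20-nucleotide overlap with equal G/C and A/T content.
tm_overlap = approx_tm("ATGCATGCATGCATGCATGC")
print(tm_overlap)
```

Under this approximation, swapping synonymous codons changes the estimated Tm of an overlap without changing the encoded amino acids. The rule ignores mismatch position and salt effects, so it is only a first-pass estimate.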
[0120] Figure 5 further schematically illustrates certain steps typically performed in the methods of designing oligonucleotides for regulated recombination described herein. As shown, step A1 includes providing parental polypeptide character strings that are distinguished from one another by at least one amino acid difference when aligned for maximum identity (e.g., members of a phylogenetic family, artificial constructs, or the like). Step A2 includes providing a desired amino acid linkage. Exemplary target polypeptides and polynucleotides encoding the same are provided below. In preferred embodiments, at least some of the method steps described herein are performed in a digital system (e.g., a computer or the like). For example, the methods typically include inputting the parental polypeptide character strings and the desired linkage into such a device prior to further manipulation. Optionally, desired amino acid linkages are calculated using a probabilistic or other statistical technique (e.g., a Markov chain modeling method or the like), or are selected manually (e.g., designed, etc.). Probabilistic techniques, such as Markov chain models, are described further below.
[0121] Typically, the methods include providing at least two parental polypeptide character strings that are optionally members of an identical or a different phylogenetic family. Optionally, the methods include defining a phylogenetic family computationally or manually. For example, when aligned for maximum identity, parental polypeptide character strings generally include at least one region of amino acid sequence similarity. Further, the methods also typically include deriving the desired amino acid linkage from portions of multiple parental polypeptide character strings, which are optionally, e.g., orthologs or paralogs. Sequence search and alignment algorithms useful in practicing the methods described herein, including those such as the Basic Local Alignment Search Tool (BLAST) are described further below. [0122] As further shown in step A3 of Figure 5, the parental polypeptide character strings are subsequently reverse-translated into parental polynucleotide character strings, which parental polynucleotide character strings are then segmented into overlapping oligonucleotide character strings in step A4. Optionally, the methods include reverse-translating the parental polypeptide character strings according to a species codon-bias of a selected host in which recombinant products are to be expressed. Thereafter, in step A5, the methods include adjusting overlap regions of pairs of overlapping oligonucleotide character strings to bias subsequent recombination towards the desired amino acid linkage provided, e.g., in step A2. [0123] It will be appreciated by those of skill in the art to which this invention pertains that there are many conceivable variations in practicing the methods described herein. 
As such there is no attempt made herein to provide all possible variations within the scope of this invention, such as the number of segments into which a target polynucleotide is divided, the number of different sequence families included in oligonucleotide design processes, or the like. However, additional details relating to oligonucleotide-mediated recombination are provided in, e.g., Published International Application Nos. WO 00/42561, entitled "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION," by Crameri et al., WO 01/23401, entitled "USE OF CODON-VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SEQUENCE RECOMBINATION," by Welch et al., and the other references cited herein, which are all incorporated herein by reference in their entirety for all purposes.
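Steps A3 and A4 of Figure 5 (reverse translation followed by segmentation into overlapping oligonucleotide character strings) can be sketched as below. This is an illustrative simplification under stated assumptions: the one-codon-per-residue table `PREFERRED_CODON` is hypothetical rather than a real host codon-usage table, all oligos are written on one strand, and the 60-nucleotide length with 20-nucleotide overlaps follows the Figure 1 example.

```python
# Hypothetical single-codon table for a handful of residues (illustration only).
PREFERRED_CODON = {
    "M": "ATG", "A": "GCG", "C": "TGC", "D": "GAT", "E": "GAA",
    "Q": "CAG", "Y": "TAT", "K": "AAA", "L": "CTG", "G": "GGC",
}

def reverse_translate(protein):
    """Step A3: map each amino acid character to its chosen codon."""
    return "".join(PREFERRED_CODON[aa] for aa in protein)

def segment(dna, oligo_len=60, overlap=20):
    """Step A4: cut a polynucleotide string into oligos sharing `overlap` bases."""
    step = oligo_len - overlap
    oligos, start = [], 0
    while True:
        oligos.append(dna[start:start + oligo_len])
        if start + oligo_len >= len(dna):
            break
        start += step
    return oligos

dna = reverse_translate("MACDEQYKLG" * 4)  # hypothetical 40-residue parent
oligos = segment(dna)
for left, right in zip(oligos, oligos[1:]):
    # Adjacent oligos share the same 20 characters before any adjustment (step A5).
    print(left[-20:] == right[:20])
```

Step A5 would then edit the shared 20-character regions, for example by swapping synonymous codons, to raise or lower the hybridization probability of particular oligo pairs.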
III. SEQUENCE ALIGNMENT AND COVARIATION DETERMINATIONS
[0124] A threshold issue in practicing the present invention is selecting or designing the oligonucleotide sequences to be, e.g., synthesized, recombined, manipulated further in silico, or the like. They can be derived from nucleic acid sequences that are homologous, non-homologous, and/or purely practitioner designed. Sequence information available from nucleic acid databases is a useful reference during the selection and design process. Genbank®, Entrez®, EMBL, DDBJ, and the NCBI are examples of public database/search services that can be accessed. Many sequence databases are available via the internet or on a contract basis from a variety of companies specializing in genomic information generation and/or storage.
[0125] When designing oligonucleotides according to the methods described herein, the present invention optionally includes aligning nucleic acid sequences or regions of similarity. For example, in one aspect, the invention relates to methods of regulating the recombination of at least two parental nucleic acids. In an embodiment of these methods, the composition of nucleic acids to be recombined is provided by aligning homologous nucleic acid sequences (e.g., orthologs, paralogs, or the like) to select conserved regions of sequence identity and regions of sequence diversity. Similarly, a reiterative aspect of the invention includes deriving the sequences of an additional population of oligonucleotides from selected nucleic acids produced in previous rounds of recombination by aligning those sequences to identify regions of identity and regions of diversity.
[0126] In these processes of sequence comparison and homology determination, one sequence is often used as a reference against which other test nucleic acid sequences are compared. This comparison can be accomplished with the aid of a sequence comparison algorithm (e.g., embodied in a set of logic instructions), or by visual inspection. When an algorithm is employed, test and reference sequences are input into a computer, subsequence coordinates are designated, as necessary, and sequence algorithm program parameters are specified. The sequence comparison algorithm then calculates the percent sequence identity for the test nucleic acid sequence(s) relative to the reference sequence, based on the specified program parameters.
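As a minimal sketch of the percent identity calculation described above, the following Python compares a test sequence with a reference that has already been aligned to it. The convention of skipping columns where both sequences have a gap is an assumption, and the function name and example sequences are hypothetical; real alignment programs differ in how they count gapped columns.

```python
def percent_identity(reference, test):
    """Percent of aligned columns in which the test matches the reference."""
    if len(reference) != len(test):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for r, t in zip(reference, test):
        if r == "-" and t == "-":
            continue  # column carries no information for this pair
        compared += 1
        if r == t:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

identity = percent_identity("ACDEFGHIKL", "ACDQFGHIRL")
print(identity)
```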
[0127] Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat'l. Acad. Sci. USA 85:2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, WI), or by visual inspection.
[0128] One example of an algorithm that is suitable for determining percent sequence identity and sequence similarity (e.g., among polynucleotides, polypeptides, etc.) is the BLAST algorithm, which is described in, e.g., Altschul et al., J. Mol. Biol. 215:403-410 (1990). Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/). This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always > 0) and N (penalty score for mismatching residues; always < 0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, a cutoff of 100, M=5, N=-4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see, Henikoff & Henikoff (1992) Proc. Natl. Acad. Sci. USA 89:10915).
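The three stopping conditions for word-hit extension can be illustrated with a simplified, ungapped, rightward-only extension in Python. This is a sketch of the idea and not the BLAST implementation: the seed is assumed to be an exact match, the value chosen for X and the function name `extend_right` are hypothetical, and M=5, N=-4 are the nucleotide values quoted above.

```python
def extend_right(query, subject, q_start, s_start, word_len,
                 match=5, mismatch=-4, x_drop=20):
    """Extend an exact seed to the right; return (best score, best length)."""
    score = best = match * word_len  # seed assumed to match exactly
    best_len = word_len
    i = word_len
    # Condition 3: stop when the end of either sequence is reached.
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score >= x_drop or score <= 0:
            # Conditions 1 and 2: score fell by X from its maximum, or reached zero.
            break
        i += 1
    return best, best_len

result = extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, word_len=4)
print(result)
```

In this hypothetical pair the first eight bases match and the remainder do not, so the best-scoring segment ends after eight bases even though the scan continues a little further before stopping.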
[0129] In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin & Altschul (1993) Proc. Nat'l. Acad. Sci. USA 90:5873-5877). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.1, more preferably less than about 0.01, and most preferably less than about 0.001. Additional details regarding BLAST programs, including newer BLAST 2.0 programs that are optionally used in practicing the methods described herein are provided in, e.g., Rashidi and Buehler, Bioinformatics Basics: Applications in Biological Science and Medicine, CRC Press (2000), Pevzner, Computational Molecular Biology: An Algorithmic Approach, MIT Press (2000) (Pevzner), and in the references disclosed therein, which are incorporated herein by reference. Other available sequence alignment programs include, e.g., PILEUP, DNAPLOT, or the like.
[0130] One example of aligning proteins relies on the flexible statistical model called the hidden Markov model (HMM), which was initially applied to speech recognition. A general introduction to HMMs is provided in, e.g., Rabiner (1989) "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE 77:257-286. In the context of the present invention, this model utilizes defined 'threading' through a multiple sequence alignment, e.g., to identify and quantitate amino acid motifs in an alignment of polypeptide character strings.
In particular, the threading matrix, but not the sequence alignment consensus itself, is subsequently used to identify novel proteins that can be clustered with the original group of HMM structures. HMM or variations thereof can be excellent statistical tools used in defining oligonucleotide sequences for practicing the methods described herein including, e.g., divergence in split-pool oligonucleotide synthesis. Using the HMM matrix, each position is given a certain set of options (e.g., delete, insert or add the next amino acid) and the percentage of oligonucleotide-carrying beads going down each path in split-pool synthesis can easily be calculated based on a parental display. Oligonucleotide synthesis is described further below. Additional details relating to Markov chains, hidden Markov models, and other statistical techniques optionally utilized in the methods of the present invention is included in, e.g., Pevzner, supra, Baldi and Brunak, Bioinformatics: The Machine Learning
Approach, MIT Press (1998), Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press (1998), and in the references cited therein, which are incorporated by reference.
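One simplified way to picture how bead fractions for split-pool synthesis might be derived from a parental alignment is to compute per-column residue frequencies, treating a gap as the "delete" option. This sketch deliberately ignores the transition structure of a full HMM; the parental sequences and the function name `column_fractions` are hypothetical.

```python
from collections import Counter

def column_fractions(alignment):
    """For each alignment column, the fraction of parents carrying each character."""
    fractions = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        fractions.append({res: n / total for res, n in counts.items()})
    return fractions

# Hypothetical parental alignment; '-' marks a deletion relative to the others.
parents = ["ACCY", "ACCY", "AC-Y", "GCCY"]
fractions = column_fractions(parents)
for position, dist in enumerate(fractions, start=1):
    print(position, dist)
```

Read this way, three quarters of the beads would receive A and one quarter G at the first position, and one quarter would skip the third position entirely.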
[0131] As shown in Figure 6, the typical hidden Markov model is a chain of match (square), insert (diamond), and delete (circle) nodes, with all transitions between nodes and all character costs in the insert and match nodes trained to specific probabilities (i.e., the known parents). The single best path through an HMM corresponds to a path from a start state to an end state in which each character of the sequence is related to a successive match or insertion state along that path. Delete states indicate that the sequence has no character corresponding to that position in the HMM.
[0132] Transitions from state to state progress from left to right through the model, with the exception of self-loops on insertion states (Figure 6). The self-loops allow insertions of any length to fit the model, regardless of the length of other sequences in the family. A path through the model can represent any sequence. The probability of any sequence, given the model, is computed by multiplying the emission and transition probabilities along the path. Figure 6 illustrates a possible hidden Markov model for the peptide ACCY. As shown, a path through the model represented by ACCY is highlighted. The peptide is represented as a sequence of probabilities. The numbers in the boxes show the probability that an amino acid occurs in a particular state, and the numbers next to the directed arcs show probabilities which connect the states. For instance, the probability of A being emitted in position one is 0.3, and the probability of C being emitted in position two is 0.6. The probability of ACCY along this path is:
0.4 × 0.3 × 0.46 × 0.6 × 0.97 × 0.5 × 0.015 × 0.73 × 0.01 × 1 = 1.76 × 10^-6. Or, by transforming probabilities to logs so that addition can replace multiplication: log_e(0.4) + log_e(0.3) + log_e(0.46) + log_e(0.6) + log_e(0.97) + log_e(0.5) + log_e(0.015) + log_e(0.73) + log_e(0.01) + log_e(1) = -13.25.
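The path calculation above can be checked with a short script. The following is a minimal sketch that simply takes the ten factors listed for the ACCY path in order; the function names are illustrative and are not part of the disclosure.

```python
import math

# Transition and emission probabilities along the highlighted ACCY path,
# copied in order from the worked product in the text.
path_factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]


def path_probability(factors):
    """Multiply the probabilities encountered along one path through the model."""
    result = 1.0
    for p in factors:
        result *= p
    return result


def path_log_probability(factors):
    """Sum natural logs instead of multiplying, so long paths do not underflow."""
    return sum(math.log(p) for p in factors)


prob = path_probability(path_factors)
log_prob = path_log_probability(path_factors)

print(f"probability     = {prob:.3g}")
print(f"log probability = {log_prob:.2f}")
```

Working in log space gives the same ranking of paths as the raw product, which is why the two forms are interchangeable when comparing candidate sequences.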
Capturing Covariation in Synthetic Libraries
[0133] One problem associated with libraries produced by recombining synthesized oligonucleotides relates to the typically vast size of such libraries, which increases the difficulty in identifying library members with desired properties during selection or screening. This is generally a consequence of not capturing covariation, which is typically captured in fragmentation-based recombination procedures. For example, a library produced by a fragmentation-based recombination technique will typically include a "window" of between about 50 and about 100 nucleotides of covariation depending upon recombined fragment size, whereas a library created by recombining synthesized oligonucleotides will generally have a "window" of about one (i.e., no captured covariance). Covariation identified between independent amino acid residues among phylogenetically-related genes is due either to evolutionary artifacts or direct or indirect functional constraints.
[0134] In particular, by capturing covariation (i.e., linking residues), the present invention minimizes the size of a given synthetic library while retaining the number of sequences that score positively for one or more desired properties. In other words, the present invention increases synthetic library quality by capturing residue (e.g., amino acid, nucleotide, etc.) covariation, unlike many other synthetic approaches to library construction. For example, each introduced link having two optional linkages (i.e., selecting one of the two options) reduces the library size by one half. Accordingly, the cost associated with screening or selecting the library will also be reduced by one half. Additional aspects of amino acid covariation are illustrated below in an example.
[0135] While linking covariation of amino acid residues in proteins is emphasized herein for purposes of clarity of illustration, it will be understood that the invention is optionally utilized to capture covariation of other monomers, such as nucleotides in RNA and DNA. For example, in certain cases, one may want to link specific codons, such as the rare leucine (UUA) codon in Streptomyces, which is involved in the regulation of antibiotic production (see, e.g., White and Bibb (1997) "bldA dependence of undecylprodigiosin production in Streptomyces coelicolor A3(2) involves a pathway-specific regulatory cascade," J. Bacteriol. 179(3):627-33, van Wezel GP et al. (1995) "The tuf3 gene of Streptomyces coelicolor A3(2) encodes an inessential elongation factor Tu that is apparently subject to positive stringent control," Microbiology 141(10):2519-28, and Leskiw et al. (1991) "TTA codons in some genes prevent their expression in a class of developmental, antibiotic-negative, Streptomyces mutants," Proc. Natl. Acad. Sci. U S A. 88(6):2461-5). Thus, there may be instances where one would want to link specific codons, e.g., instead of or in conjunction with amino acids, or RNA/DNA nucleotides.
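The effect of a link on library size, noted above in connection with capturing covariation, can be sketched by enumerating a small library with and without a link. The positions, residue options, and the linked pair below are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical variable positions, each with two residue options.
options = {10: "AV", 25: "KR", 40: "DE"}

# Hypothetical link: positions 10 and 25 co-vary, so only these two
# pairings (the "two optional linkages") are allowed.
links = {(10, 25): {("A", "K"), ("V", "R")}}


def enumerate_library(options, links):
    """List every combination of options that satisfies all links."""
    positions = sorted(options)
    library = []
    for combo in product(*(options[p] for p in positions)):
        chosen = dict(zip(positions, combo))
        if all((chosen[a], chosen[b]) in allowed
               for (a, b), allowed in links.items()):
            library.append("".join(combo))
    return library


unlinked = enumerate_library(options, {})
linked = enumerate_library(options, links)

print(len(unlinked), "members without the link")
print(len(linked), "members with the link")
```

With three binary positions the unlinked library has eight members, and the single link between two of those positions leaves four, i.e., one half.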
[0136] In certain embodiments of the present invention, statistical techniques such as Markov chain modeling and the like are optionally used to capture covariation (e.g., in sequences of a few amino acids or in longer sequences) in libraries produced by recombining synthesized oligonucleotides. For example, the methods optionally include aligning parental polypeptide character strings for maximum identity to produce a character string profile and identifying allowed sequence paths (e.g., Markov chains, etc.) through the character string profile. Figure 7 schematically illustrates a multiple sequence alignment of polypeptide character strings and corresponding Markov chains. Markov chains are characterized by having parental character strings 'threaded' through the profile so that all allowed sequence paths are given a probability factor. The probability of a given sequence is dependent upon the frequency of its occurrence among the compiled parental character strings. The resulting information is optionally extracted as a covariance-derived matrix (a table of information setting forth covariation relationships for the compiled strings), rather than as a consensus sequence. A desired amino acid linkage is optionally selected from among the allowed sequence paths provided in the matrix. In particular, the constructed Markov matrix is optionally used as a constraint for all progeny, which will reduce the size of a library produced by recombining synthetic oligonucleotides, while retaining the same covariation patterns present among the parental strings. Accordingly, oligonucleotides are optionally computationally derived in view of these constraints and subsequently synthesized, e.g., using a split-pool process that incorporates trinucleotides in order to minimize the number of oligonucleotides utilized.
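The threading step can be illustrated with a minimal sketch. It assumes pre-aligned parental strings of equal length and a first-order chain (each residue depends only on the residue in the preceding column); the order of the chain and the example parents are assumptions, not details taken from the disclosure.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, pre-aligned parental polypeptide character strings.
parents = ["ACDY", "ACEY", "GCEW", "GCEW"]


def thread(parents):
    """Count start residues and column-to-column transitions among the parents."""
    length = len(parents[0])
    start = Counter(p[0] for p in parents)
    transitions = [defaultdict(Counter) for _ in range(length - 1)]
    for p in parents:
        for i in range(length - 1):
            transitions[i][p[i]][p[i + 1]] += 1
    return start, transitions


def allowed_paths(parents):
    """Return every path seen transition-by-transition, with its probability."""
    start, transitions = thread(parents)
    paths = {res: count / len(parents) for res, count in start.items()}
    for step in transitions:
        extended = {}
        for prefix, prob in paths.items():
            following = step[prefix[-1]]
            seen = sum(following.values())
            for residue, count in following.items():
                extended[prefix + residue] = prob * count / seen
        paths = extended
    return paths


paths = allowed_paths(parents)
# Size of the library if every column varied independently of its neighbors.
unconstrained = math.prod(len(set(col)) for col in zip(*parents))

for seq, prob in sorted(paths.items()):
    print(seq, round(prob, 3))
print(len(paths), "allowed paths versus", unconstrained, "unconstrained")
```

In this toy case the chain excludes ACDW and GCDW because no parent follows D with W, so the constrained library has six members rather than eight while every adjacent pairing present in the parents is retained.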
[0137] An additional advantage of these methods is that Markov chain-based systematics for introducing covariance operates as an intermediate recombination concept between fragmentation-based recombination procedures with "windows" of covariation of between about 50 and about 100 nucleotides and non-systematic synthetic oligonucleotide-based recombination with "windows" of covariation of about one nucleotide. In particular, the "window" size will automatically reflect the distribution of covariance among the parents, which can optionally be scaled up or down by user defined constraints (i.e., artificially defined), such as specified library sizes (e.g., library larger than X, but smaller than Y, etc.), removal of covariance in selected percentages of parents, or the like.
Using Cross Products in Heuristically-derived Models for Sequence Space Exploration
[0138] Interactions (e.g., second order, third order, etc.) among amino acid residues are important for protein sequence-activity (function) relationships (PSAR (PSFR)). Another aspect of the invention involves calculating cross product terms (i.e., co-varying residues) among various columns corresponding to amino acid residue positions in a matrix. The cross product terms are then typically added to the linear terms, which correspond to amino acid residues, to generate an expanded X predictor matrix. Heuristically-derived models are generated with the expanded predictor matrix to identify important cross terms along with linear terms. This cross product and linear term information is then typically utilized in the construction of subsequent libraries. For example, two amino acid residues alone may not be important, e.g., as manifested by weights of linear terms in PLS modeling, but their cross product term may be important. Accordingly, the corresponding amino acid positions may be good candidates for exploration in subsequent rounds of artificial evolution to ensure optimal sequence space searching.
[0139] To further illustrate, Figure 8 is a chart that shows certain steps performed in an embodiment of a method of identifying amino acids in polypeptides that are important for a polypeptide sequence-activity relationship (e.g., to provide desired amino acid linkages, etc.). As shown in B1, the methods include providing an X predictor matrix that includes a data set corresponding to at least two parental polypeptide character strings in which a physicochemical property or biological activity is known for at least one of the at least two parental polypeptide character strings. The at least two parental polypeptide character strings typically include, e.g., a set of systematically varied polypeptide character strings or the like, e.g., produced by one or more artificial evolution procedures, such as any of those described herein. As further shown in B2, the methods also include calculating one or more cross product terms between or among columns of the X predictor matrix. Each column entry corresponds to an amino acid of a parental polypeptide character string from the at least two parental polypeptide character strings. In addition, the methods also include adding at least one of the one or more cross product terms calculated in step B2 to one or more linear terms (e.g., which correspond to amino acid residues) of the X predictor matrix to produce an expanded X predictor matrix (B3). Cross product terms identify covarying amino acids in the at least two parental polypeptide character strings, whereas the linear terms correspond to amino acids in the at least two parental polypeptide character strings. Thereafter, the methods include generating a model with the expanded X predictor matrix to identify important cross product terms and/or linear terms to identify the amino acids in the at least two parental polypeptide character strings that are important for a polypeptide sequence-activity relationship (B4). An example that further illustrates aspects of amino acid covariation is provided below.
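Steps B1 through B4 can be sketched on a toy data set. The sketch below assumes three variable positions coded as -1/+1 in a full factorial design and a made-up activity; because the columns of such a design are orthogonal, a simple averaged dot product gives the least-squares weight of each term. This stands in for the regression-based or pattern-based models named in the text and is not the modeling method of the disclosure. All position names are hypothetical.

```python
from itertools import combinations, product

# B1: hypothetical X predictor matrix, one row per variant, one column per
# variable position, coded -1 / +1 for the two residue options.
names = ["pos12", "pos47", "pos88"]
X = [list(row) for row in product((-1, 1), repeat=len(names))]

# Hypothetical activity: pos88 matters on its own, while pos12 and pos47
# matter only through their interaction.
y = [2 * row[2] + 3 * row[0] * row[1] for row in X]


def expand(X, names):
    """B2 and B3: append pairwise cross product columns to the linear columns."""
    pairs = list(combinations(range(len(names)), 2))
    expanded_names = list(names) + [f"{names[i]}*{names[j]}" for i, j in pairs]
    expanded = [row + [row[i] * row[j] for i, j in pairs] for row in X]
    return expanded, expanded_names


def term_weights(X, y, names):
    """B4 stand-in: least-squares weights for an orthogonal -1/+1 design."""
    n = len(X)
    return {name: sum(row[k] * yi for row, yi in zip(X, y)) / n
            for k, name in enumerate(names)}


X_expanded, expanded_names = expand(X, names)
weights = term_weights(X_expanded, y, expanded_names)

for name, weight in weights.items():
    print(f"{name:12s} {weight:+.1f}")
```

Here the linear terms for pos12 and pos47 receive zero weight while their cross product receives the largest weight, which is the situation described above in which two residues look unimportant individually but are good candidates for joint exploration.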
[0140] Optionally, the heuristically-derived models are produced using one or more regression-based algorithms selected from, e.g., a partial least squares regression, a multiple linear regression, an inverse least squares regression, a principal component regression, a variable importance for projection, or the like. As an additional option, the model is produced using one or more pattern-based algorithms selected from, e.g., a neural network, a classification and regression tree, a multivariate adaptive regression spline, or the like. Heuristically-derived models are generally known in the art and are described further in, e.g., International Publication Nos. WO 00/42560, entitled "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS," filed January 18, 2000 by Selifonov et al., WO 01/75767, entitled "IN SILICO CROSS-OVER SELECTION," filed March 30, 2001 by Gustafsson et al., and the references cited therein, which are incorporated by reference in their entirety for all purposes.
[0141] Typically, the important cross product terms and/or linear terms identified in B4 are used to design polypeptide libraries. As mentioned, in certain aspects, two or more linear terms individually may include unimportant terms for the polypeptide sequence-activity relationship. However, cross product terms calculated from the two or more linear terms may be identified as important for the polypeptide sequence-activity relationship. Cross product terms typically correspond to interactions (e.g., structural or functional interactions) between or among amino acids in the polypeptide sequence variants. For example, the interactions include, e.g., secondary or tertiary interactions, direct interactions, indirect interactions, long-range interactions, physicochemical interactions, interactions due to folding intermediates, translational effects, ligand binding, and/or the like.
[0142] In one aspect, the present invention includes multivariate analysis, e.g., analysis of covariation in systematically varied data sets. Several methods of constructing and analyzing dataspace, e.g., including multivariate analysis, are available. See, e.g., Hinchliffe (1996) Modeling Molecular Structures John Wiley and Sons, NY, NY; Gibas and Jambeck (2001) Bioinformatics Computer Skills O'Reilly, Sebastopol, CA; Pevzner (2000) Computational Molecular Biology: An Algorithmic Approach. The MIT Press, Cambridge MA; Durbin et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK; Rashidi and Buehler (2000) Bioinformatic Basics: Applications in Biological Science and Medicine. CRC Press LLC, Boca Raton, FL; and Mount (2001) Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Press, New York.
[0143] For example, multivariate characterization of biological macromolecules can be performed via any of a variety of statistical methods, including PCA, PLS, Markov modeling and the like. Examples of statistical modeling methods can be found in the literature, including Hellberg et al. (1986) "The prediction of bradykinin potentiating potency of pentapeptides. An example of a peptide quantitative structure-activity relationship." Acta Chem. Scand. B. 40: 135-140; Ufkes, et al., (1978) "Structure-activity relationships of bradykinin potentiating peptides." Eur. J. Pharmacol. 50; Ufkes et al. (1982) "Further studies on the structure-activity relationships of bradykinin-potentiating peptides." Eur. J. Pharmacol., 79:155-158; Mee et al. (1997) "Design of active analogues of a 15-residue peptide using D-optimal design, QSAR and a combinatorial search algorithm." J. Pept. Res. 49:89-102; Jonsson et al., "Quantitative sequence-activity models (QSAM) - tools for sequence design" Nucleic Acids Res. 21: 733-739; Knaus and Bujard, (1988) "of coliphage lambda: an alternative solution for an efficient promoter." EMBO J. 7:2919-2923; Lanzer and Bujard (1988) "Promoters largely determine the efficiency of repressor action." Proc. Natl. Acad. Sci. U S A. 85: 8973-8977; and Brunner and Bujard (1987) "Promoter recognition and promoter strength in the Escherichia coli system," EMBO J. 6: 3139-3144.
[0144] In one example, a matrix is used to correlate each multidimensional data point with a specific output vector in order to identify the relationship between a matrix of dependent variables Y and a matrix of predictor variables X. A common analytical tool for this type of analysis is Partial Least Squares Projections to Latent Structures (PLS). Each data point can consist of multiple different parameters that are plotted against each other in an n-dimensional dataspace (one dimension for each parameter). Manipulations are done in a computer system, which adds whatever number of dimensions is needed to handle the input data. PCA, PLS and other methods can be used to find projections and planes in the data space.
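To make the projection idea concrete, the following is a minimal sketch of the first latent component of a single-response PLS fit (often called PLS1), written in plain Python. The data values are made up, and a practical analysis would normally use an established statistics package with several components and cross-validation; none of the names below come from the disclosure.

```python
import math

# Hypothetical predictor matrix X (rows = variants, columns = descriptors)
# and response vector y (e.g., measured activity).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [3.0, 3.0, 7.0, 7.0]


def pls1_first_component(X, y):
    """Fit one PLS1 latent component; return weights, scores, and fitted y."""
    n = len(X)
    x_means = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_means)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: the direction in X space with maximal covariance with y.
    w = [sum(row[k] * yi for row, yi in zip(Xc, yc))
         for k in range(len(x_means))]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores: projection of each centered row onto the weight vector.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]

    # Regress y on the scores to obtain the y-loading, then reconstruct y.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    fitted = [y_mean + ti * q for ti in t]
    return w, t, fitted


w, t, fitted = pls1_first_component(X, y)
print("weights:", [round(v, 3) for v in w])
print("fitted: ", [round(v, 3) for v in fitted])
```

In this toy case the response is an exact linear function of the two descriptors, so a single latent direction reproduces it; real sequence-activity data generally require more components and leave a residual.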
[0145] Multivariate characterization can also be performed by use of genetic algorithms, or neural networks. Examples of useful genetic algorithms and neural network models that can be applied to multivariate analysis are found in the literature, e.g., Schneider and Wrede (1994), "The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site," Biophys. J., 66(2 Pt 1): 335-44; Wrede, et al., (1998) "Peptide design aided by neural networks: biological activity of artificial signal peptidase I cleavage sites." Biochemistry, 37: 3588-3593; Schneider et al., (1998) "Peptide design by artificial neural networks and computer-based evolutionary search." Proc. Natl. Acad. Sci. U S A. 95(21): 12179-84; Patel et al. (1998), "Patenting computer-designed peptides." J. Comput. Aided Mol. Des., 12(6): 543-56; and Schneider and Wrede (1998) "Artificial neural networks for computer-based molecular design" Prog. Biophys. Mol. Biol. 70(3): 175-222.
IV. OLIGONUCLEOTIDE SYNTHESIS
[0146] In general, sets of oligonucleotides can be combined for assembly in many different formats and different combination schemes to effect correlation with genetic events and operators at the physical level.
[0147] As noted, overlapping sets of oligonucleotides, e.g., which correspond to populations that include adjusted overlapping oligonucleotide character strings, can be synthesized and then hybridized and elongated to form full-length nucleic acids. A full-length nucleic acid is any nucleic acid desired by an investigator which is longer than the oligos which are used in the gene reconstruction methods. This can correspond to any percentage of a naturally occurring full-length sequence, e.g., 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90% or more of the corresponding natural sequence.
[0148] Oligo sets often have at least about 5, sometimes about 10, often about 15, generally about 20, or more, nucleotide overlap sequences to facilitate gene reconstruction. Oligo sets are optionally simplified for gene reconstruction purposes where regions of fortuitous overlap are present, i.e., where repetitive sequence elements are present or designed into a gene sequence to be synthesized. Lengths of oligos in a set can be the same or different, as can the regions of sequence overlap (regions of hybridization). To facilitate hybridization and elongation (e.g., during cycles of PCR), overlap regions are optionally designed with similar melting temperatures.
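The overlap design described above can be sketched as follows. The sketch tiles a sequence into oligos of fixed length with a fixed overlap, alternates strands so that neighbors can hybridize, and estimates each overlap's melting temperature with the rough Wallace rule (2 °C per A/T and 4 °C per G/C). The oligo length, the overlap, the Wallace rule, and the example sequence are all illustrative assumptions rather than parameters taken from the disclosure.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def wallace_tm(seq):
    """Rough melting temperature in deg C: 2 per A/T plus 4 per G/C."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def tile(seq, oligo_len=40, overlap=20):
    """Split seq into overlapping oligos, alternating top and bottom strands."""
    step = oligo_len - overlap
    oligos, overlaps = [], []
    for index, start in enumerate(range(0, len(seq) - overlap, step)):
        fragment = seq[start:start + oligo_len]
        oligos.append(fragment if index % 2 == 0 else reverse_complement(fragment))
        if start > 0:
            # Region shared with the previous oligo (the hybridization region).
            overlaps.append(seq[start:start + overlap])
    return oligos, overlaps


# Hypothetical 99-nucleotide sequence used only for illustration.
gene = ("ATGAGCAAAGGTGAAGAACTGTTTACCGGCGTTGTGCCGATTCTGGTGGA"
        "ACTGGATGGCGATGTGAACGGCCATAAATTTAGCGTGAGCGGCGAAGGC")

oligos, overlaps = tile(gene)
for region in overlaps:
    print(region, wallace_tm(region))
```

Overlaps whose estimated melting temperatures differ widely could then be lengthened or shortened by a few bases, which is one simple way to approach the "similar melting temperatures" goal.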
[0149] Parental sequences can be gridded (conceptually or physically) and the common sequences used to select common sequence oligos, thereby combining oligo members into one or more sets to reduce the number of oligos required for making full-length nucleic acids. Similarly, oligonucleotides with some sequence similarity can be generated by pooled and/or split synthesis where pools of oligos under synthesis are split into different pools during the addition of heterologous bases, optionally followed by rejoined synthesis steps (pooling) at subsequent stages where the same additions to the oligos are desired. In oligonucleotide recombination formats, heterologous oligos corresponding to many different parents can be split and rejoined during synthesis. In simple degenerate synthetic approaches, more than one nucleobase can be added during single synthetic steps to produce two or more variations in sequence in two or more resulting oligonucleotides. The relative percentage of nucleobase addition can be controlled to bias synthesis towards one or more parental sequence. Similarly, partial degeneracy can be practiced to prevent the insertion of stop codons during degenerated oligonucleotide synthesis.
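The point about partial degeneracy and stop codons can be illustrated by expanding degenerate codons. The sketch below compares a fully degenerate NNN codon with the commonly used NNK scheme (K = G or T); NNK is offered only as one familiar example of partial degeneracy and is not stated in the text above.

```python
from itertools import product

# IUPAC ambiguity codes used in this example (subset only).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "K": "GT"}

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expand(degenerate_codon):
    """List every concrete codon encoded by a degenerate codon."""
    return ["".join(bases)
            for bases in product(*(IUPAC[b] for b in degenerate_codon))]


def stop_codons_in(degenerate_codon):
    """Return the stop codons that a degenerate codon can produce."""
    return sorted(set(expand(degenerate_codon)) & STOP_CODONS)


for scheme in ("NNN", "NNK"):
    print(scheme, len(expand(scheme)), "codons, stops:", stop_codons_in(scheme))
```

Restricting the third base removes two of the three stop codons while halving the number of codons synthesized at that position.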
[0150] Oligos which correspond to similar subsequences from different parents can be the same length or different, depending on the subsequences. Thus, in split and pooled formats, some oligos are optionally not elongated during every synthetic step (to avoid frame-shifting, some oligos are not elongated for the steps corresponding to one or more codon).
[0151] Various approaches are optionally used to further regulate oligonucleotide-mediated recombination as described herein. For example, when constructing oligos, crossover oligos are optionally constructed at one or more point of difference between two or more parental sequences (a base change or other difference is a genetic locus, which can be treated as a point for a crossover event). The crossover oligos have a region of sequence identity to a first parental sequence, followed by a region of identity to a second parental sequence, with the crossover point occurring at the locus. For example, every natural mutation can be a crossover point.
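Crossover oligo construction can be sketched for two aligned parents of equal length. The parental strings and the flank length below are hypothetical; the function names are illustrative only.

```python
# Hypothetical aligned parental sequences that differ at two loci.
parent_a = "ATGGCTAGCAAAGGTGAAGAACTG"
parent_b = "ATGGCTAGTAAAGGTGAGGAACTG"


def points_of_difference(first, second):
    """Return the aligned positions at which two parents differ."""
    return [i for i, (x, y) in enumerate(zip(first, second)) if x != y]


def crossover_oligo(first, second, locus, flank=6):
    """Identity to `first` upstream of the locus, then identity to `second`."""
    start = max(0, locus - flank)
    end = min(len(second), locus + flank)
    return first[start:locus] + second[locus:end]


loci = points_of_difference(parent_a, parent_b)
print("points of difference:", loci)
for locus in loci:
    print(locus, crossover_oligo(parent_a, parent_b, locus))
```

Swapping the argument order gives the reciprocal oligo, so each point of difference can yield crossovers in both directions.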
[0152] Another way of biasing sequence recombination is to spike a mixture of oligonucleotides with fragments of one or more parental nucleic acids (if more than one parental nucleic acid is fragmented, the resulting segments can be spiked into a recombination mixture at different frequencies to bias recombination outcomes towards one or more parent). Optionally, selected synthesized oligonucleotides (e.g., adjusted oligonucleotides, etc.) are used to spike a reaction mixture to bias recombination. Recombination events can also be engineered simply by omitting one or more oligonucleotide corresponding to one or more parent from a recombination mixture.
[0153] In addition to the use of families of related oligonucleotides, diversity is optionally modulated by the addition of selected, pseudo-random or random oligos to the elongation mixture, which can be used to bias the resulting full-length sequences. Similarly, mutagenic or non-mutagenic conditions can be selected for PCR elongation, resulting in more or less diverse libraries of full-length nucleic acids.
[0154] In addition to mixing oligo sets which correspond to different parents in the elongation mixture, oligo sets which correspond to just one parent can be elongated to reconstruct that parent. In either case, any resulting full-length sequence can be fragmented and recombined, as in the DNA recombination methods noted in the references cited herein.
[0155] Optionally, the methods of the invention include displaying designed members of a selected population of oligonucleotide character strings (e.g., that includes adjusted overlapping oligonucleotide character strings) graphically, using an output device, such as a monitor. For example, a selected population of oligonucleotide character strings is optionally displayed in an order form format, e.g., suitable for submission to providers of oligonucleotide synthesis services when physical embodiments of the character strings are desired for recombination. Custom oligonucleotide synthesis is available from commercial suppliers, such as The Midland Certified Reagent Company (mcrc@oligos.com), The Great American Gene Company (genco.com), ExpressGen Inc. (expressgen.com), Operon Technologies, Inc. (operon.com), and many others.
[0156] Many other oligonucleotide synthetic variations, such as trinucleotide, i.e., codon-based phosphoramidite synthesis techniques, which are optionally correlated to genetic events and operators at the physical level are found in, e.g., WO 00/42561, entitled "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al. and WO 01/23401, entitled "USE OF CODON-VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SEQUENCE RECOMBINATION" by Welch et al.
V. IN VITRO RECOMBINATION
[0157] When a population of nucleic acids (e.g., a population of single-stranded oligonucleotides) corresponding to the selected population that includes adjusted overlapping oligonucleotide character strings is synthesized or otherwise provided, the population of nucleic acids can be recombined, e.g., in vitro in a pool of such sequences. To briefly illustrate, the population of single-stranded oligonucleotides can be hybridized to one another, e.g., by cooling to about 20°C to about 75°C, and preferably from about 40°C to about 65°C. Hybridization is optionally accelerated by the addition of polyethylene glycol ("PEG") or salt to the reaction mixture. The salt concentration is typically, e.g., from about 0 mM to about 600 mM, or, e.g., from about 10 mM to about 100 mM. Exemplary salts optionally include, e.g., (NH4)2SO4, KCl, NaCl, or the like. The concentration of PEG is preferably from about 0% to about 20%, more preferably from about 5% to about 10%.
[0158] During elongation, the hybridized oligonucleotides are then incubated in the presence of a nucleic acid polymerase (e.g., Taq or Klenow polymerases) and/or a ligase (e.g., T4 or T7 DNA ligases), and dNTPs (i.e., dATP, dCTP, dGTP and dTTP). If regions of sequence identity (e.g., overlap regions) are large, Taq or other high-temperature polymerase can be used with a hybridization temperature of between about 45°C and about 65°C. If the areas of identity are relatively small, Klenow or other low-temperature polymerases can be used with a hybridization temperature of between about 20°C and about 30°C. The polymerase and/or ligase can be added to the reaction mixture prior to, simultaneously with, or after hybridization. As noted elsewhere in this disclosure, certain embodiments of the invention can involve denaturing the resulting elongated double-stranded nucleic acid sequences and then hybridizing and elongating those sequences again. This cycle can be repeated any desired number of times. Preferably the cycle is repeated from about 2 to about 100 times, e.g., from about 10 to about 40 times.
[0159] Other suitable in vitro oligonucleotide-mediated recombination techniques or variants including, e.g., library spiking (e.g., to further bias recombination), crossover PCR recombination (e.g., for recombining distantly related or even non-homologous sequences), alternative reaction conditions, materials, or the like that are optionally readily adapted by those of skill in the art for use with the regulated recombination methods described herein are provided in, e.g., WO 00/42561, entitled "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., WO 01/23401, entitled "USE OF CODON-VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SEQUENCE RECOMBINATION" by Welch et al., and in the other references cited herein. In addition, certain in vivo recombination techniques that are optionally adapted for use with the methods of the present invention are also provided in these references.
VI. IN SILICO RECOMBINATION
[0160] In silico recombination utilizes computer algorithms to perform simulated recombinations that involve genetic operators in digital systems. As applied to the present invention, oligonucleotide character string populations that include the adjusted overlapping oligonucleotide character strings generated according to the methods described herein are optionally recombined in silico. Following computer-simulated recombination, the methods also optionally include determining sequences of recombinant nucleic acids resulting from the in silico recombination of the selected population that includes adjusted overlapping oligonucleotide character strings. Optionally, these embodiments also include performing in silico simulations of activity for the recombinant nucleic acids or expression products therefrom. Additional options include synthesizing physical embodiments of recombinant nucleic acids or expression products resulting from in silico recombination and screening or selecting the products for desired traits or properties.
[0161] In brief, genetic operators (i.e., algorithms that represent given genetic events, such as recombination of two strands of homologous nucleic acids, point mutations, or the like) are used to model recombinational or mutational events, which can occur in one or more nucleic acids, e.g., by aligning nucleic acid sequence character strings (e.g., using standard alignment software, or by manual inspection and alignment) such as those representing homologous nucleic acids and predicting recombinatorial outcomes. The predicted recombinatorial outcomes are optionally used to produce corresponding molecules, e.g., by oligonucleotide synthesis and reassembly PCR. Additional details relating to in silico recombination are provided in, e.g., Published International Application No. WO 00/42560, entitled "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS," by Selifonov et al. See also, WO 00/42559, entitled "METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS" by Selifonov et al.
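Two simple genetic operators of the kind described above can be sketched on aligned character strings. The parental strings, the single-crossover model, and the seeded random number generator are assumptions made for illustration; they are not the operators of the cited applications.

```python
import random

# Hypothetical aligned parental nucleic acid character strings.
parents = ["ATGGCTAGCAAA", "ATGGCAAGTAAA", "ATGGCTTCCAAG"]


def crossover(first, second, point):
    """Recombination operator: first parent up to `point`, then second parent."""
    return first[:point] + second[point:]


def point_mutation(seq, position, base):
    """Mutation operator: replace the character at one position."""
    return seq[:position] + base + seq[position + 1:]


def recombine(parents, n_children, rng):
    """Simulate single-crossover recombination between random parent pairs."""
    children = []
    for _ in range(n_children):
        first, second = rng.sample(parents, 2)
        point = rng.randrange(1, len(first))
        children.append(crossover(first, second, point))
    return children


rng = random.Random(0)  # fixed seed so the simulation is repeatable
children = recombine(parents, 5, rng)
for child in children:
    print(child)
```

Every simulated recombinant is built column by column from parental characters, so the predicted outcomes can be passed directly to oligonucleotide design when physical molecules are wanted.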
VII. ITERATIVE OLIGONUCLEOTIDE-MEDIATED TECHNIQUES
[0162] In one aspect, the present invention provides iterative oligonucleotide-mediated recombination formats. These formats can be combined with standard recombination methods, also, optionally, in an iterative format.
[0163] In particular, recombinant nucleic acids produced by oligonucleotide-mediated recombination can be screened for activity and sequenced. The sequenced recombinant nucleic acids are aligned and regions of identity and diversity are identified. Oligonucleotides are then selected and optionally adjusted as described herein for recombination of the sequenced recombinant nucleic acids. This process of screening, sequencing active recombinant nucleic acids, and recombining the active recombinant nucleic acids can be iteratively repeated until a molecule with a desired property is obtained.
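The alignment step of this cycle, in which regions of identity and diversity are identified among sequenced active recombinants, can be sketched as follows. The sequences are hypothetical and are assumed to be already aligned and of equal length.

```python
from itertools import groupby

# Hypothetical aligned sequences of active recombinants from one round.
active_hits = ["ACDEFGH", "ACDQFGH", "ACNEFGK"]


def classify_columns(aligned):
    """Label each alignment column as identical or diverse across the hits."""
    return ["identity" if len(set(column)) == 1 else "diversity"
            for column in zip(*aligned)]


def regions(labels):
    """Collapse column labels into (label, start, end_exclusive) runs."""
    runs, position = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        runs.append((label, position, position + length))
        position += length
    return runs


def options_at_diverse_positions(aligned):
    """Collect the characters observed at each diverse column."""
    return {i: sorted(set(column))
            for i, column in enumerate(zip(*aligned))
            if len(set(column)) > 1}


hit_regions = regions(classify_columns(active_hits))
diverse_options = options_at_diverse_positions(active_hits)
print(hit_regions)
print(diverse_options)
```

The identity runs suggest where shared oligos suffice, and the options at the diverse columns define what the next round of oligonucleotides needs to encode.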
[0164] In addition, recombinant nucleic acids made using the oligonucleotide populations described herein can be cleaved and recombined using standard recombination methods, which are, optionally, reiterative. Standard recombination can be used in conjunction with oligonucleotide-mediated recombination and either or both steps are optionally reiteratively repeated.
[0165] One useful example of iterative recombination by oligonucleotide-mediated recombination of family-based oligonucleotides occurs when extremely fine grain recombination is desired. For example, small genes encoding small proteins such as defensins (antifungal proteins of about 50 amino acids), EF40 (an antifungal protein family of about 28 amino acids), peptide antibiotics, peptide insecticidal proteins, peptide hormones, many cytokines and many other small proteins, are difficult to recombine by standard recombination methods, because the recombination often occurs with a frequency that is roughly the same as the size of the gene to be recombined, limiting the diversity resulting from recombination. In contrast, oligonucleotide-mediated recombination methods can recombine essentially any region of diversity in any set of sequences, with recombination events (e.g., crossovers) occurring at any selected base-pair.
[0166] Thus, libraries of sequences prepared by recursive oligonucleotide-mediated recombination are optionally screened and selected for a desired property, and improved (or otherwise desirable) clones are sequenced (or otherwise deconvoluted, e.g., by real time PCR analysis such as FRET or TaqMan, or using restriction enzyme analysis) with the process being iteratively repeated to generate additional libraries of nucleic acids. Thus, additional recombination rounds are performed either by standard fragmentation-based recombination methods, or by sequencing positive clones, designing appropriate family shuffling oligonucleotides and performing a second round of recombination/selection to produce an additional library (which can be recombined as described). In addition, libraries made from different recombination rounds can also be recombined, either by sequencing/oligonucleotide recombination or by standard recombination methods.
VIII. POST-RECOMBINATION PROCESSING, SCREENING, AND SELECTION
[0167] The recombinant nucleic acids produced by the methods of the invention are optionally cloned into cells for activity screening (or used in in vitro transcription reactions to make products which are screened). Furthermore, the nucleic acids can be enriched, sequenced, expressed, amplified in vitro or treated in any other common recombinant method.
[0168] In particular, libraries of sequences resulting from oligonucleotide-mediated recombination are optionally enriched for desired linkages, e.g., those specified during the oligonucleotide design and adjustment processes described above. For example, sequences are optionally enriched via PCR amplification using oligonucleotide primers that hybridize only to alleles encoding the desired linkages such that only the desired linkages are amplified in the reaction mixture. Alternatively, oligonucleotides that hybridize only to desired linkages are coupled, e.g., to resins in columns to effect affinity-based chromatographic separation of desired sequences from reaction mixtures. Optionally, such columns are applied in series to enrich for subsets of subsets of desired linkage sets. Desired sequences so enriched are optionally amplified via PCR, or directly transformed into a suitable expression host, if present in plasmids or other vectors. Additional details regarding affinity-based separations are described in, e.g., Bailon et al. (Eds.), Affinity
Chromatography: Methods and Protocols (Methods in Molecular Biology), Humana Press (2000), Kline, Handbook of Affinity Chromatography, Marcel Dekker, Inc. (1993), and Chaiken (Ed.), Analytical Affinity Chromatography. CRC Press (1987). [0169] General texts that describe molecular biological techniques useful herein, including cloning, mutagenesis, library construction, screening assays, cell culture and the like include Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzvmology volume 152 Academic Press, Inc., San Diego, CA (Berger); Sambrook et al., Molecular Cloning - A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1989 (Sambrook) and Current Protocols in Molecular Biology. F.M. Ausubel et al., eds.,
Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., New York (supplemented through 2000) (Ausubel). Methods of transducing cells, including plant and animal cells, with nucleic acids are generally available, as are methods of expressing proteins encoded by such nucleic acids. In addition to Berger, Ausubel and Sambrook, useful general references for culture of animal cells include Freshney (Culture of Animal Cells, a Manual of Basic Technique, third edition Wiley-Liss, New York (1994)) and the references cited therein, Humason (Animal Tissue Techniques, fourth edition W.H. Freeman and Company (1979)) and Ricciardelli, et al., In Vitro Cell Dev. Biol. 25: 1016-1024 (1989). References for plant cell cloning, culture and regeneration include Payne et al. (1992) Plant Cell and Tissue Culture in Liquid Systems John Wiley & Sons, Inc. New York, NY (Payne); and Gamborg and Phillips (eds) (1995) Plant Cell, Tissue and Organ Culture; Fundamental Methods Springer Lab Manual, Springer-Verlag (Berlin Heidelberg New York) (Gamborg). A variety of cell culture media are described in Atlas and Parks (eds) The Handbook of Microbiological Media (1993) CRC Press, Boca Raton, FL (Atlas). Additional information for plant cell culture is found in available commercial literature such as the Life Science Research Cell Culture Catalogue (1998) from Sigma-Aldrich, Inc (St Louis, MO) (Sigma-LSRCCC) and, e.g., the Plant Culture Catalogue and supplement (1997) also from Sigma-Aldrich, Inc (St Louis, MO) (Sigma-PCCS).
[0170] Examples of techniques sufficient to direct persons of skill through in vitro amplification methods, useful, e.g., for amplifying oligonucleotide-recombined nucleic acids, including polymerase chain reactions (PCR), ligase chain reactions (LCR), Qβ-replicase amplifications and other RNA polymerase mediated techniques (e.g., NASBA), are available.
These techniques are found in Berger, Sambrook, and Ausubel, supra, as well as in Mullis et al., (1987) U.S. Patent No. 4,683,202; PCR Protocols A Guide to Methods and Applications (Innis et al. eds) Academic Press Inc. San Diego, CA (1990) (Innis); Arnheim & Levinson (October 1, 1990) C&EN 36-47; The Journal Of NIH Research (1991) 3, 81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86, 1173; Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87, 1874; Lomell et al. (1989) J. Clin. Chem 35, 1826; Landegren et al., (1988) Science 241, 1077-1080; Van Brunt (1990) Biotechnology 8, 291-294; Wu and Wallace, (1989) Gene 4, 560; Barringer et al. (1990) Gene 89, 117, and Sooknanan and Malek (1995) Biotechnology 13: 563-564. Improved methods of cloning in vitro amplified nucleic acids are described in Wallace et al., U.S. Pat. No. 5,426,039. Improved methods of amplifying large nucleic acids by PCR are summarized in Cheng et al. (1994) Nature 369: 684-685 and the references therein, in which PCR amplicons of up to 40kb are generated. One of skill will appreciate that essentially any RNA can be converted into a double stranded DNA suitable for restriction digestion, PCR expansion and sequencing using reverse transcriptase and a polymerase. See, Ausubel, Sambrook and Berger, all supra.
[0171] In one preferred method, reassembled sequences are checked for incorporation of family-based recombination oligonucleotides. This can be done by cloning and sequencing the nucleic acids, and/or by restriction digestion, e.g., as essentially taught in Sambrook, Berger and Ausubel, supra. In addition, sequences can be PCR amplified and sequenced directly. Thus, in addition to, e.g., Sambrook, Berger, Ausubel and Innis (supra), additional PCR sequencing methodologies are also particularly useful.
For example, direct sequencing of PCR generated amplicons by selectively incorporating boronated nuclease resistant nucleotides into the amplicons during PCR and digestion of the amplicons with a nuclease to produce sized template fragments has been performed (Porter et al. (1997) Nucleic Acids Research 25(8): 1611- 1617). In the methods, four PCR reactions on a template are performed, in each of which one of the nucleotide triphosphates in the PCR reaction mixture is partially substituted with a 2'deoxynucleoside 5'-[P-borano]-triphosphate. The boronated nucleotide is stochastically incorporated into PCR products at varying positions along the PCR amplicon in a nested set of PCR fragments of the template. An exonuclease that is blocked by incorporated boronated nucleotides is used to cleave the PCR amplicons. The cleaved amplicons are then separated by size using polyacrylamide gel electrophoresis, providing the sequence of the amplicon. An advantage of this method is that it uses fewer biochemical manipulations than performing standard Sanger-style sequencing of PCR amplicons.
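The readout logic of the four-reaction scheme described above can be illustrated in silico. The following Python sketch is an idealized illustration only and is not the protocol of Porter et al.: it assumes that every position downstream of the primer yields exactly one blocked fragment in the lane for its base, and the function names `simulate_lanes` and `read_ladder` are hypothetical.

```python
def simulate_lanes(template, primer_len):
    """Idealized model: for each base, list the fragment lengths that would
    remain after exonuclease digestion stops at a boronated nucleotide.
    Positions inside the primer are assumed to carry no boronated bases."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(template, start=1):
        if position > primer_len:
            lanes[base].append(position)
    return lanes


def read_ladder(lanes):
    """Read the sequence from the size-sorted fragments of the four lanes."""
    position_to_base = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in position_to_base:
                raise ValueError(f"fragment length {length} seen in two lanes")
            position_to_base[length] = base
    return "".join(position_to_base[p] for p in sorted(position_to_base))


if __name__ == "__main__":
    lanes = simulate_lanes("ACGTTGCA", primer_len=3)
    print(read_ladder(lanes))
```

In practice, gel resolution and incomplete incorporation would leave gaps or ambiguous bands, which this sketch does not model.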
[0172] Synthetic genes are amenable to conventional cloning and expression approaches; thus, properties of the genes and proteins they encode can readily be examined after their expression in a host cell. Synthetic genes can also be used to generate polypeptide products by in vitro (cell-free) transcription and translation. Polynucleotides and polypeptides can thus be examined for their ability to bind a variety of predetermined ligands, small molecules and ions, or polymeric and heteropolymeric substances, including other proteins and polypeptide epitopes, as well as microbial cell walls, viral particles, surfaces and membranes.
[0173] For example, many methods can be used for detecting polynucleotides encoding phenotypes associated with catalysis of chemical reactions by either polynucleotides directly, or by encoded polypeptides. Solely for the purpose of illustration, and depending on specifics of particular pre-determined chemical reactions of interest, these methods may include a multitude of techniques well known in the art which account for a physical difference between substrate(s) and product(s), or for changes in the reaction media associated with chemical reaction (e.g., changes in electromagnetic emissions, adsorption, dissipation, and fluorescence, whether UV, visible or infrared (heat)). These methods also can be selected from any combination of the following: mass-spectrometry; nuclear magnetic resonance; isotopically labeled materials, partitioning and spectral methods accounting for isotope distribution or labeled product formation; spectral and chemical methods to detect accompanying changes in ion or elemental compositions of reaction product(s) (including changes in pH, inorganic and organic ions and the like). Other assays can be based on the use of biosensors specific for reaction product(s), including those comprising antibodies with reporter properties, or those based on in vivo affinity recognition coupled with expression and activity of a reporter gene. Enzyme-coupled assays for reaction product detection and cell life-death-growth selections in vivo can also be used where appropriate. Regardless of the specific nature of the assays, they all are used to select a desired property, or combination of desired properties, encoded by the recombinant nucleic acids generated according to the methods described herein. Polynucleotides found to have desired properties are thus selected from the library.
[0174] The methods of the invention typically include selection and/or screening steps to select nucleic acids having desirable characteristics.
The relevant assay used for the selection will depend on the application. Many assays for proteins, receptors, ligands and the like are known. Formats include binding to immobilized components, cell or organismal viability, production of reporter compositions, and the like.
[0175] In high throughput assays, it is possible to screen up to several thousand different recombined variants in a single day. For example, each well of a microtiter plate can be used to run a separate assay, or, if concentration or incubation time effects are to be observed, every 5-10 wells can test a single variant (e.g., at different concentrations). Thus, a single standard microtiter plate can assay about 100 (e.g., 96) reactions. If 1536 well plates are used, then a single plate can easily assay from about 100 to about 1500 different reactions. It is possible to assay several different plates per day; assay screens for up to about 6,000-20,000 different assays (i.e., involving different nucleic acids, encoded proteins, concentrations, etc.) are possible using the integrated systems of the invention. More recently, microfluidic approaches to reagent manipulation have been developed, e.g., by Caliper Technologies (Mountain View, CA), which can provide very high throughput microfluidic assay methods.
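The plate arithmetic in the preceding paragraph can be made concrete with a short Python sketch. The helper names `variants_per_plate` and `plates_needed` are hypothetical, and the calculation assumes that every well on a plate is usable (no control or blank wells), which is a simplification.

```python
import math


def variants_per_plate(wells, wells_per_variant=1):
    """Number of distinct variants one plate can test when each variant
    occupies a fixed number of wells (e.g., a concentration series)."""
    if wells_per_variant < 1:
        raise ValueError("wells_per_variant must be at least 1")
    return wells // wells_per_variant


def plates_needed(n_assays, wells):
    """Number of plates required to run n_assays single-well assays."""
    return math.ceil(n_assays / wells)


if __name__ == "__main__":
    print(variants_per_plate(96))        # one assay per well
    print(variants_per_plate(96, 8))     # 8-well concentration series
    print(plates_needed(6000, 96))       # lower end of the daily range
    print(plates_needed(20000, 1536))    # upper end, high-density plates
```

Under these assumptions, the lower end of the stated daily range (about 6,000 assays) corresponds to 63 standard 96 well plates, and the upper end (about 20,000 assays) to 14 plates in the 1536 well format.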
[0176] In one aspect, cells, viral plaques, spores or the like, comprising regulated in vitro oligonucleotide-mediated recombination products or physical embodiments of in silico recombined nucleic acids, are separated on solid media to produce individual colonies (or plaques). Using an automated colony picker (e.g., the Q-bot, Genetix, U.K.), colonies or plaques are identified, picked, and up to 10,000 different mutants inoculated into 96 well microtiter dishes containing two 3 mm glass balls/well. The Q-bot does not pick an entire colony but rather inserts a pin through the center of the colony and exits with a small sampling of cells (or mycelia) and spores (or viruses in plaque applications). The time the pin is in the colony, the number of dips to inoculate the culture medium, and the time the pin is in that medium each affect inoculum size, and each parameter can be controlled and optimized.
[0177] The uniform process of automated colony picking, e.g., with the Q-bot, decreases human handling error and increases the rate of establishing cultures (roughly 10,000/4 hours). These cultures are optionally shaken in a temperature and humidity controlled incubator. Optional glass balls in the microtiter plates act to promote uniform aeration of cells and the dispersal of cellular (e.g., mycelial) fragments similar to the blades of a fermentor. Clones from cultures of interest can be isolated by limiting dilution. As also described supra, plaques or cells constituting libraries can also be screened directly for the production of proteins, either by detecting hybridization, protein activity, protein binding to antibodies, or the like. To increase the chances of identifying a pool of sufficient size, a prescreen that increases the number of mutants processed by 10-fold can be used. The goal of the primary screen is to quickly identify mutants having equal or better product titers than the parent strain(s) and to move only these mutants forward to liquid cell culture for subsequent analysis.
[0178] One approach to screening diverse libraries is to use a massively parallel solid-phase procedure to screen cells expressing recombined nucleic acids, e.g., which encode enzymes for enhanced activity. Massively parallel solid-phase screening apparatus using absorption, fluorescence, or FRET are available. See, e.g., United States Patent 5,914,245 to Bylina, et al. (1999); see also, http://www.kairos-scientific.com/; Youvan et al. (1999) "Fluorescence Imaging Micro-Spectrophotometer (FIMS)" Biotechnology et alia, <www.et-al.com> 1:1-16; Yang et al. (1998) "High Resolution Imaging Microscope (HIRIM)" Biotechnology et alia, <www.et-al.com> 4: 1-20; and Youvan et al. (1999) "Calibration of Fluorescence Resonance Energy Transfer in Microscopy Using Genetically Engineered GFP Derivatives on Nickel Chelating Beads" posted at www.kairos-scientific.com.
Following screening by these techniques, sequences of interest are typically isolated, optionally sequenced and the sequences used as set forth herein to design new sequences for in silico or other recombination methods.
[0179] Similarly, a number of well known robotic systems have also been developed for solution phase chemistries useful in assay systems. These systems include automated workstations like the automated synthesis apparatus developed by Takeda Chemical Industries, LTD. (Osaka, Japan) and many robotic systems utilizing robotic arms (Zymate II, Zymark Corporation, Hopkinton, Mass.; Orca, Beckman Coulter, Inc. (Fullerton, CA)) which mimic the manual synthetic operations performed by a scientist. Any of the above devices are suitable for use with the present invention, e.g., for high-throughput screening of molecules encoded by nucleic acids evolved as described herein. The nature and implementation of modifications to these devices (if any) so that they can operate as discussed herein will be apparent to persons skilled in the relevant art.
[0180] High throughput screening systems are commercially available (see, e.g., Zymark Corp., Hopkinton, MA; Air Technical Industries, Mentor, OH; Beckman Instruments, Inc. Fullerton, CA; Precision Systems, Inc., Natick, MA, etc.). These systems typically automate entire procedures including all sample and reagent pipetting, liquid dispensing, timed incubations, and final readings of the microplate in detector(s) appropriate for the assay. These configurable systems provide high throughput and rapid start up as well as a high degree of flexibility and customization.
[0181] The manufacturers of such systems provide detailed protocols for various high throughput screening assays. Thus, for example, Zymark Corp. provides technical bulletins describing screening systems for detecting the modulation of gene transcription, ligand binding, and the like.
[0182] A variety of commercially available peripheral equipment and software is available for digitizing, storing and analyzing digitized video or digitized optical or other assay images, e.g., using PC (Intel x86 or Pentium chip-compatible DOS™, OS2™, WINDOWS™, or WINDOWS NT™ based machines), MACINTOSH™, or UNIX based (e.g., SUN™ work station) computers.
IX. SYSTEMS
[0183] The present invention also provides systems, e.g., for designing oligonucleotides, for in silico recombination under the direction of genetic algorithms, and for other upstream and/or downstream operations. The systems typically include a logic device and a computer readable medium operably connected to the logic device that stores at least one computer program (e.g., as a component of the system's software), e.g., for designing oligonucleotides for regulated recombination. The computer program for designing oligonucleotides generally includes, e.g., a logic instruction which directs the logic device to receive one or more inputted parental polypeptide character strings, a logic instruction which directs the logic device to receive or determine a desired amino acid linkage, a logic instruction which directs the logic device to reverse-translate the one or more inputted parental polypeptide character strings into one or more parental polynucleotide character strings, a logic instruction which directs the logic device to segment the one or more parental polynucleotide character strings into two or more overlapping oligonucleotide character strings, a logic instruction which directs the logic device to determine an annealing frequency for one or more pairs of overlapping oligonucleotide character strings at a selected temperature, a logic instruction which directs the logic device to change at least one nucleotide in one or more portions of one or more overlap regions of one or more pairs of overlapping oligonucleotide character strings such that an annealing frequency for the one or more pairs of overlapping oligonucleotide character strings is substantially proportional to a desired amino acid linkage to provide a selected population of adjusted overlapping oligonucleotide character strings, a logic instruction which directs the logic device to receive one or more inputted changes to at least one nucleotide in one or more portions of one or more overlap regions of one or more pairs of overlapping oligonucleotide character strings to provide a selected population of adjusted overlapping oligonucleotide character strings, or the like.
[0184] In certain embodiments, the system includes various additional components for performing assorted operations. For example, the system optionally further includes an oligonucleotide synthesis device operably connected to the logic device for automatically synthesizing overlapping oligonucleotides corresponding to members of the selected population that includes adjusted overlapping oligonucleotide character strings. As another option, the system further includes a nucleic acid amplification device (e.g., a PCR thermocycler, etc.) operably connected to the logic device for producing recombinant nucleic acids from the synthesized overlapping oligonucleotides. Other system components, such as robotic liquid control armatures, image scanners, and appropriate software are described above and in the references cited herein.
[0185] Figure 9 further schematically illustrates steps performed under the control of system software in one embodiment of the invention. As shown, in step C1 a logic instruction directs the computer to receive inputted parental polypeptide character strings and in step C2, a logic instruction directs the computer to receive or determine a desired amino acid linkage. As also shown, in step C3 a logic instruction directs the computer to reverse-translate the inputted parental polypeptide character strings into parental polynucleotide character strings, in step C4 a logic instruction directs the computer to segment the parental polynucleotide character strings into overlapping oligonucleotide character strings, and in step C5 a logic instruction directs the computer to determine annealing frequencies for pairs of overlapping oligonucleotide character strings at a selected temperature. Thereafter, a logic instruction optionally directs the computer to change nucleotide sequences in overlap regions of pairs of overlapping oligonucleotide character strings so that an annealing frequency for the pairs of overlapping oligonucleotide character strings is substantially proportional to the desired amino acid linkage to provide a selected population that includes adjusted overlapping oligonucleotide character strings (step C6) or to receive inputted changes to nucleotide sequences in overlap regions of pairs of overlapping oligonucleotide character strings to provide a selected population that includes adjusted overlapping oligonucleotide character strings (step C7).
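Solely for illustration, steps C3 through C5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the software of the invention: the one-codon-per-amino-acid table `CODON`, the function names, the default oligonucleotide and overlap lengths, and the use of the Wallace rule (Tm = 2(A+T) + 4(G+C) °C) as a crude stand-in for an annealing-frequency calculation are all assumptions introduced here. For simplicity, all oligonucleotides are taken from one strand.

```python
# Illustrative codon choice: one valid codon per amino acid (assumption).
CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def reverse_translate(protein):
    """Step C3: polypeptide character string -> polynucleotide character string."""
    return "".join(CODON[aa] for aa in protein.upper())


def segment(dna, oligo_len=40, overlap=20):
    """Step C4: cut a polynucleotide string into overlapping oligonucleotide strings."""
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than oligo_len")
    step = oligo_len - overlap
    oligos, start = [], 0
    while True:
        oligos.append(dna[start:start + oligo_len])
        if start + oligo_len >= len(dna):
            break
        start += step
    return oligos


def wallace_tm(seq):
    """Wallace-rule melting temperature in degrees C (rough, short oligos only)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def overlap_tms(oligos, overlap):
    """Step C5 stand-in: Tm of the region shared by each adjacent oligo pair."""
    return [wallace_tm(a[-overlap:]) for a in oligos[:-1]]


if __name__ == "__main__":
    dna = reverse_translate("MKTAYIAKQR")
    oligos = segment(dna, oligo_len=12, overlap=6)
    for oligo, tm in zip(oligos, overlap_tms(oligos, 6) + [None]):
        print(oligo, tm)
```

A fuller treatment of step C5 would replace `wallace_tm` with a thermodynamic model evaluated at the selected temperature, and steps C6 and C7 would then edit the overlap regions until the computed annealing frequencies track the desired amino acid linkage.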
[0186] Various methods and genetic algorithms (GOs) can be used to perform desirable functions as noted herein. In addition, digital or analog systems such as digital or analog computer systems can control a variety of other functions such as the display and/or control of output files.
[0187] For example, standard desktop applications such as word processing software (e.g., Microsoft Word™ or Corel WordPerfect™) and database software (e.g., spreadsheet software such as Microsoft Excel™, Corel Quattro Pro™, or database programs such as Microsoft Access™ or Paradox™) can be adapted to the present invention by inputting one or more character strings into the software which is loaded into the memory of a digital system, and performing various operations on the character strings, including alignments, overlap adjustments, GO-guided recombinations as described herein. For example, systems can include the foregoing software having the appropriate character string information, e.g., used in conjunction with a user interface (e.g., a GUI in a standard operating system such as a Windows, Macintosh or LINUX system) to manipulate strings of characters, e.g., with GOs being programmed into the applications, or with the GOs being performed manually by the user (or both). As noted, specialized alignment programs such as PILEUP and BLAST can also be incorporated into the systems of the invention, e.g., for alignment of nucleic acids or proteins (i.e., corresponding character strings) as a preparatory step to performing an additional operation on the resulting aligned sequences, such as character string segmentation and overlap adjustments as described herein. Software for performing PCA can also be included in the digital system.
[0188] Systems for character string manipulation typically include, e.g., a digital computer with software for aligning and manipulating sequences according to the methods of the invention, or for performing PCA, or the like, as well as data sets entered into the software system comprising sequences to be manipulated. The computer can be, e.g., a PC (Intel x86 or Pentium chip-compatible DOS™, OS2™, WINDOWS™, WINDOWS NT™, WINDOWS95™, WINDOWS98™, LINUX, Apple-compatible, MACINTOSH™ compatible, Power PC compatible, or a UNIX compatible (e.g., SUN™ work station) machine) or other commercially common computer which is known to one of skill. Software for aligning or otherwise manipulating sequences can be constructed by one of skill using a standard programming language such as Visual Basic, Fortran, Basic, Java, or the like, according to the methods herein.
[0189] Any controller or computer optionally includes a monitor which can include, e.g., a cathode ray tube ("CRT") display, a flat panel display (e.g., active matrix liquid crystal display, liquid crystal display), or others. Computer circuitry is often placed in a box which includes numerous integrated circuit chips, such as a microprocessor, memory, interface circuits, and others. The box also optionally includes a hard disk drive, a floppy disk drive, a high capacity removable drive such as a writeable CD-ROM, and other common peripheral elements. Inputting devices such as a keyboard or mouse optionally provide for input from a user and for user selection of sequences to be compared or otherwise manipulated in the relevant computer system.
[0190] The computer typically includes appropriate software for receiving user instructions, either in the form of user input into a set of parameter fields, e.g., in a GUI, or in the form of preprogrammed instructions, e.g., preprogrammed for a variety of different specific operations. The software then converts these instructions to appropriate language for instructing the system to carry out any desired operation. For example, in addition to performing character string manipulations, a digital system can instruct an oligonucleotide synthesizer to synthesize oligonucleotides for gene reconstruction, or even to order oligonucleotides from commercial sources (e.g., by printing appropriate order forms or by linking to an order form on the internet).
[0191] The digital system can also include output elements for controlling nucleic acid synthesis (e.g., based upon a sequence or an alignment of sequences herein), i.e., an integrated system of the invention optionally includes an oligonucleotide synthesizer or an oligonucleotide synthesis controller. The system can include other operations which occur downstream from an alignment or other operation performed using a character string corresponding to a sequence herein, e.g., as noted above with reference to assays.
[0192] In one example, the software of the invention is embodied in a fixed media or transmissible program component containing logic instructions and/or data that when loaded into an appropriately configured computing device causes the device to perform a manipulation, such as a GO on one or more character strings as described herein. Figure 10 shows example digital device 1000 that should be understood to be a logical apparatus that can read instructions from media 1017, network ports, user input keyboard 1009, user input 1011 or other inputting means. Apparatus 1000 can thereafter use those instructions to direct modification of one or more character strings, e.g., to construct one or more data sets (e.g., comprising a plurality of pairs of adjusted overlapping oligonucleotide character strings). One type of logical apparatus that can embody the invention is a computer system as in computer system 1000 comprising CPU 1007, optional user input devices keyboard 1009, and GUI pointing device 1011, as well as peripheral components such as disk drives 1015 and monitor 1005 (which displays designed oligonucleotide character string populations or GO modified character strings and provides for simplified selection of subsets of such character strings by a user). Fixed media 1017 is optionally used to program the overall system and can include, e.g., a disk-type optical or magnetic media or other electronic memory storage element. Communication ports can be used to program the system and can represent any type of communication connection.
[0193] The invention can also be embodied within the circuitry of an application specific integrated circuit (ASIC) or programmable logic device (PLD). In such a case, the invention is embodied in a computer readable descriptor language that can be used to create an ASIC or PLD. The invention can also be embodied within the circuitry or logic processors of a variety of other digital apparatus, such as PDAs, laptop computer systems, displays, image editing equipment, etc.
[0194] In one preferred aspect, the digital system comprises a learning component where the outcomes of physical oligonucleotide assembly schemes (compositions, abundance of products, different processes) are monitored in conjunction with physical assays, and correlations are established. Successful and unsuccessful combinations are documented in a database to provide justification/preferences for user-based or digital-system-based selection of sets of parameters, e.g., for subsequent in silico recombination processes involving the same set of parental character strings/nucleic acids/proteins (or even unrelated sequences, where the information provides process improvement information). The correlations are used to modify subsequent in silico processes to optimize the process. This cycle of physical synthesis, selection, and correlation is optionally repeated to optimize the system. For example, a learning neural network can be used to optimize outcomes.
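One very simple way to document successful and unsuccessful combinations and to rank parameter sets for later rounds is sketched below in Python. This is an assumption-laden illustration, not the learning component of the invention: it uses a plain frequency table rather than a neural network, and the class name `OutcomeLog` and the parameter names (`overlap`, `temp`) are hypothetical.

```python
from collections import defaultdict


class OutcomeLog:
    """In-memory record of assembly outcomes keyed by parameter set."""

    def __init__(self):
        # parameter set -> [number of successes, number of trials]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, params, success):
        """Document one physical assembly outcome for a parameter set."""
        key = tuple(sorted(params.items()))
        entry = self._counts[key]
        entry[0] += int(bool(success))
        entry[1] += 1

    def ranked(self):
        """Return (success rate, parameters) pairs, best rate first."""
        rates = [(s / n, dict(key)) for key, (s, n) in self._counts.items()]
        return sorted(rates, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    log = OutcomeLog()
    log.record({"overlap": 20, "temp": 55}, True)
    log.record({"overlap": 20, "temp": 55}, False)
    log.record({"overlap": 25, "temp": 60}, True)
    for rate, params in log.ranked():
        print(f"{rate:.2f}", params)
```

The ranked list could then inform user-based or system-based selection of parameters for a subsequent in silico recombination round.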
[0195] Optionally, the methods of this invention can be implemented in a localized or distributed computing environment. In a distributed environment, the methods may be implemented on a single computer comprising multiple processors or on a multiplicity of computers. The computers can be linked, e.g., through a common bus, but more preferably the computers are nodes on a network. The network can be a generalized or a dedicated local or wide-area network and, in certain preferred embodiments, the computers may be components of an intranet or an internet. Web-based embodiments and systems generally are described further in, e.g., Published International Application No. WO 00/42560, entitled "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS," by Selifonov et al. See also, WO 00/42559, entitled "METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS," by Selifonov et al.
X. COMPUTER PROGRAM PRODUCT
[0196] The present invention also provides a computer program product that is optionally used to practice the methods described herein. In particular, the computer program product includes a computer readable medium having a computer program for designing oligonucleotides for regulated recombination. The computer program includes, e.g., a logic instruction which directs a logic device to receive inputted parental polypeptide character strings, a logic instruction which directs the logic device to receive or determine a desired amino acid linkage, a logic instruction which directs the logic device to reverse-translate the inputted parental polypeptide character strings into parental polynucleotide character strings, a logic instruction which directs the logic device to segment the parental polynucleotide character strings into two or more overlapping oligonucleotide character strings, a logic instruction which directs the logic device to determine an annealing frequency for pairs of overlapping oligonucleotide character strings at a selected temperature, a logic instruction which directs the logic device to change at least one nucleotide in portions of overlap regions of pairs of overlapping oligonucleotide character strings such that an annealing frequency for the pairs of overlapping oligonucleotide character strings is substantially proportional to a desired amino acid linkage to provide a selected population of adjusted overlapping oligonucleotide character strings, a logic instruction which directs the logic device to receive inputted changes to at least one nucleotide in portions of overlap regions of pairs of overlapping oligonucleotide character strings to provide a selected population of adjusted overlapping oligonucleotide character strings, or the like. The computer program is described further above with reference to Figure 8. 
Furthermore, the computer readable medium optionally includes, e.g., a CD-ROM, a floppy disk, a tape, a flash memory device or component, a system memory device or component, a hard drive, a data signal embodied in a carrier wave, or the like.
XI. INTEGRATION OF REGULATED OLIGONUCLEOTIDE-MEDIATED RECOMBINATION AND OTHER DIRECTED EVOLUTION TECHNOLOGIES
[0197] The methods of regulating oligonucleotide-mediated recombination disclosed herein constitute a self-sufficient and independent technology which is optionally practiced independent of other available directed evolution methods. However, one or more rounds of oligonucleotide-mediated recombination, whether performed physically or in silico, can be, and often are, practiced in combination with other nucleic acid recombination techniques, and/or in combination with site directed mutagenesis, or error-prone PCR (e.g., as alternating cycles of a directed evolution process) or other diversity generation methods. Thus, the methods described herein are optionally performed as a stand-alone technology, or, e.g., followed by other recombination techniques, mutagenesis, random priming PCR, etc.
[0198] The following publications describe a variety of recursive recombination procedures and/or related diversity generation methods which can be practiced in conjunction with the methods of the present invention: Stemmer, et al., (1999) "Molecular breeding of viruses for targeting and other clinical properties" Tumor Targeting 4:1-4; Ness et al. (1999) "DNA Shuffling of subgenomic sequences of subtilisin" Nature Biotechnology 17:893-896; Chang et al. (1999) "Evolution of a cytokine using DNA family shuffling" Nature Biotechnology 17:793-797; Minshull and Stemmer (1999) "Protein evolution by molecular breeding" Current Opinion in Chemical Biology 3:284-290; Christians et al. (1999) "Directed evolution of thymidine kinase for AZT phosphorylation using DNA family shuffling" Nature Biotechnology 17:259-264; Crameri et al. (1998) "DNA shuffling of a family of genes from diverse species accelerates directed evolution" Nature 391:288-291; Crameri et al. (1997) "Molecular evolution of an arsenate detoxification pathway by DNA shuffling," Nature Biotechnology 15:436-438; Zhang et al. (1997) "Directed evolution of an effective fucosidase from a galactosidase by DNA shuffling and screening" Proceedings of the National Academy of Sciences, U.S.A. 94:4504-4509; Patten et al. (1997) "Applications of DNA Shuffling to Pharmaceuticals and Vaccines" Current Opinion in Biotechnology 8:724-733; Crameri et al. (1996) "Construction and evolution of antibody-phage libraries by DNA shuffling" Nature Medicine 2:100-103; Crameri et al. (1996) "Improved green fluorescent protein by molecular evolution using DNA shuffling" Nature Biotechnology 14:315-319; Gates et al. (1996) "Affinity selective isolation of ligands from peptide libraries through display on a lac repressor 'headpiece dimer'" Journal of Molecular Biology 255:373-386; Stemmer (1996) "Sexual PCR and Assembly PCR" In: The Encyclopedia of Molecular Biology, VCH Publishers, New York, pp.447-457; Crameri and Stemmer (1995) "Combinatorial multiple cassette mutagenesis creates all the permutations of mutant and wildtype cassettes" BioTechniques 18:194-195; Stemmer et al., (1995) "Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides" Gene, 164:49-53; Stemmer (1995) "The Evolution of Molecular Computation" Science 270:1510; Stemmer (1995) "Searching Sequence Space" Bio/Technology 13:549-553; Stemmer (1994) "Rapid evolution of a protein in vitro by DNA shuffling" Nature 370:389-391; and Stemmer (1994) "DNA shuffling by random fragmentation and reassembly: In vitro recombination for molecular evolution." Proceedings of the National Academy of Sciences, U.S.A. 91:10747-10751.
[0199] Other diversity generating approaches can also be used to modify character strings or nucleic acids. Additional diversity can be introduced into input or output nucleic acids by methods which result in the alteration of individual nucleotides or groups of contiguous or non-contiguous nucleotides, i.e., mutagenesis methods. Mutational methods of generating diversity include, for example, site-directed mutagenesis (Ling et al. (1997) "Approaches to DNA mutagenesis: an overview" Anal. Biochem. 254(2):157-178; Dale et al. (1996) "Oligonucleotide-directed random mutagenesis using the phosphorothioate method" Methods Mol. Biol. 57:369-374; Smith (1985) "In vitro mutagenesis" Ann. Rev. Genet. 19:423-462; Botstein and Shortle (1985) "Strategies and applications of in vitro mutagenesis" Science 229:1193-1201; Carter (1986) "Site-directed mutagenesis" Biochem. J. 237:1-7; and Kunkel (1987) "The efficiency of oligonucleotide directed mutagenesis" in Nucleic Acids & Molecular Biology (Eckstein, F. and Lilley, D.M.J., eds., Springer Verlag, Berlin)); mutagenesis using uracil-containing templates (Kunkel (1985) "Rapid and efficient site-specific mutagenesis without phenotypic selection" Proc. Natl. Acad. Sci. USA 82:488-492; Kunkel et al. (1987) "Rapid and efficient site-specific mutagenesis without phenotypic selection" Methods in Enzymol. 154:367-382; and Bass et al. (1988) "Mutant Trp repressors with new DNA-binding specificities" Science 242:240-245); oligonucleotide-directed mutagenesis (Methods in Enzymol. 100:468-500 (1983); Methods in Enzymol. 154:329-350 (1987); Zoller and Smith (1982) "Oligonucleotide-directed mutagenesis using M13-derived vectors: an efficient and general procedure for the production of point mutations in any DNA fragment" Nucleic Acids Res. 10:6487-6500; Zoller and Smith (1983) "Oligonucleotide-directed mutagenesis of DNA fragments cloned into M13 vectors" Methods in Enzymol. 100:468-500; and Zoller and Smith (1987) "Oligonucleotide-directed mutagenesis: a simple method using two oligonucleotide primers and a single-stranded DNA template" Methods in Enzymol. 154:329-350); phosphorothioate-modified DNA mutagenesis (Taylor et al. (1985) "The use of phosphorothioate-modified DNA in restriction enzyme reactions to prepare nicked DNA" Nucl. Acids Res. 13:8749-8764; Taylor et al. (1985) "The rapid generation of oligonucleotide-directed mutations at high frequency using phosphorothioate-modified DNA" Nucl. Acids Res. 13:8765-8787; Nakamaye and Eckstein (1986) "Inhibition of restriction endonuclease Nci I cleavage by phosphorothioate groups and its application to oligonucleotide-directed mutagenesis" Nucl. Acids Res. 14:9679-9698; Sayers et al. (1988) "5'-3' Exonucleases in phosphorothioate-based oligonucleotide-directed mutagenesis" Nucl. Acids Res. 16:791-802; and Sayers et al. (1988) "Strand specific cleavage of phosphorothioate-containing DNA by reaction with restriction endonucleases in the presence of ethidium bromide" Nucl. Acids Res. 16:803-814); mutagenesis using gapped duplex DNA (Kramer et al. (1984) "The gapped duplex DNA approach to oligonucleotide-directed mutation construction" Nucl. Acids Res. 12:9441-9456; Kramer and Fritz (1987) "Oligonucleotide-directed construction of mutations via gapped duplex DNA" Methods in Enzymol. 154:350-367; Kramer et al. (1988) "Improved enzymatic in vitro reactions in the gapped duplex DNA approach to oligonucleotide-directed construction of mutations" Nucl. Acids Res. 16:7207; and Fritz et al. (1988) "Oligonucleotide-directed construction of mutations: a gapped duplex DNA procedure without enzymatic reactions in vitro" Nucl. Acids Res. 16:6987-6999).
[0200] Additional suitable methods include point mismatch repair (Kramer et al. (1984) "Point Mismatch Repair" Cell 38:879-887), mutagenesis using repair-deficient host strains (Carter et al. (1985) "Improved oligonucleotide site-directed mutagenesis using M13 vectors" Nucl. Acids Res. 13:4431-4443; and Carter (1987) "Improved oligonucleotide-directed mutagenesis using M13 vectors" Methods in Enzymol. 154:382-403), deletion mutagenesis (Eghtedarzadeh & Henikoff (1986) "Use of oligonucleotides to generate large deletions" Nucl. Acids Res. 14:5115), restriction-selection and restriction-purification (Wells et al. (1986) "Importance of hydrogen-bond formation in stabilizing the transition state of subtilisin" Phil. Trans. R. Soc. Lond. A 317:415-423), mutagenesis by total gene synthesis (Nambiar et al. (1984) "Total synthesis and cloning of a gene coding for the ribonuclease S protein" Science 223:1299-1301; Sakmar and Khorana (1988) "Total synthesis and expression of a gene for the α-subunit of bovine rod outer segment guanine nucleotide-binding protein (transducin)" Nucl. Acids Res. 14:6361-6372; Wells et al. (1985) "Cassette mutagenesis: an efficient method for generation of multiple mutations at defined sites" Gene 34:315-323; and Grundstrom et al. (1985) "Oligonucleotide-directed mutagenesis by microscale 'shot-gun' gene synthesis" Nucl. Acids Res. 13:3305-3316), and double-strand break repair (Mandecki (1986) "Oligonucleotide-directed double-strand break repair in plasmids of Escherichia coli: a method for site-specific mutagenesis" Proc. Natl. Acad. Sci. USA 83:7177-7181; and Arnold (1993) "Protein engineering for unusual environments" Current Opinion in Biotechnology 4:450-455). Additional details on many of the above methods can be found in Methods in Enzymology Volume 154, which also describes useful controls for trouble-shooting problems with various mutagenesis methods.
[0201] Additional details regarding the directed evolution of nucleic acids can be found in the following U.S. patents, PCT publications, and EPO publications: U.S. Pat. No. 5,605,793 to Stemmer (February 25, 1997), "Methods for In Vitro Recombination;" U.S. Pat. No. 5,811,238 to Stemmer et al. (September 22, 1998), "Methods for Generating Polynucleotides having Desired Characteristics by Iterative Selection and Recombination;" U.S. Pat. No. 5,830,721 to Stemmer et al. (November 3, 1998), "DNA Mutagenesis by Random Fragmentation and Reassembly;" U.S. Pat. No. 5,834,252 to Stemmer et al. (November 10, 1998), "End-Complementary Polymerase Reaction;" U.S. Pat. No. 5,837,458 to Minshull et al. (November 17, 1998), "Methods and Compositions for Cellular and Metabolic Engineering;" WO 95/22625, Stemmer and Crameri, "Mutagenesis by Random Fragmentation and Reassembly;" WO 96/33207 by Stemmer and Lipschutz, "End Complementary Polymerase Chain Reaction;" WO 97/20078 by Stemmer and Crameri, "Methods for Generating Polynucleotides having Desired Characteristics by Iterative Selection and Recombination;" WO 97/35966 by Minshull and Stemmer, "Methods and Compositions for Cellular and Metabolic Engineering;" WO 99/41402 by Punnonen et al., "Targeting of Genetic Vaccine Vectors;" WO 99/41383 by Punnonen et al., "Antigen Library Immunization;" WO 99/41369 by Punnonen et al., "Genetic Vaccine Vector Engineering;" WO 99/41368 by Punnonen et al., "Optimization of Immunomodulatory Properties of Genetic Vaccines;" EP 752008 by Stemmer and Crameri, "DNA Mutagenesis by Random Fragmentation and Reassembly;" EP 0932670 by Stemmer, "Evolving Cellular DNA Uptake by Recursive Sequence Recombination;" WO 99/23107 by Stemmer et al., "Modification of Virus Tropism and Host Range by Viral Genome Shuffling;" WO 99/21979 by Apt et al., "Human Papillomavirus Vectors;" WO 98/31837 by del Cardayre et al., "Evolution of Whole Cells and Organisms by Recursive Sequence Recombination;" WO 98/27230 by Patten and Stemmer, "Methods and Compositions for Polypeptide Engineering;" WO 98/27230 by Stemmer et al., "Methods for Optimization of Gene Therapy by Recursive Sequence Shuffling and Selection;" WO 00/00632, "Methods for Generating Highly Diverse Libraries;" WO 00/09679, "Methods for Obtaining in Vitro Recombined Polynucleotide Sequence Banks and Resulting Sequences;" WO 98/42832 by Arnold et al., "Recombination of Polynucleotide Sequences Using Random or Defined Primers;" WO 99/29902 by Arnold et al., "Method for Creating Polynucleotide and Polypeptide Sequences;" WO 98/41653 by Vind, "An in Vitro Method for Construction of a DNA Library;" WO 98/41622 by Borchert et al., "Method for Constructing a Library Using DNA Shuffling;" and WO 98/42727 by Pati and Zarling, "Sequence Alterations using Homologous Recombination."
[0202] Certain U.S. applications provide additional details regarding various methods of evolving nucleic acids, including "SHUFFLING OF CODON ALTERED GENES" by Patten et al., filed September 28, 1999 (USSN 09/407,800); "EVOLUTION OF WHOLE CELLS AND ORGANISMS BY RECURSIVE SEQUENCE RECOMBINATION" by del Cardayre et al., filed July 15, 1998 (USSN 09/166,188), and July 15, 1999 (USSN 09/354,922); "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., filed September 28, 1999 (USSN 09/408,392), and "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., filed January 18, 2000 (PCT/US00/01203); "USE OF CODON-VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SHUFFLING" by Welch et al., filed September 28, 1999 (USSN 09/408,393); "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES & POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed January 18, 2000 (PCT/US00/01202) and, e.g., "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES & POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed July 18, 2000 (USSN 09/618,579); "METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS" by Selifonov and Stemmer, filed January 18, 2000 (PCT/US00/01138); and "SINGLE-STRANDED NUCLEIC ACID TEMPLATE-MEDIATED RECOMBINATION AND NUCLEIC ACID FRAGMENT ISOLATION" by Affholter, filed Sept. 6, 2000 (USSN 09/656,549).
[0203] The following exemplify some of the different types of preferred formats for evolving nucleic acids in the context of the present invention, including, e.g., certain recombination based formats, which are optionally performed separately or in combination.
[0204] Nucleic acids can be recombined in vitro by any of a variety of techniques discussed in the references above, including, e.g., DNAse digestion of nucleic acids to be recombined followed by ligation and/or PCR reassembly of the nucleic acids. For example, sexual PCR mutagenesis can be used in which random (or pseudo random, or even non-random) fragmentation of the DNA molecule is followed by recombination, based on sequence similarity, between DNA molecules with different but related DNA sequences, in vitro, followed by fixation of the crossover by extension in a polymerase chain reaction. This process and many process variants are described in several of the references above including, e.g., in Stemmer (1994) Proc. Natl. Acad. Sci. USA 91:10747-10751.
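The similarity-dependent crossover described in paragraph [0204] can be illustrated in silico. The following is a minimal Python sketch, assuming two pre-aligned parent character strings of equal length and assuming that a crossover is permitted only immediately after a short window of exact identity. The function names, the window size, and the toy sequences are hypothetical and are not taken from the cited references.

```python
import random

def crossover_sites(a, b, window=4):
    """Return positions i where a and b are identical over [i - window, i)."""
    if len(a) != len(b):
        raise ValueError("parents must be pre-aligned to equal length")
    return [i for i in range(window, len(a))
            if a[i - window:i] == b[i - window:i]]

def recombine(a, b, rng, window=4):
    """Make one chimera by a single crossover at a site of local identity."""
    sites = crossover_sites(a, b, window)
    if not sites:
        return a  # no shared stretch, so no crossover is possible
    i = rng.choice(sites)
    return a[:i] + b[i:]

# Toy parents (hypothetical sequences differing at two positions).
parent_a = "ATGGCTAAAGGTCTGACC"
parent_b = "ATGGCAAAAGGTTTGACC"
child = recombine(parent_a, parent_b, random.Random(0))
```

Restricting crossovers to windows of identity is only a crude stand-in for annealing between related fragments; a physical reaction also depends on fragment length, melting temperature, and concentration, none of which is modeled here.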
[0205] Similarly, nucleic acids can be recursively recombined in vivo, e.g., by allowing recombination to occur between nucleic acids in cells. Many such in vivo recombination formats are set forth in the references noted above. Such formats optionally provide direct recombination between nucleic acids of interest, or provide recombination between vectors, viruses, plasmids, etc., comprising the nucleic acids of interest, as well as other formats. Details regarding such procedures are found in the references noted above.
[0206] Whole genome recombination methods can also be used in which whole genomes of cells or other organisms are recombined, optionally including spiking of the genomic recombination mixtures with desired library components (e.g., genes corresponding to the pathways of the present invention). These methods have many applications, including those in which the identity of a target gene is not known. Details regarding such methods are found, e.g., in WO 98/31837 by del Cardayre et al. "Evolution of Whole Cells and Organisms by Recursive Sequence Recombination;" and in, e.g., PCT/US99/15972 by del Cardayre et al., also entitled "Evolution of Whole Cells and Organisms by Recursive Sequence Recombination."
[0207] Details regarding various synthetic recombination approaches are found in the references noted above, including, e.g., "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., filed September 28, 1999 (USSN 09/408,392), and "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., filed January 18, 2000 (PCT/US00/01203); "USE OF CODON-VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SHUFFLING" by Welch et al., filed September 28, 1999 (USSN 09/408,393); "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed January 18, 2000 (PCT/US00/01202); "METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS" by Selifonov and Stemmer (PCT/US00/01138), filed January 18, 2000; and, e.g., "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES & POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed July 18, 2000 (USSN 09/618,579).
[0208] Extensive details regarding in silico recombination, including the use of genetic algorithms, genetic operators and the like in computer systems, combined with generation of corresponding nucleic acids (and/or proteins), as well as combinations of designed nucleic acids and/or proteins (e.g., based on cross-over site selection) as well as designed, pseudo-random or random recombination methods are described in "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES & POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed January 18, 2000 (PCT/US00/01202); "METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS" by Selifonov and Stemmer (PCT/US00/01138), filed January 18, 2000; and, e.g., "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES & POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed July 18, 2000 (USSN 09/618,579).
[0209] Many methods of accessing natural diversity, e.g., by hybridization of diverse nucleic acids or nucleic acid fragments to single-stranded templates, followed by polymerization and/or ligation to regenerate full-length sequences, optionally followed by degradation of the templates and recovery of the resulting modified nucleic acids, can be similarly used. In one method employing a single-stranded template, the fragment population derived from the genomic library or libraries is annealed with partial, or often approximately full-length, ssDNA or RNA corresponding to the opposite strand. Assembly of complex chimeric genes from this population is then mediated by nuclease-based removal of non-hybridizing fragment ends, polymerization to fill gaps between such fragments, and subsequent single-stranded ligation. The parental polynucleotide strand can be removed by digestion (e.g., if RNA or uracil-containing), magnetic separation under denaturing conditions (if labeled in a manner conducive to such separation) and other available separation/purification methods. Alternatively, the parental strand is optionally co-purified with the chimeric strands and removed during subsequent screening and processing steps. Additional details regarding this approach are found, e.g., in "SINGLE-STRANDED NUCLEIC ACID TEMPLATE-MEDIATED RECOMBINATION AND NUCLEIC ACID FRAGMENT ISOLATION" by Affholter, USSN 09/656,549, filed Sept. 6, 2000.
[0210] In another approach, single-stranded molecules are converted to double-stranded DNA (dsDNA) and the dsDNA molecules are bound to a solid support by ligand-mediated binding. After separation of unbound DNA, the selected DNA molecules are released from the support and introduced into a suitable host cell to generate a library enriched for sequences that hybridize to the probe. A library produced in this manner provides a desirable substrate for further diversification using any of the procedures described herein.
[0211] Any of the preceding general recombination formats can be practiced in a reiterative fashion (e.g., one or more cycles of mutation/recombination or other diversity generation methods, optionally followed by one or more selection methods, such as the stereoselectivity screens described herein) to generate a more diverse set of recombinant nucleic acids.
[0212] Mutagenesis employing polynucleotide chain termination methods has also been proposed (see, e.g., U.S. Patent No. 5,965,408, "Method of DNA reassembly by interrupting synthesis" to Short, and the references above), and can be applied to the present invention. In this approach, double stranded DNAs corresponding to one or more genes sharing regions of sequence similarity are combined and denatured, in the presence or absence of primers specific for the gene. The single stranded polynucleotides are then annealed and incubated in the presence of a polymerase and a chain terminating reagent (e.g., ultraviolet, gamma or X-ray irradiation; ethidium bromide or other intercalators; DNA binding proteins, such as single strand binding proteins, transcription activating factors, or histones; polycyclic aromatic hydrocarbons; trivalent chromium or a trivalent chromium salt; or abbreviated polymerization mediated by rapid thermocycling; and the like), resulting in the production of partial duplex molecules. The partial duplex molecules, e.g., containing partially extended chains, are then denatured and reannealed in subsequent rounds of replication or partial replication, resulting in polynucleotides which share varying degrees of sequence similarity and which are diversified with respect to the starting population of DNA molecules. Optionally, the products, or partial pools of the products, can be amplified at one or more stages in the process. Polynucleotides produced by a chain termination method, such as described above, are suitable substrates for any other described recombination format.
[0213] Diversity also can be generated in nucleic acids or populations of nucleic acids using a recombinational procedure termed "incremental truncation for the creation of hybrid enzymes" ("ITCHY") described in Ostermeier et al. (1999) "A combinatorial approach to hybrid enzymes independent of DNA homology" Nature Biotech 17:1205. This approach can be used to generate an initial library of variants which can optionally serve as a substrate for one or more in vitro or in vivo recombination methods. See also, Ostermeier et al. (1999) "Combinatorial Protein Engineering by Incremental Truncation," Proc. Natl. Acad. Sci. USA 96:3562-67; Ostermeier et al. (1999), "Incremental Truncation as a Strategy in the Engineering of Novel Biocatalysts," Biological and Medicinal Chemistry 7:2139-44.
[0214] Mutational methods which result in the alteration of individual nucleotides or groups of contiguous or non-contiguous nucleotides can be favorably employed to introduce nucleotide diversity. Many mutagenesis methods are found in the above-cited references; additional details regarding mutagenesis methods can be found in the following, which can also be applied to the present invention.
[0215] For example, error-prone PCR can be used to generate nucleic acid variants. Using this technique, PCR is performed under conditions where the copying fidelity of the DNA polymerase is low, such that a high rate of point mutations is obtained along the entire length of the PCR product. Examples of such techniques are found in the references above and, e.g., in Leung et al. (1989) Technique 1:11-15 and Caldwell et al. (1992) PCR Methods Applic. 2:28-33. Similarly, assembly PCR can be used, in a process which involves the assembly of a PCR product from a mixture of small DNA fragments. A large number of different PCR reactions can occur in parallel in the same reaction mixture, with the products of one reaction priming the products of another reaction.
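The effect of low copying fidelity described in paragraph [0215] can be mimicked on character strings. The sketch below is a simplified Python model, assuming a uniform per-base substitution probability and perfect doubling in every cycle; real error-prone PCR shows biased mutation spectra and incomplete amplification. The function names and parameters are hypothetical.

```python
import random

BASES = "ACGT"

def error_prone_copy(template, error_rate, rng):
    """Copy a template, substituting each base with probability error_rate."""
    product = []
    for base in template:
        if rng.random() < error_rate:
            # Substitute with one of the other three bases, chosen uniformly.
            product.append(rng.choice([b for b in BASES if b != base]))
        else:
            product.append(base)
    return "".join(product)

def amplify(template, cycles, error_rate, rng):
    """Double the pool each cycle; every new strand is an error-prone copy."""
    pool = [template]
    for _ in range(cycles):
        pool += [error_prone_copy(s, error_rate, rng) for s in pool]
    return pool

variants = amplify("ATGGCTAAAGGTCTGACC", cycles=5, error_rate=0.02,
                   rng=random.Random(7))
```

Because each strand is copied from a strand that may already carry substitutions, point mutations accumulate along the entire length of the product over successive cycles, which is the behavior the paragraph describes.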
[0216] Oligonucleotide directed mutagenesis can be used to introduce site-specific mutations in a nucleic acid sequence of interest. Examples of such techniques are found in the references above and, e.g., in Reidhaar-Olson et al. (1988) Science 241:53-57. Similarly, cassette mutagenesis can be used in a process that replaces a small region of a double stranded DNA molecule with a synthetic oligonucleotide cassette that differs from the native sequence. The oligonucleotide can contain, e.g., completely and/or partially randomized native sequence(s).
[0217] Recursive ensemble mutagenesis is a process in which an algorithm for protein mutagenesis is used to produce diverse populations of phenotypically related mutants, members of which differ in amino acid sequence. This method uses a feedback mechanism to monitor successive rounds of combinatorial cassette mutagenesis. Examples of this approach are found in Arkin and Youvan (1992) Proc. Natl. Acad. Sci. USA 89:7811-7815.
[0218] Exponential ensemble mutagenesis can be used for generating combinatorial libraries with a high percentage of unique and functional mutants. Small groups of residues in a sequence of interest are randomized in parallel to identify, at each altered position, amino acids which lead to functional proteins. Examples of such procedures are found in Delegrave and Youvan (1993) Biotechnology Research 11:1548-1552.
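The idea in paragraph [0218] of identifying, at each altered position, the residues compatible with function and then combining them can be sketched as follows. This Python example is a simplified illustration only, not the cited procedure: it assumes that functional variants have already been identified by some screen, and it tallies the residues observed at the randomized positions before enumerating their combinations. All names and the toy peptide are hypothetical.

```python
from itertools import product

def allowed_residues(functional_variants, positions):
    """Collect, per randomized position, the residues seen in functional variants."""
    allowed = {p: set() for p in positions}
    for seq in functional_variants:
        for p in positions:
            allowed[p].add(seq[p])
    return allowed

def combinatorial_library(parent, allowed):
    """Yield every combination of allowed residues in the parent background."""
    positions = sorted(allowed)
    for combo in product(*(sorted(allowed[p]) for p in positions)):
        seq = list(parent)
        for p, aa in zip(positions, combo):
            seq[p] = aa
        yield "".join(seq)

# Toy data: positions 1 and 2 of a four-residue peptide were randomized.
hits = ["MKLV", "MRLV", "MKIV"]
allowed = allowed_residues(hits, positions=[1, 2])
library = list(combinatorial_library("MKLV", allowed))
```

With two allowed residues at each of two positions the library has four members; the size grows multiplicatively with each additional position, so restricting every position to residues already seen in functional variants is what keeps the fraction of functional members high.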
[0219] In vivo mutagenesis can be used to generate random mutations in any cloned DNA of interest by propagating the DNA, e.g., in a strain of E. coli that carries mutations in one or more of the DNA repair pathways. These "mutator" strains have a higher random mutation rate than that of a wild-type parent. Propagating the DNA in one of these strains will eventually generate random mutations within the DNA. Such procedures are described in the references noted above.
[0220] Other procedures for introducing diversity into a genome, e.g., a bacterial, fungal, animal or plant genome can be used in conjunction with the above described and/or referenced methods. For example, in addition to the methods above, techniques have been proposed which produce nucleic acid multimers suitable for transformation into a variety of species (see, e.g., Schellenberger, U.S. Patent No. 5,756,316 and the references above). Transformation of a suitable host with such multimers, consisting of genes that are divergent with respect to one another (e.g., derived from natural diversity or through application of site directed mutagenesis, error prone PCR, passage through mutagenic bacterial strains, and the like), provides a source of nucleic acid diversity for DNA diversification, e.g., by an in vivo recombination process as indicated above.
[0221] Alternatively, a multiplicity of monomeric polynucleotides sharing regions of partial sequence similarity can be transformed into a host species and recombined in vivo by the host cell. Subsequent rounds of cell division can be used to generate libraries, members of which include a single, homogeneous population, or pool, of monomeric polynucleotides. Alternatively, the monomeric nucleic acid can be recovered by standard techniques, e.g., PCR and/or cloning, and recombined in any of the recombination formats, including recursive recombination formats, described above.
[0222] Methods for generating multispecies expression libraries have been described (in addition to the reference noted above, see, e.g., Peterson et al. (1998) U.S. Pat. No. 5,783,431 "METHODS FOR GENERATING AND SCREENING NOVEL METABOLIC PATHWAYS," and Thompson et al. (1998) U.S. Pat. No. 5,824,485 "METHODS FOR GENERATING AND SCREENING NOVEL METABOLIC PATHWAYS") and their use to identify protein activities of interest has been proposed (in addition to the references noted above, see Short (1999) U.S. Pat. No. 5,958,672 "PROTEIN ACTIVITY SCREENING OF CLONES HAVING DNA FROM UNCULTIVATED MICROORGANISMS"). Multispecies expression libraries include, in general, libraries comprising cDNA or genomic sequences from a plurality of species or strains, operably linked to appropriate regulatory sequences, in an expression cassette. The cDNA and/or genomic sequences are optionally randomly ligated to further enhance diversity. The vector can be a shuttle vector suitable for transformation and expression in more than one species of host organism, e.g., bacterial species, eukaryotic cells. In some cases, the library is biased by preselecting sequences which encode a protein of interest, or which hybridize to a nucleic acid of interest. Any such libraries can be provided as substrates for any of the methods described herein.
[0223] The above described procedures have been largely directed to increasing nucleic acid and/or encoded protein diversity. However, in many cases, not all of the diversity is useful, e.g., functional, and contributes merely to increasing the background of variants that must be screened or selected to identify the few favorable variants. In some applications, it is desirable to preselect or prescreen libraries (e.g., an amplified library, a genomic library, a cDNA library, a normalized library, etc.) or other substrate nucleic acids prior to diversification, e.g., by recombination-based mutagenesis procedures, or to otherwise bias the substrates towards nucleic acids that encode functional products. For example, in the case of antibody engineering, it is possible to bias the diversity generating process toward antibodies with functional antigen binding sites by taking advantage of in vivo recombination events prior to manipulation by any of the described methods. For example, recombined CDRs derived from B cell cDNA libraries can be amplified and assembled into framework regions (e.g., Jirholt et al. (1998) "Exploiting sequence space: shuffling in vivo formed complementarity determining regions into a master framework" Gene 215:471) prior to diversifying according to any of the methods described herein.
[0224] Libraries can be biased towards nucleic acids which encode proteins with desirable enzyme activities, such as the ability to stereoselectively catalyze a given reaction. For example, after identifying a clone from a library which exhibits a specified activity, the clone can be mutagenized using any known method for introducing DNA alterations. A library comprising the mutagenized homologues is then screened for a desired activity, which can be the same as or different from the initially specified activity. An example of such a procedure is proposed in Short (1999) U.S. Patent No. 5,939,250 for "PRODUCTION OF ENZYMES HAVING DESIRED ACTIVITIES BY MUTAGENESIS." Desired activities can be identified by any method known in the art. For example, WO 99/10539 proposes that gene libraries can be screened by combining extracts from the gene library with components obtained from metabolically rich cells and identifying combinations which exhibit the desired activity. It has also been proposed (e.g., WO 98/58085) that clones with desired activities can be identified by inserting bioactive substrates into samples of the library, and detecting bioactive fluorescence corresponding to the product of a desired activity using a fluorescent analyzer, e.g., a flow cytometry device, a CCD, a fluorometer, or a spectrophotometer.
[0225] Libraries can also be biased towards nucleic acids which have specified characteristics, e.g., hybridization to a selected nucleic acid probe. For example, application WO 99/10539 proposes that polynucleotides encoding a desired activity (e.g., an enzymatic activity, for example: a lipase, an esterase, a protease, a glycosidase, a glycosyl transferase, a phosphatase, a kinase, an oxygenase, a peroxidase, a hydrolase, a hydratase, a nitrilase, a transaminase, an amidase or an acylase) can be identified from among genomic DNA sequences in the following manner. Single stranded DNA molecules from a population of genomic DNA are hybridized to a ligand-conjugated probe. The genomic DNA can be derived from either a cultivated or uncultivated microorganism, or from an environmental sample. Alternatively, the genomic DNA can be derived from a multicellular organism, or a tissue derived therefrom. Second strand synthesis can be conducted directly from the hybridization probe used in the capture, with or without prior release from the capture medium, or by a wide variety of other strategies known in the art. Alternatively, the isolated single-stranded genomic DNA population can be fragmented without further cloning and used directly in, e.g., a recombination-based approach that employs a single-stranded template, as described above.
[0226] "Non-Stochastic" methods of generating nucleic acids and polypeptides are alleged in Short "Non-Stochastic Generation of Genetic Vaccines and Enzymes" WO 00/46344. These methods, including proposed non-stochastic polynucleotide reassembly and site-saturation mutagenesis methods, can be applied to the present invention as well. Random or semi-random mutagenesis using doped or degenerate oligonucleotides is also described in, e.g., Arkin and Youvan (1992) "Optimizing nucleotide mixtures to encode specific subsets of amino acids for semi-random mutagenesis" Biotechnology 10:297-300; Reidhaar-Olson et al. (1991) "Random mutagenesis of protein sequences using oligonucleotide cassettes" Methods Enzymol. 208:564-86; Lim and Sauer (1991) "The role of internal packing interactions in determining the structure and stability of a protein" J. Mol. Biol. 219:359-76; Breyer and Sauer (1989) "Mutational analysis of the fine specificity of binding of monoclonal antibody 51F to lambda repressor" J. Biol. Chem. 264:13355-60; and "Walk-Through Mutagenesis" (Crea, R.; U.S. Patents 5,830,650 and 5,798,208, and EP Patent 0527809 B1).
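When degenerate oligonucleotides of the kind mentioned in paragraph [0226] are designed, it is useful to know which amino acids a given degenerate codon can encode. The following minimal Python sketch assumes the standard genetic code and the IUPAC nucleotide ambiguity symbols; the function name is hypothetical, and the sketch ignores the relative frequencies of the encoded residues.

```python
from itertools import product

# IUPAC ambiguity codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code with codons enumerated in TCAG order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product("TCAG", repeat=3), AMINO)}

def encoded_amino_acids(degenerate_codon):
    """Return the set of residues ('*' = stop) a degenerate codon can encode."""
    choices = [IUPAC[b] for b in degenerate_codon.upper()]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}

nnk = encoded_amino_acids("NNK")  # the 32 codons with G or T in the third position
```

For example, `encoded_amino_acids("GST")` covers the codons GCT and GGT and so returns alanine and glycine only, whereas NNK covers all twenty amino acids plus the single stop codon TAG.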
[0227] It will readily be appreciated that any of the above described techniques suitable for enriching a library prior to diversification are optionally also used to screen the products, or libraries of products, produced by the diversity generating methods.
[0228] Kits for mutagenesis, library construction and other diversity generation methods are also commercially available. For example, kits are available from, e.g., Stratagene (e.g., QuikChange™ site-directed mutagenesis kit; and Chameleon™ double-stranded, site-directed mutagenesis kit), Bio/Can Scientific, Bio-Rad (e.g., using the Kunkel method described above), Boehringer Mannheim Corp., Clontech Laboratories, DNA Technologies, Epicentre Technologies (e.g., 5 prime 3 prime kit), Genpak Inc, Lemargo Inc, Life Technologies (Gibco BRL), New England Biolabs, Pharmacia Biotech, Promega Corp., Quantum Biotechnologies, Amersham International plc (e.g., using the Eckstein method above), and Anglian Biotechnology Ltd (e.g., using the Carter/Winter method above).
[0229] The above references provide many mutational formats, including recombination, recursive recombination, recursive mutation and combinations of recombination with other forms of mutagenesis, as well as many modifications of these formats. Regardless of the diversity generation format that is used, the nucleic acids of the invention can be recombined (with each other, or with related (or even unrelated) sequences) to produce a diverse set of recombinant nucleic acids, including, e.g., sets of homologous nucleic acids, as well as corresponding polypeptides.
XII. TARGET MOLECULES
[0230] Essentially any nucleic acid can be evolved according to the methods described herein. Common sequence repositories for known proteins include GenBank®, EMBL, DDBJ and the NCBI. Other repositories can easily be identified by searching the internet. Suitable nucleic acids include those that are commercially available. Specific target sequences of interest typically include commercially important coding sequences or sequences complementary thereto. These include, e.g., various pharmaceutically, agriculturally, and/or industrially relevant nucleic acids, including those noted above (and in the references cited herein) and those described herein below. The exemplary enzymes and other polypeptides listed herein, and sequences corresponding to them, are offered to illustrate but not to limit the present invention. For example, more extensive lists of suitable target molecules are provided in, e.g., USSN 09/656,549, entitled "SINGLE-STRANDED NUCLEIC ACID TEMPLATE-MEDIATED RECOMBINATION AND NUCLEIC ACID FRAGMENT ISOLATION," by Affholter, filed September 6, 2000. Additional sequences corresponding to these and to other potential targets are known in the art and are readily obtainable by cloning, PCR, synthesis, or the like. Any of the following proteins, nucleic acids, enzymes, pathways, or other systems can be modified, produced, or otherwise developed according to the methods described herein. For example, any of the proteins, nucleic acids, enzymes, pathways, other systems, or character string representations thereof can be modified via the regulated in vitro oligonucleotide-mediated recombination methods or the related in silico-based simulations described herein.
Pharmaceutically-Related Parental Nucleic Acids and Expression Products
[0231] One class of parental nucleic acid sequences well suited for use as substrates in the methods described herein includes those encoding expression products with at least potential pharmaceutical relevance. These expression products include, e.g., therapeutic proteins, transcriptional and expression activators, vaccines, small proteins, antibodies, or the like. More specific examples of these pharmaceutically-related target molecules suitable for use in the methods of the present invention are provided in the references cited herein.
Agriculturally-Related Parental Nucleic Acids and Expression Products
[0232] Other proteins relevant to non-medical uses, such as inhibitors of transcription or toxins of crop pests, e.g., insects, fungi, weed plants, and the like, are also preferred targets for recombination by one or more of the methods herein. Many agriculturally-related target sequences which are suitably used in the methods of the invention are disclosed in a variety of patent-related publications and the references noted herein, including, e.g., WO 00/09727 "DNA Shuffling to Produce Herbicide Selective Crops;" WO 99/57128 "Optimization of Pest Resistance Genes Using Shuffling;" USSN 60/167,452 "Shuffling of Agrobacterium and Viral Genes, Plasmids and Genomes for Improved Plant Transformation;" WO 00/20573 "DNA Shuffling to Produce Nucleic Acids for Mycotoxin Detoxification;" WO 00/28018 "Modified ADP-Glucose Pyrophosphorylase for Improvement and Optimization of Plant Phenotypes;" WO 00/28017 "Modified Phosphoenolpyruvate Carboxylase for Improvement and Optimization of Plant Phenotypes;" WO 00/28008 "Modified Ribulose 1,5-Bisphosphate Carboxylase/Oxygenase;" PCT/US00/09285 "Modified Lipid Production;" PCT/US00/09840 "Modified Starch Metabolism Enzymes and Encoding Genes for Improvement and Optimization of Plant Phenotypes;" and USSN 60/202,233 "Evolution of Plant Disease Response Pathways to Enable the Development of Plant Based Biological Sensors and to Develop Novel Disease Resistance Strategies;" which are each incorporated by reference herein in their entirety for all purposes. Any of these can be made, modified or developed according to the methods described herein.
Industrially-Related Parental Nucleic Acids and Expression Products
[0233] Industrially important enzymes such as monooxygenases (e.g., P450s, DBT monooxygenases encoded by the dszC gene from, e.g., Rhodococcus spp., or the like), dioxygenases, lipases, esterases, proteases, glycosidases, glycosyl transferases, phosphatases, kinases, haloperoxidases, lignin peroxidases, diarylpropane peroxidases, epoxide hydrolases, nitrile hydratases, nitrilases, transaminases, amidases, acylases, dehalogenases, isomerases, epimerases, glucose isomerases, amino acid racemases, and nucleases are also generally preferred targets. Proteins which aid in folding such as the chaperonins are preferred targets. Many of these and other industrial enzymes, and corresponding nucleic acid sequences, are provided in various published documents including, e.g., WO 00/01712 "CHEMICALLY MODIFIED PROTEINS WITH A CARBOHYDRATE MOIETY," WO 00/37658 "CHEMICALLY MODIFIED ENZYMES WITH MULTIPLE CHARGED VARIANTS," WO 00/28007 "CHEMICALLY MODIFIED MUTANT SERINE HYDROLASES SHOW IMPROVED CATALYTIC ACTIVITY AND CHIRAL SELECTIVITY," WO 99/37324 "MODIFIED ENZYMES AND THEIR USE FOR PEPTIDE SYNTHESIS," WO 99/34003 "PROTEASES FROM GRAM POSITIVE ORGANISMS," WO 99/31959 "ACCELERATED STABILITY TEST," and WO 98/23732 "CHEMICALLY MODIFIED ENZYMES," all of which are incorporated herein by reference in their entirety for all purposes.
XIII. KITS
[0234] The present invention also provides kits that typically include systems, system software, modules, and workstations for performing the regulated in vitro oligonucleotide-mediated recombination and in silico-based embodiments, and other methods described herein. In certain embodiments, a kit includes only system software, e.g., the computer program products described herein. A kit optionally contains additional components for the assembly and/or operation of a multimodule workstation of the invention including, but not restricted to, robotic elements (e.g., a track robot, a robotic armature, or the like), reagent, solid phase synthesis unit, and/or reaction vessel handling devices, and computers (including, e.g., input/output devices, CPUs, or the like). Kits are optionally packaged to include reagents, control/calibrating materials, solid phase synthesis units, and/or reaction vessels for performing the methods of the invention. In the case of pre-packaged reagents, the kits optionally include pre-measured or pre-dosed reagents that are ready to incorporate into the synthetic methods without measurement, e.g., pre-measured fluid aliquots, or pre-weighed or pre-measured solid reagents that can be easily reconstituted by the end-user of the kit. Generally, reagents are provided in a stabilized form, so as to prevent degradation or other loss during prolonged storage, e.g., from leakage. A number of stabilizing processes are widely used for reagents that are to be stored, such as the inclusion of chemical stabilizers (i.e., enzymatic inhibitors, microcides/bacteriostats, anticoagulants), the physical stabilization of the material, e.g., through immobilization on a solid support, entrapment in a matrix (i.e., a gel), lyophilization, or the like. Kits typically include appropriate instructions for using the reagents, practicing the methods, and operating the systems. Kits also typically include packaging materials or containers for holding kit components.
XIV. EXAMPLE: IDENTIFYING FUNCTIONAL AND ANCESTRAL COVARIATION
Introduction
[0235] As proteins evolve, amino acid changes are constrained by selection for function. Proteins frequently contain pairs of amino acids that appear to change together. Analysis of naturally occurring sets of orthologs cannot distinguish covariation that reflects functional requirements of the protein from that which simply results from a common ancestral origin. As described in this example, every naturally occurring amino acid variant was independently recombined within a set of ~15 subtilisin orthologs and the corresponding enzyme activity was measured. In this family of orthologs, only 5% of the residue pairs analyzed are functionally constrained, which indicates that proteins have evolved to minimize the interdependence of allowed amino acid changes to facilitate adaptation. As described herein, this has implications with respect to the plasticity of proteins, sequence-structure-function correlations, and protein engineering strategies.
[0236] During divergent evolution, protein sequences change while the biochemical function of the protein is retained. Correlated changes between functionally linked residues in a protein are essential for the preservation of protein structure and function, and knowledge of coevolving residues is key in protein engineering studies. The functional links between covarying residues may be due to structural contact, overall charge distribution(1) or any indirect effect such as interactions with substrate. Independent mutations among functionally linked residues are often disadvantageous, but two simultaneous mutations may allow the protein to retain function. Alternatively, two or more residues may appear to be covarying simply due to a common ancestral origin. Current analytical tools are severely limited in the ability to separate the functional from the phylogenetic (ancestral) covariation in a family of orthologous proteins. Statistical tools are limited both by the amount of data available to infer covariation and by the evolutionary models used to explain the data(1-4).
[0237] The present example describes experiments in which all amino acid residues in a family of 15 subtilisins were deliberately uncoupled by synthetic DNA recombination(5). By allowing all residues to vary independently of context and then screening for function, any covariation derived from common ancestral origin is eliminated and only covariation that contributes to function is retained. Functional subtilisin variants were analyzed using mutual information theory to assess covariation between residues. Most of the covariation observed among the parental sequences was not preserved in functional chimeric subtilisins, indicating that it is primarily a measure of common ancestral descent. Further, several covarying residues were identified that are not seen among the parents due to lack of sufficient sampling in parental sequences. These results have implications for the understanding of evolution, the relationship between sequence and function, and for library design in directed evolution experiments.
Analysis
[0238] Subtilisins are commercially important serine endoproteases with broad specificity for peptide bonds and relative ease of production. A family of 26 subtilisin orthologs was obtained by PCR amplification from natural Bacillus isolates(6). Fifteen of the subtilisin sequences were used as starting points for synthetic shuffling. The parental set of 15 subtilisin orthologs was between 79% and 99% identical at the amino acid level. The amino acid alignments of the orthologs used in the experiment contained 54 positions having two or more alternate amino acids. Most (46 of 54) of these positions encoded one of two different amino acids. In addition, five positions had one of three different amino acids and three positions had one of four different amino acids. Synthetic oligonucleotide-based recombination is homology independent and allows each allowed residue at any given position an equal probability of being incorporated into the final product(5). This is in contrast to other recombination formats where the distribution of any single residue is dependent on its abundance and context among the parental genes. Synthetic oligonucleotide-based recombination as described by Ness et al.(5) (Figure 11) results in a library of sequences that is systematically varied at the single-residue level and is rich in natural diversity.
[0239] By completely uncoupling the spatial linkage of amino acid residues, the theoretical size of the entire library is 2^46 x 3^5 x 4^3 = ~10^18. Even though this is a very large library, much larger than could ever be screened, it still only incorporates amino acids that have been pre-selected during the Darwinian natural selection process. Each individual amino acid residue present in the combinatorial library is the outcome of millions of years of selective pressure in multiple parallel evolutionary processes. Despite the vast total size of the library, characterization of only a small subset of the library is sufficient to test all covarying residue pairs for correlation with function. Any pair of covarying amino acid residues is sampled many times over among the fully characterized variants, as are all triplets and most of the quadruplets. In the most extreme case of a pair of positions each encoding one of four alternate amino acids there is a total of 16 possible combinations. If 100 variants are sampled, each combination will be present at an average of 6X coverage (100/16) and each pair is sampled at ~95% confidence. Most residue pairs have only 4 possible combinations, resulting in 25X coverage (> 99% confidence). This permits the selective preference of certain residue pairs over others to be determined. Triplet, quadruplet and higher order interactions are all factorials of each pairwise interaction and the information is captured accordingly. The library generated through synthetic oligonucleotide-based recombination is an excellent unbiased source of data to analyze the relative importance of covariance and its distribution in a biological system.
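The library-size and average-coverage arithmetic in the preceding paragraph can be reproduced with a short sketch. The position counts (46, 5, and 3 positions with two, three, and four alternate amino acids) are taken from the alignment described above; the function and variable names are illustrative assumptions, and the sketch does not attempt to reproduce the stated confidence levels.

```python
from math import prod

# Number of variable positions, keyed by how many alternate amino acids
# each position allows (46 two-way, 5 three-way, 3 four-way positions).
POSITIONS_BY_ALTERNATIVES = {2: 46, 3: 5, 4: 3}


def library_size(positions_by_alternatives):
    """Theoretical number of sequences when every position varies independently."""
    return prod(n_alt ** n_pos for n_alt, n_pos in positions_by_alternatives.items())


def mean_pair_coverage(n_variants, alternatives_i, alternatives_j):
    """Average number of sampled variants per residue-pair combination."""
    return n_variants / (alternatives_i * alternatives_j)


size = library_size(POSITIONS_BY_ALTERNATIVES)
print(f"theoretical library size: {size:.2e}")
print("mean coverage, 4 x 4 pair:", mean_pair_coverage(100, 4, 4))
print("mean coverage, 2 x 2 pair:", mean_pair_coverage(100, 2, 2))
```

The coverage figure is only a mean; how evenly the combinations are actually represented depends on the quality of the synthesized library, which is why the pre-screened variants are sequenced as a baseline.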
[0240] A total of 96 variants derived from synthetic oligonucleotide-based recombination of the 15 parents that had activity corresponding to greater than 65% of the Savinase® parent under standard assay conditions were functionally characterized and the corresponding DNA sequence determined. In addition, 89 variants were isolated prior to the functional screening and the DNA sequence determined. Characterizing the sequence distribution of the pre-screened library allowed the covariation found among the active variants to be normalized to the inherent distribution of covariance in the library. Any spurious artifactual covariation derived from an imperfect library (for example oligonucleotide degeneracy biases produced during synthesis) can thus be eliminated.
[0241] The 15 parents, 89 pre-screened variants, and the 96 active variants are displayed as unrooted phylogenetic trees (Figures 12A, B and C respectively). These trees reveal a very different phylogenetic distribution of parents and progeny. As shown in the figures, there is little or no difference in the diversity distribution between the pre-screened and active variants. In both cases, the variants are evenly distributed, suggesting no significant bias towards diversity originating from any given parent or cluster of parents. The mean difference in the number of amino acid substitutions to the closest variant among the pre-screened sequences is 19 substitutions, compared to the mean distance of 14 substitutions among the active variants. The mean difference is 2.8 among parental sequences, showing that the parental sequences are highly clustered in the sequence space. This suggests that many new regions of sequence space can be explored for functional activity by distributing the characterized variants evenly across the same sequence space covered by the parental genes. Sequence distance traversed using classic directed evolution techniques such as random mutagenesis is usually limited to 1-3 amino acid residues per gene per round. Most of the solutions found through synthetic oligonucleotide-based recombination are consequently inaccessible by random mutagenesis.
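The "mean distance to the closest variant" statistic quoted above can be illustrated with a minimal sketch. It assumes that distance is counted as the number of differing positions (Hamming distance) between aligned sequences of equal length, which the text does not state explicitly; the function names and the toy sequences are hypothetical and are not subtilisin data.

```python
def hamming(a, b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_nearest_neighbor_distance(seqs):
    """Mean, over all sequences, of the distance to the closest other sequence."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    nearest = [
        min(hamming(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    return sum(nearest) / len(nearest)


# Toy aligned sequences (hypothetical, for illustration only).
toy = ["ACDEF", "ACDEY", "ACKEF", "WCKEF"]
print(mean_nearest_neighbor_distance(toy))
```

A small value indicates tightly clustered sequences, as reported for the parents, while larger values indicate variants spread more evenly through sequence space.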
[0242] Covariation between residues inferred from biological sequence data can be attributed to either functional constraints or phylogenetic relationship. Since the historical origin of the sequences is unknown, the covariant nature of residues involved cannot be de-convoluted. The issue has typically been addressed either through collecting as many sequences as possible under a given node in a phylogenetic tree(3), or by computer simulations of possible evolutionary paths(2) using a model for sequence evolution. Both approaches have significant complications and drawbacks. An inherent complication of the first type of covariation analysis is the inclusion of sequences having diverged not only in neutral mutations, but also in function. The divergence can be small, as in evolving to a slightly different pH optimum, or large as in evolving to catalyze a related but different reaction. No single orthologous enzyme pair has truly evolved under the exact same physiological conditions. Including sequences in the covariation analysis that have diverged in function adds noise to the correlations as they are subjected to different selective pressures. Another perhaps more serious concern is the inability to ever gather all sequences under a phylogenetic node to ensure that the distribution in the data set is unbiased due to sampling effects. In a synthetic DNA recombination library all inherent covariation is removed and amino acid diversity occurring in any one position has an equal probability of occurring in any variant. Screening such a library in vitro for a defined biochemical function identifies all covariation derived from functional constraints required for the assayed biological activity of the enzyme. The remainder of the covariation found among the parental genes but not present among the functional progeny is consequently the result of common ancestral origin. 
[0243] The covariation among a set of variants from the library was assessed and visualized by aligning the sequences and removing residues that are conserved throughout the alignment. As described further below, covariation was captured using an information theoretical approach. The mutual information (degree of coupling) between each varying residue pair was plotted in a two-dimensional 54 x 54 matrix. Each row/column represents one of the 54 varying residue positions and each cell in the matrix represents one of the 1458 (54^2/2) residue pairs. A filled cell corresponds to highly covarying residues. The mutual information distribution was normalized to have a mean of 0 and variance of 1. Covariation here is defined as residue pairs with mutual information higher than 2σ away from mean mutual information for that alignment.
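The normalization and 2σ cut-off described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code actually used: the function name, the dictionary layout, and the choice of the population standard deviation are all assumptions, and the toy values are invented.

```python
from statistics import mean, pstdev


def covarying_pairs(mi_by_pair, threshold_sigma=2.0):
    """Return pairs whose normalized mutual information exceeds the threshold.

    mi_by_pair maps (i, j) position pairs to raw mutual information values.
    Values are rescaled to mean 0 and standard deviation 1 before the cut-off.
    """
    values = list(mi_by_pair.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return {}
    z = {pair: (mi - mu) / sigma for pair, mi in mi_by_pair.items()}
    return {pair: score for pair, score in z.items() if score > threshold_sigma}


# Toy data: nine uncoupled pairs and one strongly coupled pair (invented values).
toy_mi = {(i, i + 1): 0.0 for i in range(9)}
toy_mi[(3, 7)] = 1.0
hits = covarying_pairs(toy_mi)
print(sorted(hits))
```

Because each alignment is normalized separately, the threshold is relative to that data set, so the parental, pre-screened, and active sets can each be scored on the same scale.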
[0244] Displaying every residue pair for the parental genes identified all residue pairs that covary (Figure 13A). After making the synthetic library, but before exposing the variants to any selective pressure, 89 variants were isolated. The 89 pre-screened variants were characterized for covariation in the same way as the parental genes, showing that the distribution of the varying residues was uniform, and that all varying residues exist in conjunction with all other varying residues (Figure 13C). The one exception was a perfectly covarying adjacent residue pair that was a consequence of an oligonucleotide not encoding the designed degeneracy. In addition, there were three residue pairs displaying covariation slightly above noise, i.e., more than 2σ away from the average mutual information content. By normalizing mutual information values from the functionally active data set to the values of the pre-screened set, any bias effects derived from an imperfect distribution in the library were accounted for. The normalized covariation captured among the 96 active library variants is shown in Figure 13B. The comparison of Figure 13B with Figure 13A illustrates the difference between all covarying residues in the parental data set and the residues that covary due to functional constraints.
[0245] The 15 parental subtilisin genes shared a sequence identity of 79-99%. Characterization of all varying residues for covariation using mutual information identifies a total of 138 pairs (MI > 2σ) of covarying residues. After synthetic oligonucleotide-based recombination and selection for function, 24 pairs were retained. In addition, 44 new pairs, which were not identified to be covarying based on the parental sequences, were identified. The additional covarying pairs represent amino acid pairs that were only observed in one or two parental sequences and therefore had no significant information content. By systematically varying these residues in the synthetic recombination library, the information content was increased such that these additional covarying pairs could be identified. In the context of this data set, approximately 5% (24 + 44 pairs out of 1458) of all analyzed residue pairs were functionally constrained for covariation. The remaining covarying pairs found among the parental genes and not among the functionally active library variants were derived from common ancestry. They may also reflect a selective pressure for indirect effects on the organism. Indirect effects can be any trait, such as sequestering of cofactors or cellular localization, that is not specifically related to the screening criteria of the in vitro assay.
[0246] The most highly coupled residues (MI = 3.5σ) were in the positions corresponding to residue numbers 128, 131, 135 and 139 in the 1SVN (PDB I.D. number) sequence(7), corresponding to columns 25, 26, 27 and 28 in Figure 13C. In Figure 14, these covarying sites are mapped onto the Savinase® crystal structure(7). The residues are located on the surface of helix 4 and are probably important for maintaining the conformation of the helix to sustain structural rigidity. Another example of strongly covarying residue positions is between residues 154 and 182 (columns 33 and 39 in Figure 13C). Several instances of long-range covariation were also identified in the molecule, e.g., residue pair 114 and 160 (columns 18 and 37), where direct physical interaction between the residues is unlikely due to the distance between the residues.
[0247] The results presented in this example indicate that protein sequence variability (tolerance for changes) is high. If there were a high degree of residue covariation in proteins, evolution would be a much slower process. Proteins have probably evolved to be stabilized by a variety of subtle interactions to reduce functional covariation, thereby allowing faster adaptation. The inherent plasticity of the protein sequence-function relationship demonstrated in this example suggests that evolution relies to a large degree on the additive effect of single mutations, and that there are many independent sequence solutions to a given functional constraint. This is supported by theoretical models(8). The sequence space where the protein folds properly and has function is very large and highly degenerate. By recombining naturally occurring diversity, and allowing each residue to vary independently, one can open up a highly functional space for sequence evolution. Proteins are not optimized for the corresponding enzymatic activity; they are just good enough to not be the rate-limiting step in the survival of the organism. The fact that proteins are not optimized ensures that searching a highly functional space for improved traits is very productive.
[0248] It has been argued that detailed phylogenetic information and historical models (palaeogenomics) are necessary for understanding the behavior of biomolecules(9). The results presented in this example show that a heuristic approach using systematically varied residues is an alternate and preferred path for acquiring protein sequence-function relationships(10).
[0249] Protein structure prediction, identification of functional determinants and an understanding of the dynamics of the evolution of protein sequences all rely on analytical tools for detecting and validating functional covariation. With the advent of systematically varied data sets such as presented in this example, the appropriate tools can be used to quantitatively analyze sequence variability and functional constraints in proteins and effectively use this knowledge in protein design.
Experimental Methods
High Crossover Recombination
[0250] DNA oligonucleotides were assembled by PCR to generate a library that incorporated all the natural amino acid diversity represented in the 15 subtilisins. The amino acid diversity was captured in a series of 22 oligonucleotides using nucleotide degeneracies and overlapping oligonucleotides to encode alternative amino acids. Each of the backbone oligonucleotides was 60 nucleotides in length. Overlapping forward and reverse oligonucleotides share 20 bp of end complementarity, which is typically necessary for assembly of the 660 bp recombined product. All oligonucleotides were designed to recombine diversity at the level of single codons using Bacillus subtilis codon usage. Oligonucleotides were purchased from Operon Technologies (Alameda, CA) and assembled using the method of Stemmer et al.(12). The 660 bp recombined product was amplified from the assembly reaction using the outermost forward and reverse backbone oligonucleotides and cloned by PCR multimerization(13) into a Savinase® expression vector. The multimerization reaction was transformed into a Bacillus subtilis 168 apr npr strain via natural competence and the resulting chimeric subtilisins were expressed and screened for activity.
Functional Screening
[0251] Screening was performed in two tiers(6). For the first tier, transformants were picked with a QBot robotic colony picker (Genetix, Hampshire, UK) and grown in 384-well microtiter plates in Luria-Bertani (LB) broth. The microcultures were gridded on LB agar trays containing 2% skim milk (2,304 clones per 23 x 23 cm tray). Transformants secreting an active subtilisin formed clearing zones as a result of hydrolysis of casein from the skim milk. Protease-positive clones were picked and grown in 96-well microtiter plates containing LB broth in preparation for the second tier assay for determining activity relative to Savinase®. For this, culture supernatants were diluted 100-fold into a reaction mixture containing 50 mM sodium borate (pH 10), 1 mM CaCl2, and 5 μg/ml BODIPY FL casein (Molecular Probes, Eugene, OR). The CV (%) observed for independent determinations with Savinase® was < 17%. A total of 179 unique clones were characterized for activity and the corresponding DNA sequence determined. The 179 clones were separated into 96 high activity clones and 83 low activity clones using a functional cut-off corresponding to 65% of the activity of the Savinase® parent. In addition, a random set of 89 clones was isolated and sequenced prior to any functional screening. The DNA sequences of these 89 clones were used to establish a baseline of amino acid distribution at all positions in the library.
Mutual Information
[0252] In a protein alignment, the entropy measure for each position in the alignment indicates the degree of variability and preference for each amino acid. The following equation was used to quantify site entropy(11).
$$I_i = \sum_{k} P(A_i^k) \log P(A_i^k) \qquad (1)$$
The sum is over all k amino acids $\{A_i^k\}$ occurring at position i in the alignment. $P(A_i^k)$ is the probability of amino acid k at position i. Likewise, covariance between amino acids can be measured by using the mutual information content between pairs of sites.
$$MI_{ij} = \sum_{k} \sum_{l} P(A_i^k \text{ and } A_j^l) \log \frac{P(A_i^k \text{ and } A_j^l)}{P(A_i^k)\, P(A_j^l)} \qquad (2)$$
The double summation is over all possible pairs of amino acids $\{A_i^k\}$ and $\{A_j^l\}$ at positions i and j, respectively. $P(A_i^k)$ is the probability of amino acid k at position i and $P(A_i^k \text{ and } A_j^l)$ is the combined probability of amino acid k at position i and amino acid l at position j.
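As an illustration of equation (2), the mutual information between two alignment columns can be estimated from observed residue frequencies. The sketch below is a minimal example under stated assumptions: probabilities are taken as simple counts divided by the number of sequences, the natural logarithm is used (the text does not specify a base), and the function name and toy columns are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(column_i, column_j):
    """Mutual information between two alignment columns (natural log).

    column_i and column_j are equal-length strings holding the residue
    found at positions i and j in each aligned sequence.
    """
    if len(column_i) != len(column_j) or not column_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(column_i)
    count_i = Counter(column_i)
    count_j = Counter(column_j)
    count_ij = Counter(zip(column_i, column_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n  # joint probability of residue a at i and residue b at j
        mi += p_ab * log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Toy columns: the first pair is perfectly coupled, the second is independent.
print(mutual_information("AABB", "CCDD"))
print(mutual_information("AABB", "CDCD"))
```

Only residue combinations that actually occur contribute to the sum, which matches the convention that a term with zero joint probability is taken as zero.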
[0253] The MI values were normalized for each group of clones to have the same mean of 0.0 and standard deviation of 1.0. The degree of covariation among any residue pair is identified by the deviation of the MI for the given pair from the expected mutual information content.
References
[0254]
1. D. D. Pollock, W. R. Taylor, N. Goldman, J Mol Biol 287, 187-98 (1999).
2. K. R. Wollenberg, W. R. Atchley, Proc. Nat'l Acad. Sci 97, 3288-91 (2000).
3. W. R. Atchley, K. R. Wollenberg, W. M. Fitch, W. Terhalle, A. W. Dress, Mol Biol Evol 17, 164-78 (2000).
4. E. A. Gaucher, M. M. Miyamoto, S. A. Benner, Proc. Nat'l Acad. Sci 98, 548-552 (2001); S. M. Larson, A. A. Di Nardo, A. R. Davidson, J Mol Biol 303, 433-46 (2000).
5. J. E. Ness et al., Nature Biotech Submitted (2001).
6. J. E. Ness et al., Nat Biotechnol 17, 893-6 (1999).
7. C. Betzel et al., J Mol Biol 223, 427-45 (1992).
8. D. M. Taverna, R. A. Goldstein, J Mol Biol 315, 479-84 (2002).
9. S. A. Benner, Nature 409, 459 (2001).
10. C. Gustafsson, S. Govindarajan, R. Emig, J Mol Recognit 14, 308-14 (2001).
11. C. E. Shannon, MD Comput 14, 306-17 (1997).
12. Stemmer et al., Gene 164, 49-53 (1995).
13. Shafikhani et al., Biotechniques 23(2), 304-10 (1997).
[0255] While the foregoing invention has been described in some detail for purposes of clarity and understanding, it will be clear to one skilled in the art from a reading of this disclosure that various changes in form and detail can be made without departing from the true scope of the invention. For example, all the techniques and apparatus described above may be used in various combinations. All publications, patents, patent applications, or other documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication, patent, patent application, or other document were individually indicated to be incorporated by reference for all purposes.

Claims

WHAT IS CLAIMED IS:
1. A method of designing oligonucleotides for regulated recombination, the method comprising: providing a population of overlapping oligonucleotide character strings; selecting at least one pair of overlapping oligonucleotide character strings from the population of overlapping oligonucleotide character strings; and, changing at least one character in at least one overlap region of the at least one selected pair of overlapping oligonucleotide character strings to adjust a probability of hybridization of oligonucleotides corresponding in sequence to the selected pair of overlapping oligonucleotide character strings, thereby designing oligonucleotides for regulated recombination.
2. The method of claim 1, wherein the selected pair of overlapping oligonucleotide character strings correspond in sequence to subsequences from at least two different phylogenetic families of polynucleotides.
3. The method of claim 1, wherein each overlapping oligonucleotide character string comprises between about 20 and about 60 nucleotides.
4. The method of claim 1, wherein the at least one overlap region of the selected pair of overlapping oligonucleotide character strings comprises between about 15 and about 25 nucleotides.
5. The method of claim 1, comprising changing one or more overlap regions of a plurality of pairs of overlapping oligonucleotide character strings.
6. The method of claim 1, the adjusting step comprising changing at least one nucleotide in one or more portions of the at least one overlap region such that oligonucleotide overlap is increased or decreased, thereby increasing or decreasing a probability of hybridization for the at least one overlap region at a selected temperature.
7. The method of claim 1, wherein the changing step is performed in a logic device.
8. The method of claim 1, wherein the changing step is performed manually.
9. The method of claim 1, further comprising graphically displaying at least one member of the population of overlapping oligonucleotide character strings.
10. The method of claim 1, further comprising: providing a population of nucleic acids that comprises one or more pairs of designed oligonucleotides; and, recombining the population of nucleic acids with a polymerase or a ligase, or both a polymerase and a ligase, to provide a population of recombined nucleic acids.
11. The method of claim 10, wherein the providing step comprises synthesizing a set of single-stranded oligonucleotides.
12. The population of recombined nucleic acids made by the method of claim 10.
13. The method of claim 10, wherein the method regulates recombination by maintaining genetic linkages within sequence families over distances greater than a length of any individual member of the population of nucleic acids.
14. The method of claim 10, further comprising selecting or screening an encoded polypeptide of at least one member of the population of recombined nucleic acids for at least one desired trait or property.
15. The method of claim 10, further comprising selecting at least one member of the population of the recombined nucleic acids, which member comprises the at least one desired genetic linkage.
16. The method of claim 15, wherein the selecting step comprises amplifying the at least one member using one or more nucleic acid primers that correspond to at least a portion of the at least one desired genetic linkage.
17. The method of claim 15, wherein the selecting step comprises affinity purifying the at least one member using one or more nucleic acid sequences that correspond to at least a portion of the at least one desired genetic linkage.
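As an illustration of the changing step recited in claims 1 and 6, the following is a minimal sketch and not the claimed implementation. It assumes both oligonucleotide character strings are written on the same strand, so the overlap is the 3' end of the upstream string aligned against the 5' end of the downstream string. The fraction of matching characters in the overlap is used as a crude stand-in for a probability of hybridization; no thermodynamic model is applied. The function names `overlap_identity` and `change_character` are hypothetical.

```python
def overlap_identity(upstream, downstream, overlap_len):
    """Fraction of matching characters between the last `overlap_len`
    characters of `upstream` and the first `overlap_len` of `downstream`.
    Used here only as a rough proxy for hybridization probability."""
    a = upstream[-overlap_len:]
    b = downstream[:overlap_len]
    return sum(x == y for x, y in zip(a, b)) / overlap_len


def change_character(oligo, position, base):
    """Return a copy of `oligo` with the character at `position` replaced."""
    return oligo[:position] + base + oligo[position + 1:]


# Toy pair of overlapping character strings (illustrative sequences only).
upstream = "ACGTACGTGGCCAATT"
downstream = "GGCCTATTCCGGAA"

before = overlap_identity(upstream, downstream, 8)

# Increase overlap: repair the single mismatch at overlap position 4.
more_overlap = change_character(downstream, 4, "A")
# Decrease overlap: introduce a second mismatch near the end of the overlap.
less_overlap = change_character(downstream, 7, "A")

print(before,
      overlap_identity(upstream, more_overlap, 8),
      overlap_identity(upstream, less_overlap, 8))
```

A real design tool would replace the identity fraction with a duplex stability estimate at the selected temperature and would restrict changes to those that preserve the encoded amino acids.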
18. A method of designing oligonucleotides for regulated recombination, the method comprising: providing at least two parental polypeptide character strings, which character strings, when aligned for maximum identity, comprise at least one amino acid difference; providing at least one desired amino acid linkage; reverse-translating the at least two parental polypeptide character strings into at least two parental polynucleotide character strings; segmenting each of the at least two parental polynucleotide character strings into at least two overlapping oligonucleotide character strings; and, adjusting at least one overlap region of at least one pair of overlapping oligonucleotide character strings such that recombination is biased towards the at least one desired amino acid linkage to provide a selected population comprising adjusted overlapping oligonucleotide character strings.
19. The method of claim 18, wherein the at least two parental polypeptide character strings, when aligned for maximum identity, comprise at least one region of amino acid sequence similarity.
20. The method of claim 18, wherein at least one of the at least two parental polypeptide character strings corresponds to a full-length protein.
21. The method of claim 18, wherein at least one step of the method is performed in a digital system.
22. The method of claim 18, wherein at least one step of the method is performed manually.
23. The method of claim 18, comprising deriving the at least one desired amino acid linkage from at least one portion of at least one of the at least two parental polypeptide character strings.
24. The method of claim 18, comprising reverse-translating at least one of the at least two parental polypeptide character strings according to a species codon-bias of a selected expression host.
25. The method of claim 18, wherein a maximum or a minimum length of at least one of the overlapping oligonucleotide character strings is automatically set.
26. The method of claim 18, wherein a maximum or a minimum length of at least one of the overlapping oligonucleotide character strings is manually set.
27. The method of claim 18, wherein each overlapping oligonucleotide character string comprises between about 20 and about 60 nucleotides.
28. The method of claim 18, wherein each overlapping oligonucleotide character string comprises an identical number of nucleotides.
29. The method of claim 18, wherein each overlapping oligonucleotide character string comprises a different number of nucleotides.
30. The method of claim 18, wherein the at least one overlap region of the at least one pair of overlapping oligonucleotide character strings comprises between about 15 and about 25 nucleotides.
31. The method of claim 18, comprising adjusting the at least one overlap region of a plurality of pairs of overlapping oligonucleotide character strings.
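The reverse-translating and segmenting steps of claim 18, together with the length limits of claims 27 and 30, might be sketched as follows. This is an assumption-laden illustration: the codon table `PREFERRED_CODON` is a small placeholder covering only eight amino acids and does not represent any particular host's measured codon usage, all strings are kept on one strand, and the function names are hypothetical.

```python
# Placeholder codon preferences (one valid codon per amino acid; illustrative only).
PREFERRED_CODON = {
    "M": "ATG", "K": "AAA", "L": "CTG", "A": "GCG",
    "G": "GGC", "S": "AGC", "E": "GAA", "T": "ACC",
}


def reverse_translate(protein, codon_table=PREFERRED_CODON):
    """Map each amino acid character to a single codon from `codon_table`."""
    return "".join(codon_table[aa] for aa in protein)


def segment(sequence, oligo_len=40, overlap=20):
    """Cut `sequence` into strings of at most `oligo_len` characters in which
    each consecutive pair shares `overlap` characters."""
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than oligo_len")
    step = oligo_len - overlap
    oligos = []
    start = 0
    while True:
        oligos.append(sequence[start:start + oligo_len])
        if start + oligo_len >= len(sequence):
            break
        start += step
    return oligos


parental_polynucleotide = reverse_translate("MKLAGSETMKLAGSET")
oligos = segment(parental_polynucleotide, oligo_len=20, overlap=10)
for oligo in oligos:
    print(oligo)
```

The final string may be shorter than `oligo_len`; a production tool would also alternate strands and pad or merge short terminal oligonucleotides.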
32. The method of claim 18, wherein at least two of the at least two parental polypeptide character strings are orthologs or paralogs.
33. The method of claim 18, further comprising displaying at least one member of the selected population comprising the adjusted overlapping oligonucleotide character strings graphically.
34. The method of claim 18, further comprising determining one or more sequence of one or more recombinant nucleic acids resulting from in silico recombination of the selected population comprising the adjusted overlapping oligonucleotide character strings, and performing one or more in silico simulations of activity for the one or more recombinant nucleic acids or one or more expression products therefrom.
35. The method of claim 18, wherein at least two of the at least two parental polypeptide character strings are members of different phylogenetic families.
36. The method of claim 35, further comprising defining the phylogenetic family computationally.
37. The method of claim 18, further comprising performing each step in a digital system.
38. The method of claim 37, wherein the at least two parental polypeptide character strings are provided by inputting the at least two parental polypeptide character strings into the digital system.
39. The method of claim 37, wherein the at least one desired amino acid linkage is provided by inputting the at least one desired amino acid linkage into the digital system.
40. The method of claim 18, wherein the at least one desired amino acid linkage is provided by using at least one probabilistic technique to select the at least one desired amino acid linkage or manually selecting the at least one desired amino acid linkage.
41. The method of claim 40, wherein the at least one probabilistic technique comprises a Markov chain modeling method.
42. The method of claim 40, wherein the at least one desired amino acid linkage limits a size of the selected population comprising adjusted overlapping oligonucleotide character strings.
43. The method of claim 40, wherein one or more members of the selected population comprising adjusted overlapping oligonucleotide character strings capture amino acid covariation.
44. The method of claim 18, wherein the second providing step comprises: (a) aligning the at least two parental polypeptide character strings for maximum identity to produce a parental polypeptide character string profile;
(b) identifying allowed sequence paths through the parental polypeptide character string profile; and, (c) selecting the at least one desired amino acid linkage from the allowed sequence paths identified in (b).
45. The method of claim 44, wherein (b) comprises quantifying a site-entropy for one or more amino acid sites in the parental polypeptide character string profile to identify the allowed sequence paths.
46. The method of claim 44, wherein (b) comprises quantifying a mutual information content between pairs of amino acid sites in the parental polypeptide character string profile to identify the allowed sequence paths.
47. The method of claim 44, wherein the allowed sequence paths comprise Markov chains.
48. The method of claim 44, wherein the allowed sequence paths identified in (b) limit a size of the selected population comprising adjusted overlapping oligonucleotide character strings.
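The site-entropy and mutual-information quantities named in claims 45 and 46 can be sketched with the standard Shannon definitions. The code below is a hedged illustration that assumes the parental polypeptide character string profile is a gap-free alignment of equal-length strings; the toy alignment and the function names are hypothetical, and the claims do not specify how the resulting values are turned into allowed sequence paths.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (bits) of the residues observed at one aligned site."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def mutual_information(col_i, col_j):
    """Mutual information (bits) between the residues at two aligned sites."""
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


# Toy aligned parental strings: sites 0 and 2 vary together, sites 1 and 3 are fixed.
alignment = ["ALKV", "ALKV", "GLEV", "GLEV"]
columns = list(zip(*alignment))

entropies = [site_entropy(col) for col in columns]
mi_0_2 = mutual_information(columns[0], columns[2])
mi_0_1 = mutual_information(columns[0], columns[1])
print(entropies, mi_0_2, mi_0_1)
```

In this toy profile the two varying sites each carry one bit of entropy and share one bit of mutual information, which is the kind of signal that could mark a linkage worth preserving.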
49. The method of claim 18, wherein one or more members of the selected population comprising adjusted overlapping oligonucleotide character strings capture amino acid covariation.
50. The method of claim 49, wherein at least some of the amino acid covariation is artificially defined.
51. The method of claim 49, wherein the amino acid covariation corresponds to a structural or functional domain.
52. The method of claim 49, wherein the amino acid covariation corresponds to a phylogenetic motif.
53. The method of claim 49, wherein the amino acid covariation captured by a member of the selected population comprising adjusted overlapping oligonucleotide character strings corresponds to from about two to about 20 amino acids.
54. The method of claim 18, the second providing step comprising: (a) providing an X predictor matrix comprising a data set corresponding to the at least two parental polypeptide character strings, wherein a physicochemical property or functional activity is known for at least one of the at least two parental polypeptide character strings; (b) calculating one or more cross product terms between or among columns of the X predictor matrix, wherein each column entry corresponds to an amino acid of a parental polypeptide character string from the at least two parental polypeptide character strings;
(c) adding at least one of the one or more cross product terms calculated in step (b) to one or more linear terms of the X predictor matrix to produce an expanded X predictor matrix; and,
(d) generating a model with the expanded X predictor matrix to identify important cross product terms and/or linear terms to identify amino acids in the at least two parental polypeptide character strings that are important for a polypeptide sequence-activity relationship, thereby providing the at least one desired amino acid linkage.
55. The method of claim 54, wherein the model is produced using one or more regression-based algorithms selected from the group consisting of: a partial least squares regression, a multiple linear regression, an inverse least squares regression, a principal component regression, and a variable importance for projection.
56. The method of claim 54, wherein the model is produced using one or more pattern-based algorithm selected from the group consisting of: a neural network, a classification and regression tree, and a multivariate adaptive regression spline.
57. The method of claim 54, wherein the cross product terms identify covarying amino acids in the at least two parental polypeptide character strings.
58. The method of claim 54, wherein the linear terms correspond to amino acids in the at least two parental polypeptide character strings.
59. The method of claim 54, wherein two or more linear terms individually comprise unimportant terms for the polypeptide sequence-activity relationship and wherein cross product terms calculated from the two or more linear terms are identified as important for the polypeptide sequence-activity relationship.
60. The method of claim 54, wherein the at least two parental polypeptide character strings comprise a set of systematically varied polypeptide character strings.
61. The method of claim 54, wherein the at least two parental polypeptide character strings are produced by one or more artificial evolution procedures.
62. The method of claim 61, wherein at least one of the one or more artificial evolution procedures is performed in silico.
63. The method of claim 54, wherein the cross product terms correspond to interactions between or among amino acids in the at least two parental polypeptide character strings.
64. The method of claim 63, wherein the interactions comprise structural or functional interactions.
65. The method of claim 63, wherein the interactions comprise secondary or tertiary interactions.
66. The method of claim 63, wherein the interactions comprise physicochemical interactions.
67. The method of claim 63, wherein the interactions comprise direct or indirect interactions.
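Claims 54 and 59 describe expanding an X predictor matrix with cross product terms so that interactions become visible even when the individual linear terms look unimportant. The sketch below illustrates that idea under several assumptions: residues are encoded as +1 or -1 relative to a hypothetical reference string, the activity values are invented for the example, and a plain Pearson correlation is substituted for the regression-based models named in claim 55.

```python
import math
from itertools import combinations


def encode(variants, reference):
    """+1 where a variant matches the reference residue at a site, else -1."""
    return [[1 if v == r else -1 for v, r in zip(variant, reference)]
            for variant in variants]


def expand_with_cross_products(X):
    """Append the product of every pair of columns to each row of X."""
    pairs = list(combinations(range(len(X[0])), 2))
    expanded = [row + [row[i] * row[j] for i, j in pairs] for row in X]
    labels = [f"site{i}" for i in range(len(X[0]))]
    labels += [f"site{i}*site{j}" for i, j in pairs]
    return expanded, labels


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


# Invented data: activity is high only when both sites come from the same parent.
variants = ["KE", "KD", "RE", "RD"]
activity = [10.0, 2.0, 2.0, 10.0]

X, labels = expand_with_cross_products(encode(variants, reference="KE"))
scores = {label: pearson([row[k] for row in X], activity)
          for k, label in enumerate(labels)}
print(scores)
```

Here neither linear term correlates with activity while the cross product term correlates perfectly, which is the situation claim 59 describes and which would suggest keeping the two residues linked.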
68. The method of claim 18, the adjusting step comprising changing at least one nucleotide in one or more portions of the at least one overlap region such that oligonucleotide overlap is increased or decreased, thereby increasing or decreasing a probability of hybridization for the at least one overlap region at a selected temperature.
69. The method of claim 68, wherein the one or more portions are disposed proximal to an end of an overlapping oligonucleotide character string in the at least one overlap region.
70. The method of claim 18, the adjusting step comprising: determining an annealing frequency for the at least one overlap region; and, changing at least one nucleotide in one or more portions of the at least one overlap region such that the annealing frequency of the at least one overlap region is substantially proportional to the at least one desired amino acid linkage.
71. The method of claim 70, wherein the annealing frequency comprises a percentage of the at least one pair of overlapping oligonucleotide character strings that anneals at a selected temperature.
72. The method of claim 70, wherein the at least one changed nucleotide increases or decreases oligonucleotide overlap in the at least one overlap region, thereby increasing or decreasing a probability of hybridization for the at least one overlap region at a selected temperature.
73. The method of claim 70, wherein the one or more portions are disposed proximal to an end of an overlapping oligonucleotide character string in the at least one overlap region.
74. The method of claim 18, further comprising: providing a population of nucleic acids that corresponds to the selected population comprising the adjusted overlapping oligonucleotide character strings; and, recombining the population of nucleic acids, thereby providing a population of recombined nucleic acids.
75. The population of recombined nucleic acids made by the method of claim 74.
76. The method of claim 74, the providing step comprising synthesizing a set of single-stranded oligonucleotides that corresponds to the selected population comprising the adjusted overlapping oligonucleotide character strings.
77. The method of claim 74, wherein at least one member of the population of recombined nucleic acids encodes a full-length protein.
78. The method of claim 74, wherein the method regulates recombination by maintaining genetic linkages within sequence families over distances greater than a length of any individual member of the population of nucleic acids.
79. The method of claim 74, wherein the method approximates fragmentation-based recombination linkage characteristics.
80. The method of claim 74, further comprising sequencing or cloning one or more members of the population of recombined nucleic acids.
81. The method of claim 74, further comprising deconvoluting one or more members of the population of recombined nucleic acids.
82. The method of claim 74, further comprising expressing the population of recombined nucleic acids to provide at least one recombined polypeptide product.
83. The at least one recombined polypeptide product made by the method of claim 82.
84. The method of claim 82, further comprising selecting or screening the at least one recombined polypeptide product for at least one desired trait or property.
85. The method of claim 84, wherein the at least one desired trait or property is selected or screened for in an assay selected from the group consisting of: an in vivo selection assay, a parallel solid phase assay, and an in vitro selection assay.
86. The method of claim 74, wherein the recombining step comprises: annealing one or more members of the population of nucleic acids to one or more other members of the population of nucleic acids to provide at least one annealed nucleic acid, and, elongating or ligating, or both elongating and ligating, the at least one annealed nucleic acid to provide the population of recombined nucleic acids.
87. The method of claim 86, further comprising fragmenting the population of recombined nucleic acids to provide fragmented nucleic acids; denaturing the fragmented nucleic acids to provide denatured nucleic acids; hybridizing the denatured nucleic acids to provide hybridized nucleic acids; and, elongating or ligating, or both elongating and ligating, the hybridized nucleic acids to provide a population of further recombined nucleic acids.
88. The method of claim 87, wherein the population of recombined nucleic acids is chemically or enzymatically fragmented.
89. The method of claim 74, further comprising introducing one or more members of the population of recombined nucleic acids into at least one cell, wherein the one or more introduced members are expressed to provide at least one recombined polypeptide product to the at least one cell.
90. The at least one cell made by the method of claim 89.
91. A method of characterizing covariation in a population of homologous polypeptides, the method comprising: identifying covarying amino acid residues in a character string population that represents homologous parental polypeptides to produce a first covariation data set; recombining unlinked nucleic acids comprising the covarying amino acid residues to produce a set of recombinants that encode variants of the parental polypeptides; selecting or screening for encoded activity of at least a subset of the recombinants to produce a set of screened recombinants; identifying covarying residues in the set of screened recombinants to produce a second covariation data set; and, identifying differences between the first and second covariation data sets, thereby characterizing the covariation in the population of homologous polypeptides.
92. The method of claim 91, wherein the covarying residues are identified by applying one or more heuristically-derived analytical techniques to the character string population and/or to the set of screened recombinants.
93. The method of claim 91, wherein 3, 4, 5, 6, 7, 8, 9, 10, or more amino acid residues in the character string population are identified to covary with one another.
94. The method of claim 91, wherein the homologous parental polypeptides comprise artificially evolved polypeptides.
95. The method of claim 91, wherein the homologous parental polypeptides comprise systematically varied amino acid sequences.
96. The method of claim 91, wherein a phylogenetic family comprises the homologous polypeptides.
97. The method of claim 91, wherein the unlinked nucleic acids comprise overlapping synthetic oligonucleotides.
98. The method of claim 91, wherein the first and/or second covariation data sets are produced by analysis of mutual information.
99. The method of claim 91, wherein covariation present in the first covariation data set and absent in the second covariation data set provides a measure of ancestral covariation present in the population of homologous polypeptides.
100. The method of claim 91, further comprising normalizing the first covariation data set prior to the third identifying step.
101. The method of claim 91, wherein the covarying residues are identified by applying one or more probabilistic techniques to the character string population and/or to the set of screened recombinants.
102. The method of claim 101, wherein the one or more probabilistic techniques comprises Markov chain modeling.
103. The method of claim 91, wherein covariation present in both the first and second covariation data sets provides a measure of functional covariation present in the population of homologous polypeptides.
104. The method of claim 103, further comprising mutating residues that functionally covary in one or more nucleic acids that encode the residues.
105. The method of claim 103, further comprising designing oligonucleotides for recombination and/or mutagenesis that encode functionally covarying residues.
106. The method of claim 105, wherein the mutagenesis comprises cassette mutagenesis or site-directed mutagenesis.
107. The method of claim 91, further comprising generating a statistical model with the covariation characterized in the population of homologous polypeptides.
108. The method of claim 107, wherein the statistical model is produced using one or more regression-based algorithms selected from the group consisting of: a partial least squares regression, a multiple linear regression, an inverse least squares regression, a principal component regression, and a variable importance for projection.
109. The method of claim 107, wherein the statistical model is produced using at least one probabilistic technique.
110. The method of claim 109, wherein the at least one probabilistic technique comprises Markov chain modeling.
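The comparison of covariation data sets in claims 91, 99, and 103 can be illustrated as a set operation. The following sketch assumes covariation is scored by mutual information (as permitted by claim 98) with an arbitrary threshold of 0.5 bits; the toy alignments, the threshold, and the function names are hypothetical. Pairs found in both data sets are reported as functional covariation and pairs found only in the first as ancestral covariation, following the wording of claims 103 and 99.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two aligned sites."""
    n = len(col_i)
    p_i, p_j, p_ij = Counter(col_i), Counter(col_j), Counter(zip(col_i, col_j))
    return sum((c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())


def covarying_pairs(alignment, threshold=0.5):
    """Set of site pairs whose mutual information exceeds `threshold`."""
    cols = list(zip(*alignment))
    return {(i, j) for i, j in combinations(range(len(cols)), 2)
            if mutual_information(cols[i], cols[j]) > threshold}


def characterize(first, second):
    """Split the first covariation data set by presence in the second."""
    return {"functional": first & second, "ancestral": first - second}


# Toy data: parents show covariation at every site pair; after recombination
# and screening only sites 0 and 2 remain linked.
parents = ["ALKV", "ALKV", "GIEF", "GIEF"]
screened = ["ALKV", "AIKF", "GLEF", "GIEV"]

first_set = covarying_pairs(parents)
second_set = covarying_pairs(screened)
result = characterize(first_set, second_set)
print(sorted(result["functional"]), sorted(result["ancestral"]))
```

With real data the sets would typically be normalized first (claim 100) and the threshold chosen with attention to sample size, since mutual information is biased upward for small alignments.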
111. A method of characterizing covariation in a population of homologous polypeptides, the method comprising: identifying varying amino acid residues in a character string population that represents homologous parental polypeptides; identifying amino acid residues in the character string population that covary with one another to produce a parental covariation data set; providing a set of overlapping synthetic oligonucleotides comprising members that encode one or more varying amino acids identified in the character string population; recombining the overlapping synthetic oligonucleotides to produce a set of recombined polynucleotides that encode progeny of the homologous parental polypeptides; expressing at least a subset of the set of recombined polynucleotides to produce a set of progeny polypeptides; selecting or screening at least a subset of the progeny polypeptides for a desired property; sequencing one or more progeny polypeptides, or one or more recombined polynucleotides that encode the one or more progeny polypeptides, that comprise the desired property to produce a progeny sequence data set; identifying at least pairs of amino acid residues in the progeny sequence data set that covary with one another to produce a progeny covariation data set; and identifying differences between the parental and progeny covariation data sets, thereby characterizing the covariation in the population of homologous polypeptides.
112. A computer implemented method of maintaining genetic linkages within sequence families over distances greater than a length of a single oligonucleotide during synthetic recombination, the method comprising: inputting at least one amino acid sequence character string into the computer; determining at least one desired amino acid linkage; reverse-translating the at least one amino acid sequence character string into at least one corresponding nucleic acid sequence character string; segmenting the at least one corresponding nucleic acid sequence character string into at least two overlapping oligonucleotide character strings; inputting at least one annealing temperature for at least one assembly reaction; and, adjusting overlap between the at least two overlapping oligonucleotide character strings such that recombination is biased towards the at least one desired amino acid linkage at the at least one annealing temperature, thereby providing at least two adjusted overlapping oligonucleotide character strings.
113. The computer implemented method of claim 112, the adjusting step comprising changing at least one nucleotide in at least one overlap region such that oligonucleotide character string overlap is increased or decreased, thereby increasing or decreasing a probability of annealing for the at least one overlap region at the at least one annealing temperature.
114. The computer implemented method of claim 112, wherein the determining step comprises calculating the at least one desired amino acid linkage using at least one statistical technique.
115. The computer implemented method of claim 112, the method further comprising: providing a population of overlapping oligonucleotides that corresponds to the at least two adjusted overlapping oligonucleotide character strings; and, recombining the population of overlapping oligonucleotides with a polymerase or a ligase, or both the polymerase and the ligase, to provide a population of recombined nucleic acids.
116. The computer implemented method of claim 115, wherein the providing step comprises synthesizing a set of single-stranded oligonucleotides that corresponds to the at least two adjusted overlapping oligonucleotide character strings.
117. A system, comprising: at least one logic device; and, at least one computer readable medium operably connected to the at least one logic device that stores at least one computer program for designing oligonucleotides for regulated recombination, the at least one computer program comprising: at least one logic instruction which directs the at least one logic device to receive one or more inputted parental polypeptide character strings; at least one logic instruction which directs the at least one logic device to receive or determine a desired amino acid linkage; at least one logic instruction which directs the at least one logic device to reverse-translate the one or more inputted parental polypeptide character strings into one or more parental polynucleotide character strings; at least one logic instruction which directs the at least one logic device to segment the one or more parental polynucleotide character strings into two or more overlapping oligonucleotide character strings; at least one logic instruction which directs the at least one logic device to determine an annealing frequency for one or more pairs of overlapping oligonucleotide character strings at a selected temperature; and, at least one logic instruction which directs the at least one logic device to change at least one nucleotide in one or more portions of one or more overlap regions of one or more pairs of overlapping oligonucleotide character strings such that the annealing frequency for the one or more pairs of overlapping oligonucleotide character strings is substantially proportional to the desired amino acid linkage to provide a selected population comprising adjusted overlapping oligonucleotide character strings; or, at least one logic instruction which directs the at least one logic device to receive one or more inputted changes to at least one nucleotide in one or more portions of one or more overlap regions of one or more pairs of overlapping oligonucleotide character strings to provide a selected population comprising adjusted overlapping oligonucleotide character strings.
118. The system of claim 117, the system further comprising an oligonucleotide synthesis device operably connected to the at least one logic device for automatically synthesizing one or more overlapping oligonucleotides corresponding to one or more members of the selected population comprising adjusted overlapping oligonucleotide character strings.
119. The system of claim 118, the system further comprising a nucleic acid amplification device operably connected to the at least one logic device for producing one or more recombinant nucleic acids from the one or more synthesized overlapping oligonucleotides.
120. A computer program product comprising a computer readable medium having at least one computer program for designing oligonucleotides for regulated recombination, the at least one computer program comprising: at least one logic instruction which directs a logic device to receive one or more inputted parental polypeptide character strings; at least one logic instruction which directs the logic device to receive or determine a desired amino acid linkage; at least one logic instruction which directs the logic device to reverse-translate the one or more inputted parental polypeptide character strings into one or more parental polynucleotide character strings; at least one logic instruction which directs the logic device to segment the one or more parental polynucleotide character strings into two or more overlapping oligonucleotide character strings; at least one logic instruction which directs the logic device to determine an annealing frequency for one or more pairs of overlapping oligonucleotide character strings at a selected temperature; and, at least one logic instruction which directs the logic device to change at least one nucleotide in one or more portions of one or more overlap regions of one or more pairs of overlapping oligonucleotide character strings such that the annealing frequency for the one or more pairs of overlapping oligonucleotide character strings is substantially proportional to the desired amino acid linkage to provide a selected population comprising adjusted overlapping oligonucleotide character strings; or, at least one logic instruction which directs the logic device to receive one or more inputted changes to at least one nucleotide in one or more portions of one or more overlap regions of one or more pairs of overlapping oligonucleotide character strings to provide a selected population comprising adjusted overlapping oligonucleotide character strings.
121. The computer program product of claim 120, wherein the computer readable medium comprises one or more of: a CD-ROM, a floppy disk, a tape, a flash memory device or component, a system memory device or component, a hard drive, or a data signal embodied in a carrier wave.
PCT/US2002/014866 2001-05-09 2002-05-09 Methods, systems, and software for regulated oligonucleotide-mediated recombination WO2008127213A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002368549A AU2002368549A1 (en) 2001-05-09 2002-05-09 Methods, systems, and software for regulated oligonucleotide-mediated recombination

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US28994701P 2001-05-09 2001-05-09
US60/289,947 2001-05-09

Publications (2)

Publication Number Publication Date
WO2008127213A2 (en) 2008-10-23
WO2008127213A3 (en) 2008-12-11

Family

ID=39864489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/014866 WO2008127213A2 (en) 2001-05-09 2002-05-09 Methods, systems, and software for regulated oligonucleotide-mediated recombination

Country Status (2)

Country Link
AU (1) AU2002368549A1 (en)
WO (1) WO2008127213A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000042560A2 (en) * 1999-01-19 2000-07-20 Maxygen, Inc. Methods for making character strings, polynucleotides and polypeptides

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHA ET AL.: 'Assembly of designed oligonucleotides as an efficient method for gene recombination: a new tool in directed evolution' CHEMBIOCHEM.: A EUROPEAN JOURNAL OF CHEMICAL BIOLOGY vol. 4, no. 1, January 2003, pages 34 - 39, XP002491068 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9618474B2 (en) 2014-12-18 2017-04-11 Edico Genome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US9859394B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US9857328B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Chemically-sensitive field effect transistors, systems and methods for manufacturing and using the same
US10006910B2 (en) 2014-12-18 2018-06-26 Agilome, Inc. Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same
US10020300B2 (en) 2014-12-18 2018-07-10 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10429342B2 (en) 2014-12-18 2019-10-01 Edico Genome Corporation Chemically-sensitive field effect transistor
US10429381B2 (en) 2014-12-18 2019-10-01 Agilome, Inc. Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same
US10494670B2 (en) 2014-12-18 2019-12-03 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10607989B2 (en) 2014-12-18 2020-03-31 Nanomedical Diagnostics, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10811539B2 (en) 2016-05-16 2020-10-20 Nanomedical Diagnostics, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids

Also Published As

Publication number Publication date
AU2002368549A1 (en) 2008-10-23
WO2008127213A3 (en) 2008-12-11
AU2002368549A8 (en) 2009-01-08

Similar Documents

Publication Publication Date Title
US7421347B2 (en) Identifying oligonucleotides for in vitro recombination
US7904249B2 (en) Methods for identifying sets of oligonucleotides for use in an in vitro recombination procedures
US7058515B1 (en) Methods for making character strings, polynucleotides and polypeptides having desired characteristics
US8058001B2 (en) Oligonucleotide mediated nucleic acid recombination
US7462469B2 (en) Integrated system for diversity generation and screening
WO2001023401A2 (en) Use of codon-varied oligonucleotide synthesis for synthetic sequence recombination
US20060051795A1 (en) Oligonucleotide mediated nucleic acid recombination
EP1272967A2 (en) In silico cross-over site selection
US20030054390A1 (en) Oligonucleotide mediated nucleic acid recombination
WO2008127213A2 (en) Methods, systems, and software for regulated oligonucleotide-mediated recombination
DK2253704T3 (en) Oligonucleotide-mediated recombination nucleic acid
KR20010042040A (en) Oligonucleotide mediated nucleic acid recombination

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 02808377; Country of ref document: EP; Kind code of ref document: A2)