WO1999063077A2 - Compositions of nucleic acid which alter ligand-binding characteristics and related methods and products

Publication number
WO1999063077A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
polynucleotide
polynucleotide sequence
flanking
binding site
Prior art date
Application number
PCT/US1999/012516
Other languages
French (fr)
Other versions
WO1999063077A3 (en)
Inventor
Michael J. Lane
Albert S. Benight
Brian D. Faldasz
Original Assignee
Tm Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tm Technologies, Inc.
Publication of WO1999063077A2
Publication of WO1999063077A3

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6811: Selection methods for production or design of target specific oligonucleotides or binding molecules

Definitions

  • DNA is known to undergo a wide variety of conformational alterations which are dependent on the conditions in which the DNA is found.
  • Critical to the final structure(s) adopted by DNA are the precise order of bases, the length of the DNA, the overall base content (GC/AT content) and the stability of the sequence to denaturation. While not widely appreciated, some sequences of DNA can convert relatively easily between different conformational states.
  • An example of this type of behavior is the well-documented conversion of poly dGC to a Z form (left-handed helix), as opposed to the normal right-handed B form, in the presence of a high salt environment (Pohl, FM and Jovin, TM (1972) J. Mol. Biol. 67:375-396).
  • The invention described herein relates in one aspect to altering the ligand-binding characteristics of a nucleic acid sequence of any given length without the use of small molecule pharmaceuticals, thus providing for the determination of the ligand binding characteristics of a particular nucleic acid sequence and all its related members. Accordingly, the binding affinity of a ligand for its ligand binding site may be modulated solely by the nucleotide composition of the polynucleotide sequence(s) flanking an adjacent ligand binding site.
  • Disclosed herein is a method for searching through sequence space for duplex polynucleotide sequences of a defined length (e.g., 20, 30 or 40 base pairs) that allows sampling of sequence space of any given length over the four nucleotides A, C, G, and T.
  • Another embodiment features a method of ranking the relative reactivity of the duplex polynucleotide sequences when flanking a ligand binding site.
  • Yet another embodiment employs a method which is advantageously computer-implemented for grouping duplex polynucleotide sequences of a defined length, e.g., 20, 30 or 40 base pairs in length into families within which each member of a family is related to another member within the same family by virtue of its similar relative binding affinity imparted to a flanking ligand binding site.
  • An advantageous embodiment in accordance with the present disclosure includes an isolated polynucleotide sequence which is a member of a mutant sequence family, which family can be described by application of the following steps (a)-(d) to a seed sequence of interest: (a) providing a minimum number of base positions to be mutated simultaneously (MS1); (b) providing a maximum number of base positions to be mutated (MX1); (c) reading the seed sequence; and (d) executing a computer program which, using the values of MS1 and MX1, generates the mutant sequences comprised in the mutant sequence family from the seed sequence.
  • Another advantageous embodiment includes the mutant sequence family itself as described above.
  • the seed sequence is a flanking sequence conferring a relative binding affinity to an adjacent polynucleotide binding site for a given ligand, which relative binding affinity can be described by application of methods provided herein below for isolating and ranking such flanking sequences.
  • the isolated polynucleotide sequence may desirably be, e.g., at least 20, 30, 40 or 50 base pairs long.
  • The isolated polynucleotide sequence may also desirably be, e.g., at least 30%, 40%, 60% or 80% homologous to the seed sequence, and may also desirably have a relative binding constant within ±5% of that of the seed sequence.
  • the computer program described in the previous paragraph for generating the mutant sequences comprised in the mutant sequence family from the seed sequence may desirably employ further constraints for reducing the cumbersome nature of the data generated, i.e., the number of mutant sequences generated, while ensuring that those mutant sequences of most interest are generated.
  • These constraints include some or all of the following: providing a minimum space value (SV1) of greater than 1 between bases to be mutated in the flanking polynucleotide sequence; providing a ratio of bases G to C (GC1) and/or a ratio of bases A to T (AT1) which must be maintained in the mutant sequences; and specifying a region of said flanking sequence which is not to be mutated in generating the mutant sequence(s).
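The mutant-family generation described above can be illustrated with a minimal Python sketch. It is not the program referred to in the disclosure: the function name `mutant_family`, the reading of SV1 as a minimum index distance between mutated positions, and the `protected` set of positions are assumptions made for illustration, and the GC1/AT1 ratio constraints are omitted for brevity.

```python
from itertools import combinations, product

BASES = "ACGT"

def mutant_family(seed, ms1, mx1, sv1=1, protected=frozenset()):
    """Yield mutants of `seed` with between ms1 and mx1 substituted positions.

    sv1 (assumed meaning): minimum index distance between mutated positions.
    protected (hypothetical): 0-based positions that must not be mutated.
    """
    positions = [i for i in range(len(seed)) if i not in protected]
    for k in range(ms1, mx1 + 1):
        for combo in combinations(positions, k):
            # combo is sorted, so checking neighbours enforces the spacing rule
            if any(b - a < sv1 for a, b in zip(combo, combo[1:])):
                continue
            # each mutated position receives a base different from the seed's
            choices = [[b for b in BASES if b != seed[i]] for i in combo]
            for replacement in product(*choices):
                mutant = list(seed)
                for i, base in zip(combo, replacement):
                    mutant[i] = base
                yield "".join(mutant)

# Single-site mutants of a 4-base seed: 4 positions x 3 alternative bases each
single_site = list(mutant_family("ACGT", ms1=1, mx1=1))
```

Raising `sv1` or adding positions to `protected` shrinks the family, which is the stated purpose of the constraints: fewer mutant sequences to handle while the region of interest is still explored.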
  • the present invention further relates to chromosomes having inserted therein the isolated polynucleotide sequences disclosed herein, and in particular those in which the isolated polynucleotide sequence is operationally coupled to the polynucleotide binding site, e.g., it is coupled to the binding site such that the isolated polynucleotide sequence has the anticipated effect on binding.
  • Other genomic material such as DNA constructs, expression vectors, and animal genomes, having inserted therein the isolated polynucleotide sequences disclosed herein, and in particular those in which the isolated polynucleotide sequence is operationally coupled to the polynucleotide binding site are also included.
  • the present invention further relates to living cells comprising genomic material such as DNA constructs, expression vectors, and animal genomes, having inserted therein the isolated polynucleotide sequences disclosed herein, and in particular those in which the isolated polynucleotide sequence is operationally coupled to the polynucleotide binding site.
  • the majority of duplex polynucleotide sequences of any specified length are created that allow sampling of any given length sequence space utilizing the four nucleotides A, C, G, and T.
  • the sequences may be randomly synthesized but can also include any predetermined combination of the nucleotides A, G, C, or T.
  • the majority of duplex polynucleotide sequences of any specified length are created that allow sampling of any given length sequence space utilizing the four nucleotides A, C, G, and T, wherein the sequences are further modified chemically, for example by a methyl transferase.
  • the sequences may be randomly synthesized but can also include any predetermined combination of the nucleotides A, G, C, or T.
  • a method for isolating and ranking duplex polynucleotide sequences which confer a given relative binding affinity towards a particular duplex polynucleotide ligand comprising a plurality of duplex polynucleotide molecules wherein each of the duplex polynucleotide molecules comprises a predetermined ligand binding site, the binding site being flanked by a randomly synthesized or other duplex polynucleotide sequence; exposing the duplex polynucleotide molecules to a ligand selective for the binding site; isolating ligand bound from ligand unbound duplex polynucleotide molecules; amplifying the duplex polynucleotide molecules; sequencing each of the duplex polynucleotide molecules to determine the sequence identity of the duplex polynucleotide sequence flanking the ligand binding site.
  • The restriction endonuclease BamHI is added to a population of duplex polynucleotide molecules, wherein the duplex polynucleotide molecules comprise a BamHI binding site (5'-GGATCC-3') flanked on either side by, or adjacent to, a stretch of nucleotides, under conditions which allow BamHI to contact its binding site but not cleave it. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to BamHI.
  • the bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
  • The restriction endonucleases MspI or HpaII are added to a population of duplex polynucleotide molecules, wherein the duplex polynucleotide molecules comprise a methylated (5'-CᵐCGG-3') or unmethylated (5'-CCGG-3') MspI and HpaII binding site flanked on either side by, or adjacent to, a stretch of polynucleotides, under conditions which allow MspI or HpaII to contact its binding site but not cleave it. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to MspI or HpaII.
  • the bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
  • The enzyme HpaII methylase is added to a population of nucleic acid molecules which are ligated into a synthetic DNA construct directly flanking or adjacent to a polynucleotide duplex ligand binding site comprising the sequence 5'-CCGG-3'.
  • The restriction endonucleases MspI or HpaII are subsequently added under conditions which allow MspI or HpaII to contact its binding site but not cleave it.
  • Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to MspI or HpaII.
  • the bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
  • a subpopulation of nucleic acid molecules are ranked based on their ability to confer relatively high or low binding affinity of a particular ligand towards its ligand binding site when flanked by duplex polynucleotide sequences.
  • methods are provided for determining the ability of a pre-selected duplex polynucleotide sequence to influence the relative binding affinity of a duplex polynucleotide sequence for its ligand which is remote from the pre-selected duplex polynucleotide sequence.
  • Remote sequences may include enhancers or proximal-promoter sequences.
  • methods are provided for increasing or decreasing the mutation rate of a pre-selected duplex polynucleotide sequence, e.g., by placing heterologous duplex polynucleotide sequence(s) flanking the pre-selected duplex polynucleotide sequence, wherein the heterologous duplex polynucleotide sequence flanking the pre-selected duplex polynucleotide sequence is known to influence the binding affinity of a mutagen to the pre-selected duplex polynucleotide sequence.
  • methods are provided for predicting mutation-prone regions of a duplex polynucleotide sequence by virtue of homology to duplex polynucleotide sequences with known flanking effect on relative binding affinity.
  • A ligand is added to a population of polynucleotide molecules, wherein the polynucleotide molecules comprise a double stranded polydeoxyribonucleic acid (DNA) ligand binding site flanked on either side by, or adjacent to, a single stranded or double stranded polyribonucleic acid (RNA) sequence, under conditions which allow the ligand to contact its binding site. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to the ligand.
  • The bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
  • a ligand is added to a population of polynucleotide molecules wherein the polynucleotide molecules comprise a single or double stranded polyribonucleic acid (RNA) ligand binding site flanked on either side or adjacent to a duplex polydeoxyribonucleic acid (DNA) sequence under conditions which allow the ligand to contact its binding site. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to the ligand.
  • the bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
  • Figure 1 shows a schematic representation of the method of ranking subsets of any duplex polynucleotide sequence flanking a ligand binding site based on the ability of the subsets of duplex polynucleotide flanking sequences to confer relative binding affinity for a ligand to its binding site.
  • BamHI is the ligand.
  • Figure 2 shows an autoradiograph of a BamHI band shift assay for the selection by "relative binding affinity" of flanking DNA sequence sub-populations from a synthetically generated random population which confer altered BamHI binding affinity for an adjacent BamHI binding site. All duplex polynucleotide sequences are 32P end-labeled.
  • Figure 3 shows an autoradiograph of a Sau3AI band shift assay of a random or repeated duplex polynucleotide sequence flanking a Sau3AI binding site. All duplex polynucleotide sequences are 32P end-labeled.
  • Figure 4 depicts a block diagram of a computer system that is suitable for practicing an exemplary embodiment of the present invention.
  • Figure 5 is a flow chart that illustrates steps that are performed by the main program of the sequence generating facility.
  • Figure 6 is a flow chart that depicts the steps that are performed by the Mutate() function.
  • Figure 7 depicts an example of the mutant DNA sequence generation performed by the exemplary embodiment of the present invention.
  • Figure 8 illustrates a first example of an output that is generated by the sequence generating facility.
  • Figure 9 illustrates a second example of an output generated by the sequence generating facility.
  • A "polynucleotide" or "polynucleotide sequence" shall mean multiple nucleotides, i.e., molecules comprising a sugar (e.g., ribose or deoxyribose) linked to a phosphate group and to an exchangeable organic base, which is either a substituted pyrimidine (e.g., cytosine (C), thymidine (T) or uracil (U)) or a substituted purine (e.g., adenine (A) or guanine (G)).
  • Polynucleotides can be obtained from existing nucleic acid sources (e.g., genomic DNA or cDNA), but can also be synthetic (e.g., produced by oligonucleotide synthesis) DNA or RNA.
  • A "ligand" shall mean any chemical moiety selected from the group consisting of: a compound which binds to a duplex polynucleotide in a sequence-specific way; a compound which binds to a duplex polynucleotide sequence in a non-specific way; a protein; an enzyme; an enzyme which alters the structure of a duplex polynucleotide sequence to which it binds; an enzyme which alters the structure of a duplex polynucleotide sequence to which it binds by breaking or forming a covalent or non-covalent bond between an atom of the nucleic acid and another atom; an enzyme which cleaves one or both strands of a duplex polynucleotide sequence to which it binds; a restriction enzyme; a restriction endonuclease; an enzyme which methylates a duplex polynucleotide sequence to which it binds; an enzyme which alkylates a duplex polynucleot
  • a "ligand binding site” or “binding site” shall mean any domain or subdomain in a duplex polynucleotide molecule which directly contacts a ligand by hydrogen bonding, van der Waals radius interactions and/or electron cloud interaction with the bases of a nucleic acid molecule or indirectly via a salt or water molecule.
  • A "flanking sequence" is a polynucleotide sequence located adjacent to a ligand binding site of a polynucleotide molecule.
  • A flanking sequence or flanking sequences may be 5', 3', or both 5' and 3' of the ligand binding site.
  • a “remote sequence” can be any regulatory polynucleotide sequence which is located at a great distance either 5' or 3' from the ligand binding site or from the flanking polynucleotide sequence adjacent the ligand binding site.
  • Remote sequences may be placed in either orientation, i.e., 5' → 3' or 3' → 5', relative to the ligand binding site or the flanking polynucleotide sequence adjacent to the ligand binding site. Examples of remote sequences are enhancers or proximal-promoter sequences.
  • "Relative binding affinities" of nucleotide flanking sequences are measured for different nucleic acid molecules in which different flanking sequences are located adjacent to the same ligand binding site.
  • If a first molecule having a first flanking sequence (or pair of flanking sequences) is found to bind to a ligand in preference to a second molecule having a second flanking sequence (or pair of flanking sequences), then the first molecule is said to have a binding affinity that is "relatively higher" than the binding affinity of the second molecule.
  • An endonuclease is said to be “substantially free of cleavage activity" when under the given conditions, there is substantially no observable cleavage.
  • A "pure repeat" is a repeating DNA sequence for which all base positions are defined. For example, if the nucleotides in a pure dinucleotide repeat are A and G, then a pure dinucleotide repeat is (AG)n, where n is the number of times (AG) is repeated. Similarly, if the nucleotides in a trinucleotide repeat are A, G, and C, then a pure trinucleotide repeat is (AGC)n, where n is the number of times (AGC) is repeated.
  • An "impure repeat" is a repeating DNA sequence for which one or more base positions allow for the insertion of a random nucleotide. For example, if one of the nucleotides in an impure dinucleotide repeat is A and the random nucleotide is X, where X is either A, C, G, or T, then an impure dinucleotide repeat is 5'-(AX)n-3', where n is the number of times (AX) is repeated.
  • Similarly, an impure trinucleotide repeat is 5'-(AGX)n-3', where n is the number of times (AGX) is repeated.
  • the definition is not intended to be limiting to dinucleotide or trinucleotide repeats and can be extended to tetranucleotide repeats, pentanucleotide repeats, and higher repeating units.
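The pure and impure repeat definitions can be made concrete with a short Python sketch. The function names `pure_repeat` and `impure_repeats` are hypothetical, and the sketch assumes that each X position varies independently over A, C, G and T.

```python
from itertools import product

def pure_repeat(unit, n):
    """Return the pure repeat (unit)n, e.g. (AG)3 -> AGAGAG."""
    return unit * n

def impure_repeats(unit, n, alphabet="ACGT"):
    """Enumerate every sequence matching (unit)n, where each 'X' is any base."""
    template = unit * n
    # a fixed base contributes one choice; an X contributes the whole alphabet
    slots = [alphabet if ch == "X" else ch for ch in template]
    return ["".join(p) for p in product(*slots)]

ag3 = pure_repeat("AG", 3)       # 'AGAGAG'
ax2 = impure_repeats("AX", 2)    # 16 sequences of the form A?A?
```

The number of sequences matched by an impure repeat grows as 4 to the power of the number of X positions, so enumeration like this is only practical for short repeats.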
  • a "family" of DNA sequences is a group of DNA sequences that are related by virtue of their conforming to the rules defined herein for governing the ability of the polynucleotide sequence to influence binding of a ligand to a ligand binding site located adjacent to the DNA sequence.
  • A "frame" of a DNA repeat refers to the minimum successively repeating DNA sequence or motif in a DNA sequence. For example, one frame in the sequence 5'-GCGCGC-3' would be GC. In the sequence 5'-GCTGCTGCT-3', one frame would be GCT.
  • An "initial frame" of a DNA repeat refers to the minimum successively repeating DNA sequence or motif in a DNA sequence beginning with the 5'-most nucleotide of the repeating DNA sequence or motif.
  • For example, the initial frame in the sequence 5'-GCGCGC-3' would be GC.
  • The initial frame in the sequence 5'-GCTGCTGCT-3' would be GCT.
  • A "shifted frame" or "frame shifting" is any of the unique, non-initial frames in a repeating DNA sequence or motif.
  • For example, in the sequence 5'-GCGCGC-3', a shifted frame is CG.
  • In the sequence 5'-GCTGCTGCT-3', a shifted frame is CTG.
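A minimal Python sketch of the frame definitions follows. The function names are hypothetical, and the sketch assumes the input is an exact whole number of copies of its repeating unit, as in the examples given here.

```python
def initial_frame(seq):
    """Shortest unit, starting at the 5'-most base, whose repetition gives seq."""
    for size in range(1, len(seq) + 1):
        if len(seq) % size == 0 and seq[:size] * (len(seq) // size) == seq:
            return seq[:size]

def shifted_frames(seq):
    """All unique non-initial frames, i.e. the other rotations of the initial frame."""
    unit = initial_frame(seq)
    rotations = [unit[i:] + unit[:i] for i in range(1, len(unit))]
    # dict.fromkeys removes duplicates while preserving order
    return [r for r in dict.fromkeys(rotations) if r != unit]

frame = initial_frame("GCTGCTGCT")     # 'GCT'
shifted = shifted_frames("GCTGCTGCT")  # ['CTG', 'TGC']
```

A sequence that is not itself a repeat is returned unchanged by `initial_frame`, since the whole sequence is then its own minimum unit.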
  • a “regulatory sequence” or “polynucleotide regulatory sequence” is a polynucleotide sequence which when contacted by a ligand regulates the transcription of a biologically functional gene associated with the regulatory sequence.
  • A "promoter", "polynucleotide promoter sequence" or "promoter sequence" is any DNA sequence which transcription factor(s) and/or RNA polymerase contacts.
  • the promoter determines the polarity of the transcript by specifying which strand will be transcribed. Promoters can be classified according to their “strength”; that is, the relative frequency of transcription initiation (times per minute) at each promoter. Thus, RNA polymerase initiates transcription at a high frequency at strong promoters and at low frequency at weak promoters.
  • An "enhancer" is any regulatory DNA sequence which a protein or proteins contact, influencing the rate of transcription of a biologically functional gene associated with the enhancer. Contact of the enhancer by the protein or proteins may either stimulate or decrease the rate of transcription of the associated gene. Enhancers may be located at a great distance either 5' or 3' from the transcription start site of the biologically functional gene they control. Enhancers may also regulate transcription of the associated gene when placed in either orientation, i.e., 5' → 3' or 3' → 5', relative to the gene whose transcription they control.
  • A "proximal-promoter sequence" or "proximal-promoter element" is any regulatory sequence that is located close to (within 200 base pairs of) a promoter and binds a protein or proteins, thereby modulating the transcription of the biologically functional gene associated with the promoter.
  • The proximal-promoter sequence can occur 5' or 3' of the transcription start site of the biologically functional gene.
  • An “operator” is a short polynucleotide DNA sequence in a bacterial or viral genome which contacts a protein or proteins and regulates transcription of an associated biologically functional gene.
  • A "Long Terminal Repeat" or "LTR" is a regulatory polynucleotide sequence of viral origin (DNA tumor viruses or retroviruses) comprising integration signals for integrating into the host genome, an enhancer, a promoter and a polyadenylation site.
  • Sequence identity or homology refers to the sequence similarity between two polypeptide molecules or between two nucleic acid molecules. When a position in both of the two compared sequences is occupied by the same base or amino acid monomer subunit, e.g., if a position in each of two DNA molecules is occupied by adenine, then the molecules are homologous or sequence identical at that position.
  • the percent of homology or sequence identity between two sequences is a function of the number of matching or homologous identical positions shared by the two sequences divided by the number of positions compared x 100. For example, if 6 of 10, of the positions in two sequences are the same then the two sequences are 60% homologous or have 60% sequence identity.
  • the DNA sequences ATTGCC and TATGGC share 50% homology or sequence identity.
  • a comparison is made when two sequences are aligned to give maximum homology.
  • Loop-out regions, e.g., those arising from deletions or insertions in one of the sequences, are counted as mismatches.
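The percent homology calculation defined above can be sketched in Python as follows. The function name `percent_identity` and the use of "-" as the gap character are assumptions, and the sketch expects two sequences that have already been aligned to equal length; it does not perform the alignment itself.

```python
def percent_identity(a, b, gap="-"):
    """Matching positions divided by positions compared, times 100.

    Gap positions (loop-out regions) are counted as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        raise ValueError("empty alignment")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)

# The example pair from the text: 3 of 6 positions match
example = percent_identity("ATTGCC", "TATGGC")  # 50.0
```

In practice the aligned inputs would come from an alignment program, and the result depends on how that alignment was produced.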
  • the comparison of sequences and determination of percent homology between two sequences can be accomplished using a mathematical algorithm.
  • the alignment can be performed using the Clustal Method.
  • Gapped BLAST can be utilized as described in Altschul et al, (1997) Nucleic Acids Research 25(17):3389-3402.
  • When utilizing the BLAST and Gapped BLAST programs, the default parameters of the respective programs (e.g., BLASTX and BLASTN) can be used. See http://www.ncbi.nlm.nih.gov.
  • Another preferred non-limiting example of a mathematical algorithm utilized for the comparison of sequences is the algorithm of Myers and Miller, CABIOS (1989).
  • a "DNA expression vehicle” is a DNA sequence which can be transcribed to produce RNA to be translated and is constructed in vivo or in vitro by methods known to those skilled in the art. The sequence may or may not be capable of sustaining its own replication.
  • a "seed sequence” is a duplex polynucleotide sequence which is isolated and the nucleotide composition of the sequence determined physically by methods well known in the art.
  • the sequence may have the ability to alter the relative binding affinity of a ligand to an adjacent flanking ligand binding site compared to the pre-existing polynucleotide sequence flanking the ligand binding site.
  • the seed sequence can be used to generate a family of polynucleotide sequences that are related by virtue of their conforming to the rules defined herein for governing the ability of the polynucleotide sequence to influence binding of a ligand to a ligand binding site located adjacent to the DNA sequence.
  • the seed sequence may also be a flanking sequence.
  • the invention relates to a method for selecting duplex polynucleotide flanking sequences, e.g., duplex polynucleotide sequences which flank (in the 3' and/or 5' direction) a selected polynucleotide sequence (such as a ligand binding site) from a population of molecules of identical structure.
  • the method of the invention is useful for determining flanking polynucleotide sequences which provide desired characteristics, such as Tm, ability of the ligand binding site to bind a ligand, ability of a ligand to react with the polynucleotide sequence, stability of the polynucleotide sequence, and the like.
  • the relative binding affinity of a polynucleotide sequence for a ligand can be modulated by providing appropriate flanking polynucleotide sequences.
  • a flanking polynucleotide sequence(s) can be selected from a mixture of family sequences to identify which sequences increase the ability of a ligand to bind to a ligand binding site; decrease the ability of a ligand to bind to a ligand binding site; increase the mutability of the polynucleotide sequence; decrease the mutability of the polynucleotide sequence; and the like.
  • the method includes the steps of providing a polynucleotide sequence which includes a ligand binding site and at least one polynucleotide sequence which flanks the ligand binding site (e.g., in the 3' and/or 5' direction); and determining the ability of the ligand to bind to the ligand binding site.
  • A plurality (such as a combinatorial library) of polynucleotide sequences, each including the same ligand binding site and different (e.g., randomly differing) flanking polynucleotide sequences, is provided.
  • the mixture of polynucleotide sequences can be screened against a limiting concentration of the ligand, and polynucleotide sequences which preferentially bind to the ligand (or do not bind to the ligand) can be selected, and (preferably) are then sequenced to determine an appropriate flanking polynucleotide sequence(s).
  • Example 1 describes the preparation of a plurality of polynucleotide DNA sequences; each sequence includes the binding site for the restriction enzyme BamHI, flanked on each end by any polynucleotide sequence of any defined length, e.g., 20, 30 or 40 base pairs in length.
  • The mixture of polynucleotide sequences was titrated with a known concentration of BamHI under conditions where substantially no cleavage takes place, and those polynucleotide sequences which bound most strongly to BamHI were selected (in this example, by gel shift assay and recovery of the shifted bands). Similarly, the poorest-binding polynucleotide sequences were selected.
  • flanking polynucleotide sequences which conferred increased or decreased ligand-binding ability on the ligand binding site.
  • Tables 1-4 list the polynucleotide sequences obtained by the method described above. As discussed above, the ability of a polynucleotide sequence to bind to a ligand is believed to be related solely to the nucleotide composition of the flanking polynucleotide sequence. It is further believed that the ability of a polynucleotide sequence to bind to ligands is at least largely independent of the ligand selected.
  • A flanking polynucleotide sequence which lowers the relative binding affinity of any one ligand to a ligand binding site adjacent to the flanking polynucleotide sequence will also lower the relative binding affinity of other ligands to the ligand binding site.
  • the particular ligand selected for use according to the methods of the invention to determine the ability of a flanking polynucleotide sequence to affect ligand binding, is a matter of convenience and design choice which will be routine to one of ordinary skill in the art. It will be appreciated that flanking polynucleotide sequences which confer particular ligand binding attributes upon a neighboring polynucleotide sequence will have many potential uses.
  • flanking polynucleotide sequences can be selected to promote binding to a ligand, such as an RNA or DNA binding protein, a polymerase, a reverse transcriptase, a telomerase, a helicase, a transcription factor, and the like.
  • the invention provides methods for selecting flanking polynucleotide sequences which can be used in vivo, e.g., to study the interaction of ligands and nucleic acids, or to provide improved probes or primers for PCR amplification, and the like.
  • the flanking sequences of a polynucleotide can also be provided in a nonrandom manner.
  • flanking polynucleotide sequence can be provided, e.g., by oligonucleotide chemical or biochemical synthesis, to provide a flanking region of any known sequence.
  • This flanking polynucleotide sequence can then be tested to determine the effect on ligand binding.
  • One particularly preferred practice of the invention involves the construction of a plurality of oligonucleotides, each including a ligand binding site flanked by at least one flanking polynucleotide sequence which has a known nucleotide composition. The effect on the relative binding affinity of an adjacent flanking ligand binding site for its ligand can then be assayed, e.g., as described herein.
  • This embodiment of the invention is useful in constructing sequence reactivity data compilations, e.g., a database, which quantifies the effect on stability of any possible flanking sequence (see infra).
  • flanking polynucleotide sequences which have the ability to influence the binding affinity of the DNA binding ligand to its binding site and ultimately affect the functioning of the ligand.
  • the method described above could be used to identify polynucleotide sequences flanking the binding site in DNA of a transcription factor which can adversely affect or promote the binding of that particular transcription factor to its binding site.
  • MyoD is a transcription factor which plays a role in muscle development
  • N is any one of the nucleotides A, C, T, or G as the ligand binding site between two randomly synthesized duplex polynucleotide sequences
  • duplex polynucleotide molecules comprising randomly synthesized polynucleotide sequences flanking either side of the MyoD binding site which can be isolated as bound complexes with MyoD can be said to confer high binding affinity of MyoD to its binding site and duplex polynucleotide sequences which are isolated unbound with MyoD can be said to confer low binding affinity of MyoD to its binding site.
  • the randomly synthesized polynucleotide sequences can be ranked in order of their ability to influence binding affinity of the ligand to its binding site.
  • a dinucleotide repeat with a repeat length of 2, there are 16 possible polynucleotide sequence motifs: (AA)n, (AC)n, (AG)n, (AT)n, (GA)n, (GC)n, (GG)n, (GT)n, (CA)n, (CG)n, (CC)n, (CT)n, (TA)n, (TC)n, (TG)n, (TT)n.
  • the dinucleotide repeat motifs also include the single-nucleotide repeats.
  • the polynucleotide sequences generated above are examined and grouped together into families according to equivalence by complementarity, for example, by execution of a computer program.
  • the last base position of each frame is replaced with a nucleotide "X,” indicating that the base position may be any base permitted by the DNA synthesis process.
  • X nucleotide
  • the following impure repeats result from this step: (AX)n, (CX)n, (GX)n, (TX)n.
  • the unique polynucleotide sequences are examined and grouped together into families according to equivalence by complementarity, for example by execution of a computer program. Since duplex DNA is assumed, the impure dinucleotide repeat motif (AX)n is equivalent to the motif (XT)n by complementarity.
  • dinucleotide repeats illustrate a method for creating six families of polynucleotide sequence motifs which represent an efficient and economical means of synthesizing and characterizing the relative reactivities of polynucleotide sequences from the sixteen individual motifs that would have to be synthesized and characterized if the method of this invention were not employed.
  • For tri-nucleotide repeats sixty-four original sequence motifs can be reduced to just ten families of polynucleotide sequence motifs.
  • the utility of the above invention is not limited to repeat sequences illustrated above but can be extended to tetranucleotide repeats, pentanucleotide repeats, and higher repeating units.
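The grouping of repeat motifs into families can be sketched in a few lines of Python. This is a minimal illustration, not the program referred to in the text: it assumes that two pure repeat motifs belong to the same family when one can be obtained from the other by reverse complementation (duplex DNA) or by shifting the repeat frame. The frame-shift rule and the function names `canonical` and `repeat_families` are assumptions introduced here.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(motif):
    """Return the motif read 5'->3' on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(motif))


def rotations(motif):
    """All frame shifts of a repeat unit, e.g. AC -> {AC, CA}."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def canonical(motif):
    """Alphabetically smallest member of the motif's family (assumed rule)."""
    return min(rotations(motif) | rotations(reverse_complement(motif)))


def repeat_families(k):
    """Group all 4**k repeat units of length k by their canonical member."""
    families = {}
    for bases in product("ACGT", repeat=k):
        motif = "".join(bases)
        families.setdefault(canonical(motif), []).append(motif)
    return families


for representative, members in sorted(repeat_families(2).items()):
    print(representative, members)
```

Under these assumptions the sixteen dinucleotide motifs collapse into six families, so only one representative per family would need to be synthesized and characterized.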
  • the invention provides a method for determining polynucleotide sequences which are more (or less) prone to mutation.
  • the method comprises the steps of: constructing a database of polynucleotide sequences of length n which exert a known binding affinity on a flanking ligand binding site; taking any polynucleotide sequence (query sequence), including a naturally occurring polynucleotide sequence, and dividing it into all possible polynucleotide sequences of length n by position along the length of the query sequence; finding the most homologous polynucleotide sequence of length n in the database on either the 5' side or 3' side of one strand, on both the 5' side and 3' side of one strand, on either the 5' side or 3' side of the opposing strand, or on both the 5' side and 3' side of the opposing strand of the position of interest; determining the relative binding affinity of the polynucleotide sequence for a ligand conferred by a
  • the above method is further illustrated by the following example. Let's assume that a database of polynucleotide sequences of 40 base pairs in length with known relative binding affinities has been constructed by the method described herein. Let's also assume that one wishes to predict mutable sites in any gene of interest, the query sequence, that is 100 base pairs long. The method would include dividing the 100 base pair query sequence into all possible sequences of 40 base pairs in length beginning either at the -40 base pair position (at the 5' side) and/or at the +140 base pair position (at the 3' side).
  • the relative binding affinity of a polynucleotide sequence for a ligand conferred by the flanking polynucleotide sequence -39 to +1 (polynucleotide sequence A) at the 5' position and +101 to +140 (polynucleotide sequence Z) at the 3' position of the same strand is determined by searching the database for the most homologous sequence to polynucleotide sequences A and Z that is in the database and reading off the relative binding affinity previously determined for the most homologous polynucleotide sequence.
  • the average relative binding affinity conferred by these flanking polynucleotide sequences is calculated. This process is reiterated along the length of the query sequence at every position.
  • flanking sequence would be from position -38 to +2 (polynucleotide sequence B) at the 5' position and +100 to +139 at the 3' position (polynucleotide sequence Y) of the same strand.
  • the relative binding affinity of a polynucleotide sequence for a ligand conferred by flanking polynucleotide sequences B and Y is determined by searching the database for the most homologous sequence to polynucleotide sequences B and Y that is in the database and reading off the relative binding affinity of the most homologous polynucleotide sequences. The average relative binding affinity conferred by these flanking polynucleotide sequences is again calculated.
  • This method creates a table of relative binding affinities for each position along the query sequence.
  • the position(s) with the lowest value(s) will be the least mutable and the position(s) with the highest value(s) will be the most mutable.
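The database lookup and averaging described above can be sketched as follows. This is only an illustration under stated assumptions: "most homologous" is taken to mean the smallest Hamming distance between equal-length sequences, flanks are taken immediately 5' and 3' of each position on one strand, and the database `TOY_DATABASE` with its affinity values is invented for the example (n = 3 rather than 40 to keep it readable).

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def closest_affinity(flank, database):
    """Affinity recorded for the database entry most similar to `flank`."""
    best = min(database, key=lambda entry: hamming(entry, flank))
    return database[best]


def affinity_table(sequence, database, n):
    """Average the affinities conferred by the 5' and 3' flanks of each position."""
    table = {}
    for pos in range(n, len(sequence) - n):
        upstream = sequence[pos - n:pos]            # n bases on the 5' side
        downstream = sequence[pos + 1:pos + 1 + n]  # n bases on the 3' side
        table[pos] = (closest_affinity(upstream, database)
                      + closest_affinity(downstream, database)) / 2
    return table


# Hypothetical database: flank sequence -> relative binding affinity.
TOY_DATABASE = {"AAA": 0.2, "GGG": 0.9, "ATA": 0.4, "GCG": 0.8}

table = affinity_table("AAACGGGATA", TOY_DATABASE, n=3)
most_mutable = max(table, key=table.get)   # highest value
least_mutable = min(table, key=table.get)  # lowest value
print(table, most_mutable, least_mutable)
```

A real implementation would also cover the opposing strand and would need a scoring scheme for homology that matches how the database was built.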
  • the method of the invention can also be used to determine regions of a polynucleotide sequence, including a gene, which are more (or less) likely to mutate, e.g., in response to a selection pressure on the organism.
  • the method of the invention is useful, e.g., for determining which portions of a gene are optimal targets for design of probes, e.g., for the detection of the presence of a microorganism in a biological sample.
  • a probe which is complementary to a portion of bacterial polynucleotide sequence can be used to detect the presence of the bacterium in a biological sample, e.g., to detect bacterial infection, as is well known in the art.
  • the probe will no longer bind (or will bind with decreased affinity) to the polynucleotide sequence of the mutated bacterium, thus rendering detection of the bacterium more difficult.
  • a probe can be designed to be complementary to a portion of the bacterial polynucleotide sequence which is less prone to mutation and thus the probability that the probe will be rendered useless by subsequent mutation is decreased.
  • the method of the invention is also useful for determining functionally important portions of a protein which is encoded by a polynucleotide sequence. Without wishing to be bound by theory, it is believed that polynucleotide sequences which code for critical residues or regions of the protein will reside in regions of the gene which are relatively resistant to deleterious mutations which would decrease or abolish the desired function of the protein.
  • critical residues of the encoded protein can be identified.
  • the method of the invention can be used to determine or predict regions of a protein or polypeptide which are antigenically important. Due to the degeneracy of the genetic code, a plurality of polynucleotide sequences can often code for a single polypeptide. Routine computational methods allow the determination of the relative binding affinity of each polynucleotide sequence which encodes a selected polypeptide. For a given polypeptide, the binding affinity of a naturally-occurring coding polynucleotide sequence can be compared to the binding affinities of all possible polynucleotide sequences which could code for that polypeptide, to determine coding sequences of high or low relative binding affinity.
D. Method for Modulating Mutation Rate of Polynucleotide Sequences
  • the invention provides means for altering the susceptibility to mutation of a polynucleotide sequence.
  • a region of DNA of interest in a host cell or organism can be made less prone to mutation, e.g., to prevent mutation of a region of the DNA by inserting adjacent to the region of DNA a heterologous duplex polynucleotide sequence whose nucleotide composition is known to decrease the binding affinity of a mutagen to the region of DNA; similarly, a region of DNA of interest in a host cell or organism can be made more prone to mutation, e.g.
  • a heterologous duplex polynucleotide sequence whose nucleotide composition is known to increase the binding affinity of a mutagen to the region of DNA is inserted adjacent to the region of DNA whose susceptibility to mutation is to be increased.
  • This method can thereby provide a method for producing non-naturally occurring polynucleotide sequences (and proteins encoded by them); this is a form of "directed evolution" in that a particular gene or portion thereof can be targeted for mutation without increasing the propensity for mutation of other regions of the genome.
  • Such non-naturally occurring proteins can be assayed to determine properties such as binding specificity, binding affinity, rate of catalysis of a reaction, and the like, to identify proteins which have desirable characteristics.
  • the method of the invention can be used to speed the process of preparing and selecting mutant proteins.
  • a polynucleotide sequence, and the protein encoded thereby, can be "protected" to prevent mutations, e.g., by altering the nucleotide composition of the polynucleotide sequence, a nearby (e.g., flanking) polynucleotide sequence or a remote polynucleotide sequence to decrease the relative binding affinity of a mutagen for the polynucleotide sequence encoding the protein.
  • a method of this invention can be used to modify the nucleotide composition of the flanking polynucleotide sequence or a pair of flanking polynucleotide sequences of any regulatory DNA sequence that alters the relative frequency of transcription initiation as compared to the pre-existing flanking polynucleotide sequence or a pair of flanking polynucleotide sequences.
  • Enhancers and proximal-promoter sequences are examples of such sequences. Replacing the native flanking polynucleotide sequence or a pair of flanking polynucleotide sequences adjacent to an enhancer can alter the frequency of transcription initiation of the promoter the enhancer regulates.
  • telomere sequences that regulate the expression of a biologically functional DNA cloned into the vector.
  • pcDNA 3.1(+/-) from Invitrogen is one example of such an expression vector.
  • the expression of the desired cloned gene can be further improved by replacing the existing flanking polynucleotide sequences adjacent the enhancer-promoter sequences with heterologous flanking polynucleotide sequences whose nucleotide composition confers a higher binding affinity of the promoter-enhancer for transcription factor(s) and/or RNA polymerase thereby increasing the frequency of transcription initiation and in turn increasing cellular output of the desired product.
  • the ability to increase the cellular output of a biopharmaceutical by simply altering the nucleotide composition of the polynucleotide sequences flanking the promoter of the desired biologically functional gene is of great economic value.
  • the ability to alter the expression of a gene simply by altering the nucleotide composition of polynucleotide sequences flanking its promoter and/or enhancer also offers great promise in the treatment of human pathological conditions.
  • the present invention can be utilized to increase the expression of a gene whose protein product can be increased to treat a disease in vivo.
  • the STATs are a family of latent cytoplasmic proteins that are activated to participate in gene control when cells encounter various extracellular signals.
  • the STATs have been shown to be involved in the induction of cell death (apoptosis). Apoptosis is initiated by activation of a cascade of enzymes that cleave cellular proteins, resulting in the efficient termination of the cell.
  • STATs activate genes containing GAS (Gamma-Activated Sequence) elements.
  • the present invention can be utilized to replace the polynucleotide flanking sequence or a pair of polynucleotide flanking sequences adjacent to the GAS element with a heterologous polynucleotide flanking sequence or a pair of heterologous polynucleotide flanking sequences whose nucleotide composition confers a relatively higher binding affinity for the STATs as compared to the naturally occurring flanking polynucleotide sequences, and thereby increase the expression of genes which will induce premature death of unwanted cells or cells that have lost control of their own growth.
  • this invention can also be utilized to reduce and perhaps even entirely inhibit the expression of undesirable genes whose protein products can lead to the manifestation of pathological conditions.
  • the naturally occurring adjacent flanking polynucleotide sequence or a pair of naturally occurring adjacent flanking polynucleotide sequences can be excised and replaced by a heterologous flanking polynucleotide sequence or a pair of heterologous polynucleotide flanking sequences whose nucleotide composition confers a relatively lower binding affinity of the transcription factor(s) and/or RNA polymerase for the promoter as compared to the naturally occurring adjacent polynucleotide flanking sequence or a pair of adjacent polynucleotide flanking sequences.
  • the resulting promoter will have a lower relative frequency of transcription initiation as compared to the promoter flanked by the original polynucleotide sequences in the genome and thus reduce or perhaps entirely inhibit the expression of the unwanted gene product.
  • a computer system may be utilized to generate mutant DNA polynucleotide sequences from a seed sequence.
  • Each element in the seed sequence is a base (i.e., adenine, guanine, thymine or cytosine), and each element occupies a given base position within the polynucleotide sequence.
  • the number of elements within the polynucleotide sequence may vary.
  • the computer system generates mutant DNA polynucleotide sequences (which may include International Union of Pure and Applied Chemistry nucleic acid ambiguity codes in the place of one or more mutated positions in the seed sequence) that have a desired degree of homology relative to the seed sequence.
  • mutant DNA polynucleotide sequences that have an 80% or greater homology relative to a DNA polynucleotide sequence having five base elements.
  • each of the mutant DNA polynucleotide sequences differs from the seed sequence by one element.
  • the computer system may be adapted to provide mutant DNA polynucleotide sequences with at least a minimum degree of homology requested by the user. For example, a user may request a 95% homology or greater rather than an 80% homology or greater.
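The relation between a requested homology and the number of positions that may be mutated can be made explicit with a short calculation. The sketch below assumes that "homology" means percent identity between a mutant and a seed of equal length; the function names are illustrative and not taken from the patent.

```python
def max_mutations(seed_length, min_homology_percent):
    """Largest number of mismatches that still meets the requested percent identity."""
    # Integer arithmetic avoids floating-point rounding at exact thresholds.
    return seed_length * (100 - min_homology_percent) // 100


def percent_homology(seed, mutant):
    """Percent identity between two equal-length sequences."""
    matches = sum(a == b for a, b in zip(seed, mutant))
    return 100.0 * matches / len(seed)


print(max_mutations(5, 80))   # five-base seed, 80% or greater
print(max_mutations(5, 95))   # same seed, 95% or greater
print(percent_homology("AAAAA", "AAAAG"))
```

With a five-base seed, an 80% threshold permits one changed position, whereas a 95% threshold permits none, which is why a stricter request reduces the number of mutants that need to be generated.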
  • the number of mutant sequences which are generated, particularly from long seed sequences, may be quite large and potentially cumbersome in terms of demands on computer hardware, etc.
  • the computer program may desirably employ further constraints for reducing the cumbersome nature of the data generated, i.e., the number of mutant sequences generated, while ensuring that those mutant sequences of most interest are generated.
  • constraints include some or all of the following: providing a minimum space value (SV1) of greater than 1 between bases to be mutated in the flanking polynucleotide sequence; providing a ratio of bases G to C (GC1) and/or ratio of bases A to T (AT1) which must be maintained in the mutant sequences; and specifying a region of said flanking sequence which is not to be mutated in generating the mutant sequence(s).
  • the constraint of providing a ratio of bases G to C (GC1) and/or ratio of bases A to T (AT1) which must be maintained in the mutant sequences may be additionally desirable in certain cases, as it has been found that, e.g., a higher GC percentage composition in the polynucleotide sequence confers lower reactivity on the sequence and as such, reactivity may be controlled by employing such a constraint in the computer program.
  • the constraint of specifying a region of flanking sequences (e.g., from 5 to 20 nucleotides in length) immediately adjacent to the binding site may have a great effect on binding (positive or negative); as such, maintaining those regions more or less constant in the mutant sequence(s) would reduce the number of mutant sequences generated and offer the additional advantage of not adversely disturbing the properties of this region on binding.
  • FIG. 4 depicts a block diagram of a computer system A10 that is suitable for practising this aspect of the present invention.
  • the computer system A10 contains a central processing unit (CPU) A12 for executing computer instructions.
  • the computer system A10 also includes a display device A14, such as a video display, and a printer A16 for producing printed output.
  • the computer system A10 may include one or more input devices A18, such as a mouse, a keyboard or a microphone.
  • the computer system A10 includes a primary storage A20 and a secondary storage A22.
  • the primary storage A20 may be implemented using random access memory (RAM) or using other types of appropriate storage devices.
  • the secondary storage A22 may be implemented as a magnetic hard disk drive or other secondary storage device.
  • the secondary storage A22 may facilitate the use of removable computer readable media such as CD-ROMs.
  • the sequence generating facility A28 is stored during execution within the primary storage A20 and executed on the CPU A12.
  • the sequence generating facility A28 may be implemented in computer instructions that constitute one or more programs, libraries or modules. Those skilled in the art will appreciate that the sequence generating facility A28 may be written in a number of different computer languages and may take many different formats.
  • the secondary storage A22 may hold data files A30, that may include mutant DNA sequences that have been generated by the sequence generating facility A28 or other data.
  • the computer system A10 may also include the resources for communicating with other remote computing resources.
  • the computer system A10 may include a network adapter A24 for connecting a computer system to a computer network. This computer network may be any of a number of different well-known local area networks (LANs) or wide area networks (WANs).
  • a modem A26 may be provided to facilitate modem communications over cable connections, wireless connections, or traditional analog telephone lines.
  • the sequence generating facility A28 is largely divisible into two major components: a main program and a Mutate() function.
  • the following pseudo-code identifies the functionality performed by these respective components.
  • Mutate() parameters:
      mute_count: integer representing the number of simultaneous mutations at this point in the calling structure
      mutenext: next sequence position to mutate
  • Figure 5 depicts a flow chart listing the steps that are performed in the main body for code for the sequence generating facility A28.
  • this flow chart is intended to be merely illustrative and not limiting of the present invention.
  • the functionality realized for the sequence generating facility A28 may also be realized by performing other sequences of computing steps.
  • the sequence generating facility A28 operates to output the seed sequence and mutant DNA polynucleotide sequences produced with the constraints that have been selected.
  • the sequence generating facility A28 identifies which base positions are to be mutated and sequentially mutates those positions in a predefined order. This process continues until all mutant DNA polynucleotide sequences that fulfill the constraints have been generated for the seed sequence.
  • the seed sequence is obtained (step B10 in Figure 5, see also Tables 1-4).
  • the seed sequence and its complement may be entered interactively by the user via an input device A18 or may be read from the data file A30 that is stored in secondary storage A22.
  • the seed sequence may be stored as an array of characters. As was mentioned above, the number of elements in the DNA polynucleotide sequence may vary.
  • the seed sequence is output (step B12 in Figure 5). This may entail sending the seed sequence to a printer A16. Each base is represented by its corresponding representative character A, G, T or C, and the sequences are output as strings of characters selected from the DNA alphabet of A, G, T and C.
  • the main program then begins the process of generating the mutant DNA polynucleotide sequences.
  • a pointer designated as "mutenext" is used to identify the next base position that is to be mutated in the DNA polynucleotide sequence. Initially, "mutenext" is designated to point to the first base position within the sequence (step B14 in Figure 5).
  • the main program proceeds to enter a loop (see B16 in Figure 5) that performs the brunt of the work for generating the mutant DNA polynucleotide sequences from the seed sequence. The loop continues execution until "mutenext" is at the last mutable position within the seed DNA sequence (in the simplest case, this is the last element in the seed sequence).
  • In step B18 in Figure 5, the base position pointed to by "mutenext" is mutated.
  • This entails replacing the base that is currently at the base position pointed to by "mutenext" with the next sequential base from a base table.
  • the bases are assigned from a base table where the representations of the bases are stored in predefined sequence of A, G, C, T.
  • the base position currently has a value of "A”
  • alternative base encodings may be used.
  • the bases may be categorized into pyrimidines and purines rather than into separate bases.
  • the sequence generating facility A28 is capable of mutating more than one base position simultaneously.
  • the sequence generating facility checks whether there is more than one position to be mutated simultaneously. For illustrative purposes, it is assumed that a maximum of two positions may be simultaneously mutated for the flow chart of Figure 5.
  • the variable "mute_count” holds a value that identifies the number of base positions that are to be mutated simultaneously. Since there are only two options in the examples shown in Figure 5, it is assumed that "mute_count” equals two in the instance where more than one position is to be mutated simultaneously (see step B22 in Figure 5).
  • the value of "mutenext" is then updated to point to the next position to be mutated.
  • the next position is the specified minimum distance (i.e., the value of the variable "min_dist") from the old value of "mutenext" (step B24 in Figure 5).
  • the program then calls the Mutate() function to perform the required mutation while holding constant the value of the base position that was mutated in step B18 (step B26 in Figure 5).
  • the "mute_count" value and the "mutenext" value are passed as parameters to the Mutate() function, which will be described in more detail below. If there is only one position to be mutated simultaneously (as checked in step
  • the main program prints the mutated polynucleotide sequences that are produced by mutating the base position mutated in step B18 (step B28 in Figure 5). After step B28 and after step B26, "mutenext" is incremented to point to the next mutable position. The end of the loop has then been reached (step B32), and the process is repeated beginning at step B16. The process is repeated for each of the mutable positions until all of the appropriate mutant DNA polynucleotide sequences have been generated.
  • FIG. 6 shows a flow chart of the steps that are performed by the Mutate () function.
  • the variable "mutation_limit” identifies the maximum number of mutable positions that may be simultaneously mutated.
  • the Mutate() function checks whether the "mute_count" parameter is equal to the mutation limit. In other words, a check is made whether the number of positions being simultaneously mutated is the maximum. If the number of positions being mutated equals the mutation limit, each base position is mutated from the base position currently pointed to by "mutenext" to the end of the sequence (step C12 in Figure 6), and each resulting mutated DNA polynucleotide sequence is output (step C14 in Figure 6).
  • the position currently pointed to by "mutenext" is mutated (step C16 in Figure 6).
  • "mutenext" is then incremented, and "mute_count" is incremented (step C20 in Figure 6).
  • the Mutate() function is recursively called, passing "mute_count" and "mutenext" as parameters (step C22 in Figure 6).
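The division of labour between the main program and the recursive Mutate() function can be sketched in Python. This is a reconstruction from the prose description, not the patent's own pseudo-code (which is not reproduced here): the names `mutenext`, `mute_count`, `mutation_limit` and `min_dist` follow the text, while `alternatives`, `generate_mutants`, the seed "AAAA", and the choice to output only sequences carrying exactly `mutation_limit` mutations are assumptions.

```python
BASE_TABLE = "AGCT"  # predefined base order given in the text


def alternatives(base):
    """Walk the base table from the entry after `base` until it comes round again."""
    start = BASE_TABLE.index(base)
    return [BASE_TABLE[(start + step) % len(BASE_TABLE)]
            for step in range(1, len(BASE_TABLE))]


def mutate(seq, mutenext, mute_count, mutation_limit, min_dist, out):
    """Mutate positions from `mutenext` onward; recurse until the limit is reached."""
    for pos in range(mutenext, len(seq)):
        for base in alternatives(seq[pos]):
            mutant = seq[:pos] + base + seq[pos + 1:]
            if mute_count == mutation_limit:
                out.append(mutant)  # maximum simultaneous mutations reached: output
            else:
                # Hold this position constant and mutate a later one.
                mutate(mutant, pos + min_dist, mute_count + 1,
                       mutation_limit, min_dist, out)


def generate_mutants(seed, mutation_limit=2, min_dist=1):
    """Output the seed first, then every mutant produced under the constraints."""
    out = [seed]
    mutate(seed, 0, 1, mutation_limit, min_dist, out)
    return out


for sequence in generate_mutants("AAAA")[:4]:
    print(sequence)
```

With two simultaneous mutations, the first position is changed from A to G and held while the second position cycles through G, C and T, which mirrors the order of mutation described in the text.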
  • An example is helpful to illustrate operation of the sequence generating facility. Suppose that the seed sequence is like sequence D10 shown in Figure 7. This seed sequence D10 is printed as an initial step of the mutant DNA sequence generation process. Further suppose that two base positions may be simultaneously mutated and that there is no minimum distance requirement between the base positions that are mutated.
  • After printing the seed sequence, the sequence generating facility initializes the "mutenext" pointer to point to the first base position within the seed sequence D10. Mutations are performed with the assistance of a base table D20.
  • the base table D20 is simply a table that holds the predefined sequences of bases. As was mentioned above, this predefined sequence in the illustrative case is A, G, C, T.
  • a "started” pointer identifies the base value with which the process begins. In particular, "started” points to the initial value of the base for the base position that is to be mutated.
  • the next base pointer points to the next base to be used in mutating the base position.
  • the first base pointer points to the first base in the base table and last base pointer points to the last base in the base table.
  • the mutation begins by initially mutating the base position pointed to by "mutenext”.
  • the next base value is assigned to the base position pointed to by "mutenext".
  • the first base in the seed sequence D10 is changed from A to G.
  • "mutenext" is incremented by one to point to the second base position within the sequence and this second position is also mutated to have the next base value of G.
  • the resulting mutant DNA polynucleotide sequence D12 is then output.
  • the mutation of the second base position within the polynucleotide sequence continues while holding the first base position constant.
  • "mutenext" continues to point to the second base position but the next base pointer is incremented to point to the next base within the base table D20.
  • the second base position is mutated to have a value of C.
  • the resulting mutant polynucleotide sequence D14 is output.
  • the second base position is then further mutated to have the next base value of T.
  • the resulting mutant DNA polynucleotide sequence D16 is output.
  • the second base position has been fully mutated.
  • the started pointer and the next base pointer both point to the same position within the base table D20.
  • FIG. 8 shows an example of the output mutant DNA sequences that are generated using a seed sequence of AAAAAA and mutating two of the base positions simultaneously.
  • Figure 9 shows a mutation of the same seed sequence when two base positions are simultaneously mutated but the minimum distance between the positions being mutated is two. As can be seen, only 91 mutant DNA polynucleotide sequences are produced in Figure 9 whereas 136 mutant DNA polynucleotide sequences are produced in Figure 8.
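The two sequence counts quoted for Figures 8 and 9 can be cross-checked by direct enumeration. The sketch below assumes that each figure lists the seed plus every sequence carrying exactly two substitutions, and that "minimum distance" is the difference between the indices of the two mutated positions; both are inferences from the numbers, not statements in the text. The function name `count_outputs` is illustrative.

```python
from itertools import combinations, product


def count_outputs(seed, n_mutations, min_dist):
    """Count the seed plus all mutants with `n_mutations` substitutions at least `min_dist` apart."""
    mutants = set()
    for positions in combinations(range(len(seed)), n_mutations):
        # Skip position sets in which neighbouring mutated sites are too close.
        if any(b - a < min_dist for a, b in zip(positions, positions[1:])):
            continue
        choices = [[x for x in "AGCT" if x != seed[p]] for p in positions]
        for replacement in product(*choices):
            bases = list(seed)
            for p, x in zip(positions, replacement):
                bases[p] = x
            mutants.add("".join(bases))
    return 1 + len(mutants)  # the seed itself plus the mutants


print(count_outputs("AAAAAA", 2, 1))
print(count_outputs("AAAAAA", 2, 2))
```

Under these assumptions the enumeration gives 136 sequences without a spacing constraint (15 position pairs times 9 base combinations, plus the seed) and 91 with a minimum distance of two (10 pairs times 9, plus the seed), in agreement with the counts reported above.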
  • a linear polynucleotide DNA sequence construct is created by synthetic means known in the art and consists of (from left to right, 5' to 3') a unique PCR primer site followed by a random insert of 20, 40 or 80 bases created by allowing a DNA synthesizer to insert any of the four nucleotide bases A, G, C, and T; a BamHI binding site; and a second 20, 40 or 80 base random insert followed by a second unique PCR primer site.
  • Each primer site includes an EcoRI site.
  • a population of synthetic polynucleotide molecules is generated with different polynucleotide sequences at the random insert sites which, after PCR amplification using oligonucleotides complementary to the primer sites as PCR primers, are transformed into a population of duplex polynucleotide molecules containing a BamHI recognition site flanked by random polynucleotide sequences.
  • Incubation of these duplex polynucleotide sequences with appropriate (empirically determined) quantities of the endonuclease results in a portion of the duplex polynucleotide sequences being bound by BamHI while some of the duplex polynucleotide sequences are not bound.
  • duplex polynucleotide sequences which bind BamHI with a relatively high affinity bind the endonuclease in preference to those duplex polynucleotide sequences which bind BamHI with a relatively low affinity. Since the bound duplex polynucleotide sequences can be separated from the unbound duplex polynucleotide sequences in a gel-shift assay, those duplex polynucleotide sequences bound to the enzyme with higher affinity are represented as "shifted" molecules at relatively lower BamHI concentrations. This assay is depicted in Figure 1 as an inset. Lane 1 shows schematically the migration pattern of the unbound duplex polynucleotide sequence population.
  • Lane 2 represents the migration pattern obtained at relatively low BamHI concentrations. Lanes 3 and 4 show how the migration pattern varies as the concentration of BamHI is increased still further.
  • the populations of bound and unbound sequences are eluted separately from the gel and subsequently cloned into a vector, propagated and isolated. Sequencing of these clones reveals polynucleotide motifs that confer higher or lower affinity for the endonuclease.
  • Polynucleotide flanking sequences surrounding a BamHI binding site (Figure 2) or a Sau3AI binding site (Figure 3) include pure or impure polynucleotide repeat units. Synthesis of a population of duplex polynucleotide sequences having pure dinucleotide repeats, pure trinucleotide repeats, impure dinucleotide repeats, impure trinucleotide repeats, and the like, can be performed according to standard methods, e.g., on a DNA synthesizer.
  • Populations (mixtures) of the duplex polynucleotide sequences can then be incubated with varying quantities of a binding ligand such as BamHI or Sau3AI.
  • the ligand can be, e.g. , a protein which binds to a nucleic acid, e.g., an enzyme, e.g., a restriction enzyme.
  • the ligand can be the endonuclease BamHI; alternatively, as shown in Figure 3, the ligand can be Sau3AI.
  • BamHI binds to the polynucleotide sequence 5'-GGATCC-3';
  • Sau3AI binds to the polynucleotide sequence 5'-GATC-3'; binding sites for other ligands can be employed as is known in the art.
  • the binding site is flanked in both directions by A (to desymmetrize the construct), and then a 40-base long random insert is provided in both the 5' and 3' directions (longer or shorter random polynucleotide sequences can be employed if desired, e.g., to study the effect of remote polynucleotide flanking sequences on binding site reactivity).
  • PCR primer sites are provided to permit amplification of the construct, if desired.
  • Each PCR primer site includes an EcoRI restriction site.
  • The synthesizer is programmed to provide a mixture of each of the four nucleotides A, G, C, and T at each position of the 40-base random sequences; thus, a population of constructs is created as a statistical mixture differing at the random portions of the construct. (It will be appreciated that only a subpopulation of the 4^40 possible random sequences can be obtained due to practical limitations on the amount of DNA synthesized.)
  • The population of constructs is amplified with PCR under standard conditions and purified by polyacrylamide gel electrophoresis (PAGE), followed by elution into buffer including 50 mM NaCl and 50 mM Tris-HCl (pH 8.0).
  • The result is a population of polynucleotide duplexes (the PCR reaction provides the complementary strand). It can be determined through appropriate control experiments that the synthesis and PCR amplification of the duplexes results in correct binding sites for BamHI or Sau3AI.
  • Samples of shifted and unshifted duplexes at each round of band shifting, elution and amplification by PCR, or after the final round of band shifting, elution and amplification by PCR, are digested with 200 units of EcoRI per microgram of duplex polynucleotide.
  • The polynucleotides are cleaved at the EcoRI recognition sites, purified by PAGE, and ligated into Lambda ZAP vector predigested with EcoRI and treated with CIAP at a 1:1 insert:vector ratio in the presence of 2 U of T4 ligase in 5 microliters of T4 ligase buffer at 40°C overnight.
  • The ligated samples are then packaged using Gigapack II Gold packaging extract (Stratagene) and cloned into E. coli XL1-Blue host strain and subjected to blue/white selection.
  • The recombinant (white) clones are selected and eluted in 500 microliters SM buffer (100 mM NaCl, 8 mM MgSO4, 50 mM Tris-HCl, pH 7.5, 0.01% gelatin, 0.04% chloroform).
  • Ten microliters of the eluate is amplified by PCR using T3/T7 primers, purified by Qiagen PCR purification kit and sequenced.
  • methods such as the methods described above can be used to generate compilations of data for the prediction of the reactivity of a potential binding site for a ligand based upon sequences which flank the ligand binding site.
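By way of illustration only, the construct layout described in the items above (primer site, 40-base random insert, A, binding site, A, 40-base random insert, primer site) can be sketched in Python. The primer-site sequences below are hypothetical placeholders invented for the sketch, since the primer sequences are not given here; the EcoRI recognition sequence GAATTC is the commonly cited one and is likewise not spelled out in the text.

```python
import random

BAMHI_SITE = "GGATCC"  # BamHI binding site as given in the text
ECORI_SITE = "GAATTC"  # commonly cited EcoRI site (assumption: not stated in the text)

# Hypothetical primer sites, invented for illustration; each includes an EcoRI site.
LEFT_PRIMER_SITE = "CTAGCATG" + ECORI_SITE
RIGHT_PRIMER_SITE = ECORI_SITE + "CATGCTAG"


def random_insert(length, rng):
    """Return a random sequence with A, C, G, T equally likely at each position."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def make_construct(rng, site=BAMHI_SITE, insert_len=40):
    """Assemble one strand: primer / random / A-site-A / random / primer."""
    return (
        LEFT_PRIMER_SITE
        + random_insert(insert_len, rng)
        + "A" + site + "A"
        + random_insert(insert_len, rng)
        + RIGHT_PRIMER_SITE
    )


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the sketch is repeatable
    for _ in range(3):
        print(make_construct(rng))
```

A random insert can by chance contain a second copy of the binding site or of the EcoRI site; the sketch does not filter such sequences out.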

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Biochemistry (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Plant Pathology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention details compositions of DNA sequence with the ability to confer a given relative binding affinity to any DNA ligand whose binding site is placed adjacent to the DNA sequence composition. Further, this invention features a method of creating and thereby systematically sampling polynucleotides of any given length sequence space for polynucleotide sequences given the nucleotides A, C, G, and T/U. This invention also features a method of grading the sequences created into a range of relative binding affinities from high binding affinity to low binding affinity, with respect to conferring a given relative binding affinity of any DNA ligand to its binding site placed adjacent to a polynucleotide sequence based on its nucleotide composition. In another aspect this invention features an algorithm for grouping homologous polynucleotide sequences into families related solely by their nucleotide composition. This invention also relates to methods for utilizing a member of a family to allow for the prediction of a region of any polynucleotide sequence or protein which is susceptible to mutation, for modulating the mutation rate of any polynucleotide sequence or protein, for altering the relative expression level of any gene, and the like.

Description

COMPOSITIONS OF NUCLEIC ACID WHICH ALTER LIGAND-BINDING CHARACTERISTICS AND RELATED METHODS AND PRODUCTS
Background of the Invention
DNA is known to undergo a wide variety of conformational alterations which are dependent on the conditions in which the DNA is found. Critical to the final structure(s) adopted by DNA are the precise order of bases, the length of the DNA, the overall base content (GC/AT content) and the stability of the sequence to denaturation. While not widely appreciated, some sequences of DNA can convert relatively easily between different conformational states. An example of this type of behavior is the well documented conversion of poly dGC to a Z form (left-handed helix) as opposed to the normal right-handed B form in the presence of a high salt environment (Pohl, FM and Jovin, TM (1972) J. Mol. Biol. 67:375-396). While this conformational flexibility has been exhibited in many other simple repeated sequences, in the vast majority of such cases the final "altered" conformation has not been characterized to the extent of the Z DNA example described above. This suggests that such conformational variability is a common albeit unappreciated feature of DNA. Interestingly, such sequences are also known to be highly reactive to common endonucleases. Underscoring this point are studies which indicate that small (GC)8 DNA segments can influence the overall structure of apparently random sequence DNA segments as large as one thousand bases in length (Kirn et al., (1993) Biopolymers 33:1725-1745). In toto the current understanding of this phenomenon is insufficient to permit useful information to be obtained by simple interpretation of a nucleotide sequence.
It is clear that the ability to study polynucleotide sequences in vitro will significantly advance our understanding of how polynucleotide sequences flanking a DNA ligand binding site can alter the relative binding affinity of a DNA binding agent for its binding site in a DNA sequence of interest; this knowledge can be applied in vivo and in turn allow for efficient and economical identification of such DNA sequences. It is now possible to directly isolate such DNA sequences by the method described elsewhere (U.S. Provisional Application Serial No. 60/068,616). However, the number of polynucleotide sequences required in order to examine all possible sequences of a given length grows exponentially with increasing length. For example, if a polynucleotide 40 base pairs in length were to be synthesized randomly using the four nucleotides A, G, C, and T, the number of DNA molecules containing all possible combinations of the four nucleotides that would be synthesized would be 4^40 (approximately 1.2 x 10^24). Obtaining the relative binding affinities of each individual DNA molecule and determining their sequence would be an enormous undertaking. Thus, an efficient and economical method for creating and thereby systematically sampling polynucleotides of any given length sequence space for polynucleotide sequences given the nucleotides A, C, G, and T while minimizing the number of sequences that would have to be analyzed has long been desired in the art. An efficient and economical method for determining how DNA molecules of a given length are related to each other in terms of their function without undue experimentation also has been sought. DNA sequence information has been and is being accumulated at an ever-increasing rate. Despite this increase in information about extant coding (and non-coding) DNA sequences, it is still unclear how to translate DNA sequence information so as to reveal which segments of DNA in a given string will be more reactive to DNA binding agents in a DNA sequence of interest.
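The exponential growth noted above can be checked with a few lines of Python. This is an illustrative calculation only; the function names are hypothetical, and the conversion to moles uses the Avogadro constant, which is not part of the original text.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def sequence_space_size(length, alphabet_size=4):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return alphabet_size ** length


def moles_for_one_copy_each(length):
    """Moles of DNA needed to hold exactly one molecule of every sequence."""
    return sequence_space_size(length) / AVOGADRO


if __name__ == "__main__":
    for n in (10, 20, 30, 40):
        print(n, sequence_space_size(n), moles_for_one_copy_each(n))
```

For a 40-base random region the count is 4^40 = 2^80, so a single copy of every possible sequence would already amount to roughly two moles of molecules, which is why only a subpopulation can be sampled in practice.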
In addition, it is clear from several studies that the same function for a particular nucleic acid sequence can be attributed to sequences of similar nucleotide compositions. For example, in bacteria strong promoters have sequences that are closely related in their nucleotide composition. Two short sequences, each about six nucleotides in length and located at about -35 and -10 nucleotides from the transcription start site (+1), determine where E. coli RNA polymerase (the enzyme that catalyses the synthesis of an RNA molecule on a DNA template) binds; close relatives of these two hexanucleotide sequences properly spaced from each other specify the promoter for most E. coli genes.
Methods have been developed in the past to alter the ligand binding properties of nucleic acid constructs (U. S. Patent No. 5,306,619; 5,578,444; 5,693,436; 5,716,780; 5,726,014; and 5,738,990). The methods described therein screen for small molecule pharmacological agents which bind to a test sequence flanking an adjacent DNA ligand binding protein site, the cognate site, and in turn alter the binding affinity of the DNA binding protein for its cognate site. In addition the experiments described therein also teach that a ligand binds to its cognate site with "indifference to the nucleotide sequences flanking the screening site" (U. S. Patent No. 5,578,444; 5,693,436; 5,716,780; 5,726,014; and 5,738,990). In other words, the binding affinity of a ligand for its binding site is independent of the nucleotide composition of the flanking sequences placed adjacent the ligand binding site. These approaches also require that each sequence be tested individually with each small molecule pharmaceutical to determine the influence of the small molecule on the binding characteristics of a nucleic acid for a particular DNA binding protein. This undue experimentation is costly and inefficient.
Summary of the Invention
In contrast to the disadvantages evident in the prior art, the invention described herein relates in one aspect to altering the ligand-binding characteristics of a nucleic acid sequence of any given length without the use of small molecule pharmaceuticals, thus providing for the determination of the ligand binding characteristics of a particular nucleic acid sequence and all its related members. Accordingly, the binding affinity of a ligand for its ligand binding site may be modulated solely by the nucleotide composition of the polynucleotide sequence(s) flanking an adjacent ligand binding site. A method for searching through sequence space for duplex polynucleotide sequences of a defined length, e.g., 20, 30 or 40 base pairs in length that allow sampling of any given length of sequence space given the four nucleotides A, C, G, and T is disclosed herein. Another embodiment features a method of ranking the relative reactivity of the duplex polynucleotide sequences when flanking a ligand binding site. Yet another embodiment employs a method which is advantageously computer-implemented for grouping duplex polynucleotide sequences of a defined length, e.g., 20, 30 or 40 base pairs in length into families within which each member of a family is related to another member within the same family by virtue of its similar relative binding affinity imparted to a flanking ligand binding site. 
An advantageous embodiment in accordance with the present disclosure includes an isolated polynucleotide sequence which is a member of a mutant sequence family, which family can be described by application of the following steps (a)-(d) below to a seed sequence of interest: (a) providing a minimum number of base positions to be mutated simultaneously (MS1); (b) providing a maximum number of base positions to be mutated (MX1); (c) reading the seed sequence; and (d) executing a computer program which, using the values of MS1 and MX1, generates the mutant sequences comprised in the mutant sequence family from the seed sequence. Another advantageous embodiment includes the mutant sequence family itself as described above. In a preferred embodiment, the seed sequence is a flanking sequence conferring a relative binding affinity to an adjacent polynucleotide binding site for a given ligand, which relative binding affinity can be described by application of methods provided herein below for isolating and ranking such flanking sequences. The isolated polynucleotide sequence may desirably be, e.g., at least 20, 30, 40 or 50 base pairs long. The isolated polynucleotide sequence may also desirably be, e.g., at least 30%, 40%, 60% or 80% homologous to the seed sequence; and may also desirably have a relative binding constant within ±5% of that of the seed sequence.
The computer program described in the previous paragraph for generating the mutant sequences comprised in the mutant sequence family from the seed sequence may desirably employ further constraints for reducing the cumbersome nature of the data generated, i.e., the number of mutant sequences generated, while ensuring that those mutant sequences of most interest are generated.
Such constraints include some or all of the following: providing a minimum space value (SV1) of greater than 1 between bases to be mutated in the flanking polynucleotide sequence; providing a ratio of bases G to C (GC1) and/or ratio of bases A to T (AT1) which must be maintained in the mutant sequences; and specifying a region of said flanking sequence which is not to be mutated in generating the mutant sequence(s).
Notably the present invention further relates to chromosomes having inserted therein the isolated polynucleotide sequences disclosed herein, and in particular those in which the isolated polynucleotide sequence is operationally coupled to the polynucleotide binding site, e.g., it is coupled to the binding site such that the isolated polynucleotide sequence has the anticipated effect on binding. Other genomic material such as DNA constructs, expression vectors, and animal genomes, having inserted therein the isolated polynucleotide sequences disclosed herein, and in particular those in which the isolated polynucleotide sequence is operationally coupled to the polynucleotide binding site are also included. Furthermore, the present invention relates to living cells comprising genomic material such as DNA constructs, expression vectors, and animal genomes, having inserted therein the isolated polynucleotide sequences disclosed herein, and in particular those in which the isolated polynucleotide sequence is operationally coupled to the polynucleotide binding site.
In an embodiment, the majority of duplex polynucleotide sequences of any specified length are created that allow sampling of any given length sequence space utilizing the four nucleotides A, C, G, and T. The sequences may be randomly synthesized but can also include any predetermined combination of the nucleotides A, G, C, or T.
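By way of illustration only, one possible reading of steps (a)-(d) and of the spacing and protected-region constraints described above can be sketched in Python. This is not the program of Figures 5 and 6; the function name and parameter names are hypothetical, the spacing rule is interpreted here as a minimum difference between mutated base indices, and the GC1/AT1 ratio constraints are not implemented.

```python
from itertools import combinations, product

BASES = "ACGT"


def mutant_family(seed, ms1, mx1, sv1=1, protected=()):
    """Yield every mutant of `seed` with between ms1 and mx1 positions changed.

    ms1: minimum number of positions mutated simultaneously
    mx1: maximum number of positions mutated
    sv1: minimum index distance between any two mutated positions (assumed reading)
    protected: indices of the seed that must not be mutated
    """
    blocked = set(protected)
    allowed = [i for i in range(len(seed)) if i not in blocked]
    for k in range(ms1, mx1 + 1):
        for positions in combinations(allowed, k):
            # positions are in ascending order, so checking neighbours suffices
            if any(b - a < sv1 for a, b in zip(positions, positions[1:])):
                continue
            # at each chosen position, substitute each of the three other bases
            choices = [[b for b in BASES if b != seed[p]] for p in positions]
            for substitution in product(*choices):
                mutant = list(seed)
                for p, base in zip(positions, substitution):
                    mutant[p] = base
                yield "".join(mutant)


if __name__ == "__main__":
    for m in mutant_family("ACGT", ms1=1, mx1=1):
        print(m)
```

The family grows quickly with the seed length and with MX1, which is the practical reason for constraints such as SV1 and a protected region.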
In another embodiment, the majority of duplex polynucleotide sequences of any specified length are created that allow sampling of any given length sequence space utilizing the four nucleotides A, C, G, and T, wherein the sequences are further modified chemically, for example by a methyl transferase. The sequences may be randomly synthesized but can also include any predetermined combination of the nucleotides A, G, C, or T.
In yet another embodiment a method is provided for isolating and ranking duplex polynucleotide sequences which confer a given relative binding affinity towards a particular duplex polynucleotide ligand comprising a plurality of duplex polynucleotide molecules wherein each of the duplex polynucleotide molecules comprises a predetermined ligand binding site, the binding site being flanked by a randomly synthesized or other duplex polynucleotide sequence; exposing the duplex polynucleotide molecules to a ligand selective for the binding site; isolating ligand bound from ligand unbound duplex polynucleotide molecules; amplifying the duplex polynucleotide molecules; sequencing each of the duplex polynucleotide molecules to determine the sequence identity of the duplex polynucleotide sequence flanking the ligand binding site.
In a particular embodiment, the restriction endonuclease BamHI is added to a population of duplex polynucleotide molecules wherein the duplex polynucleotide molecules comprise a BamHI binding site (5'-GGATCC-3') flanked on either side or adjacent to a stretch of nucleotides under conditions which allow BamHI to contact its binding site but not cleave it. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to BamHI. The bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
In yet another particular embodiment, the restriction endonucleases MspI or HpaII are added to a population of duplex polynucleotide molecules wherein the duplex polynucleotide molecules comprise a methylated (5'-CmCGG-3') or unmethylated (5'-CCGG-3') MspI and HpaII binding site flanked on either side or adjacent to a stretch of polynucleotides under conditions which allow MspI or HpaII to contact its binding site but not cleave it. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to MspI or HpaII. The bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
In another embodiment, the enzyme HpaII-methylase is added to a population of nucleic acid molecules which are ligated into a synthetic DNA construct directly flanking or adjacent to a polynucleotide duplex ligand binding site comprising the sequence 5'-CCGG-3'. The restriction endonucleases MspI or HpaII are subsequently added under conditions which allow MspI or HpaII to contact its binding site but not cleave it. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to MspI or HpaII.
The bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art. In another embodiment, a subpopulation of nucleic acid molecules are ranked based on their ability to confer relatively high or low binding affinity of a particular ligand towards its ligand binding site when flanked by duplex polynucleotide sequences.
In another embodiment, methods are provided for determining the ability of a pre-selected duplex polynucleotide sequence to influence the relative binding affinity of a duplex polynucleotide sequence for its ligand which is remote from the pre-selected duplex polynucleotide sequence. Remote sequences may include enhancers or proximal- promoter sequences.
In still another embodiment, methods are provided for increasing or decreasing the mutation rate of a pre-selected duplex polynucleotide sequence, e.g., by placing heterologous duplex polynucleotide sequence(s) flanking the pre-selected duplex polynucleotide sequence, wherein the heterologous duplex polynucleotide sequence flanking the pre-selected duplex polynucleotide sequence is known to influence the binding affinity of a mutagen to the pre-selected duplex polynucleotide sequence. In still another embodiment, methods are provided for predicting mutation-prone regions of a duplex polynucleotide sequence by virtue of homology to duplex polynucleotide sequences with known flanking effect on relative binding affinity.
In another particular embodiment, a ligand is added to a population of polynucleotide molecules wherein the polynucleotide molecules comprise a double stranded polydeoxyribonucleic acid (DNA) ligand binding site flanked on either side or adjacent to a single stranded or double stranded polyribonucleic acid (RNA) sequence under conditions which allow the ligand to contact its binding site. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to the ligand. The bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
In another particular embodiment, a ligand is added to a population of polynucleotide molecules wherein the polynucleotide molecules comprise a single or double stranded polyribonucleic acid (RNA) ligand binding site flanked on either side or adjacent to a duplex polydeoxyribonucleic acid (DNA) sequence under conditions which allow the ligand to contact its binding site. Under these conditions only a fraction of the population of duplex polynucleotide molecules will bind to the ligand. The bound subpopulation of duplex polynucleotide molecules can be separated from the unbound subpopulation of duplex polynucleotide molecules by methods well known to those skilled in the art.
Brief Description of the Drawings
Figure 1 shows a schematic representation of the method of ranking subsets of any duplex polynucleotide sequence flanking a ligand binding site based on the ability of the subsets of duplex polynucleotide flanking sequences to confer relative binding affinity for a ligand to its binding site. In the example shown, BamHI is the ligand.
Figure 2 shows an autoradiograph of a BamHI band shift assay for the selection by "relative binding affinity" of flanking DNA sequence sub-populations from a synthetically generated random population which confer altered BamHI binding affinity for an adjacent BamHI binding site. All duplex polynucleotide sequences are 32P end-labeled.
Figure 3 shows an autoradiograph of a Sau3AI band shift assay of a random or repeated duplex polynucleotide sequence flanking a Sau3AI binding site. All duplex polynucleotide sequences are 32P end-labeled.
Figure 4 depicts a block diagram of a computer system that is suitable for practicing an exemplary embodiment of the present invention.
Figure 5 is a flow chart that illustrates steps that are performed by the main program of the sequence generating facility.
Figure 6 is a flow chart that depicts the steps that are performed by the Mutate() function.
Figure 7 depicts an example of the mutant DNA sequence generation performed by the exemplary embodiment of the present invention.
Figure 8 illustrates a first example of an output that is generated by the sequence generating facility.
Figure 9 illustrates a second example of an output generated by the sequence generating facility.
Definition of Terms
As used herein, the following terms and phrases shall have the meanings set forth below:
A "polynucleotide" or "polynucleotide sequence" shall mean multiple nucleotides (i.e., molecules comprising a sugar (e.g., ribose or deoxyribose) linked to a phosphate group and to an exchangeable organic base, which is either a substituted pyrimidine (e.g., cytosine (C), thymidine (T) or uracil (U)) or a substituted purine (e.g., adenine (A) or guanine (G))). The term "polynucleotide" or "polynucleotide sequence" as used herein refers to both polyribonucleotides and polydeoxyribonucleotides.
Polynucleotides can be obtained from existing nucleic acid sources (e.g., genomic DNA or cDNA), but can also be synthetic (e.g., produced by oligonucleotide synthesis) DNA or RNA.
A "ligand" shall mean any chemical moiety selected from the group consisting of: a compound which binds to a duplex polynucleotide in a sequence-specific way; a compound which binds to a duplex polynucleotide sequence in a non-specific way; a protein; an enzyme; an enzyme which alters the structure of a duplex polynucleotide sequence to which it binds; an enzyme which alters the structure of a duplex polynucleotide sequence to which it binds by breaking or forming a covalent or non-covalent bond between an atom of the nucleic acid and another atom; an enzyme which cleaves one or both strands of a duplex polynucleotide sequence to which it binds; a restriction enzyme; a restriction endonuclease; an enzyme which methylates a duplex polynucleotide sequence to which it binds; an enzyme which alkylates a duplex polynucleotide sequence to which it binds; a nucleic acid ligase such as DNA ligase; an enzyme which promotes or catalyzes the synthesis of nucleic acid; a nucleic acid polymerase; a nucleic acid polymerase which requires a double stranded primer; a DNA polymerase; DNA polymerase I; Taq polymerase; an RNA polymerase; an enzyme which alters the primary or secondary structure of a duplex polynucleotide sequence to which it binds; a topoisomerase; an enzyme which promotes or inhibits recombination; a DNA binding agent; a mutagen; a compound which enhances the expression of a gene under the control of the duplex polynucleotide sequence bound by a ligand; a compound which intercalates into a duplex polynucleotide molecule; a compound which, when contacted with a reaction mixture comprising a first single stranded polynucleotide molecule and a second single stranded polynucleotide molecule will increase the free energy of duplex formation at least n-fold, wherein n is 2, 5, 10, 50, 100, 500, 10^3, 10^4, 10^5, 10^6; a compound which, when contacted with a reaction mixture will decrease the free energy of duplex formation by at least n-fold, wherein n is 2, 5, 10, 50, 100, 500, 10^3, 10^4, 10^5, 10^6.
A "ligand binding site" or "binding site" shall mean any domain or subdomain in a duplex polynucleotide molecule which directly contacts a ligand by hydrogen bonding, van der Waals radius interactions and/or electron cloud interaction with the bases of a nucleic acid molecule or indirectly via a salt or water molecule.
A "flanking sequence" is a polynucleotide sequence located adjacent to a ligand binding site of a polynucleotide molecule. A flanking sequence or flanking sequences may be 5', 3' or 5' and 3' of the ligand binding site.
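By way of illustration only, the definition above can be made concrete with a short Python sketch that locates a ligand binding site within a single strand and returns the sequences lying immediately 5' and 3' of it. The function name and the fixed flank length are hypothetical choices for the sketch; only the first occurrence of the site is considered.

```python
def flanking_sequences(sequence, site, flank_len):
    """Return (5' flank, 3' flank) around the first occurrence of `site`.

    Each flank is at most `flank_len` bases long. Returns None if the
    site does not occur in the sequence.
    """
    start = sequence.find(site)
    if start == -1:
        return None
    end = start + len(site)
    five_prime = sequence[max(0, start - flank_len):start]
    three_prime = sequence[end:end + flank_len]
    return five_prime, three_prime


if __name__ == "__main__":
    # Toy strand containing the BamHI site 5'-GGATCC-3'
    print(flanking_sequences("AAAGGATCCTTT", "GGATCC", 2))
```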
A "remote sequence" can be any regulatory polynucleotide sequence which is located at a great distance either 5' or 3' from the ligand binding site or from the flanking polynucleotide sequence adjacent the ligand binding site. Remote sequences may be placed in either orientation, i.e., 5' → 3' or 3' → 5' relative to the ligand binding site or the flanking polynucleotide sequence adjacent the ligand binding site. Examples of remote sequences are enhancers or proximal-promoter sequences.
In particular methods of this invention, "relative binding affinities" of nucleotide flanking sequences are measured for different nucleic acid molecules in which different flanking sequences are located adjacent the same ligand binding site. In the situation in which a first molecule having a first flanking sequence (or pair of flanking sequences) is found to bind to a ligand in preference to a second molecule having a second flanking sequence (or pair of flanking sequences), the first molecule is said to have a binding affinity that is "relatively higher" than the binding affinity of the second molecule. An endonuclease is said to be "substantially free of cleavage activity" when under the given conditions, there is substantially no observable cleavage.
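By way of illustration only, such a relative ranking can be sketched in Python under one simplifying assumption: each flanking sequence is summarized by the lowest ligand concentration at which a shifted band is observed for it, so that a lower value corresponds to a relatively higher binding affinity. The function names are hypothetical, and the numbers in the usage example are placeholders, not data from this disclosure.

```python
def rank_by_relative_affinity(first_shift_conc):
    """Order flanking sequences from relatively highest to lowest affinity.

    first_shift_conc maps a flanking sequence to the lowest ligand
    concentration (arbitrary units) at which it appears in the shifted band.
    """
    return sorted(first_shift_conc, key=first_shift_conc.get)


def relatively_higher(first, second, first_shift_conc):
    """True if `first` is shifted at a lower ligand concentration than `second`."""
    return first_shift_conc[first] < first_shift_conc[second]


if __name__ == "__main__":
    # Placeholder values for illustration only
    observed = {"flank_1": 3.0, "flank_2": 1.0, "flank_3": 2.0}
    print(rank_by_relative_affinity(observed))
```

The ranking is relative only: it orders molecules sharing the same binding site against one another and does not yield absolute binding constants.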
A "pure repeat" is a repeating DNA sequence for which all base positions are defined. For example, if the nucleotides in a pure dinucleotide repeat are A and G, then a pure dinucleotide repeat is (AG)n where n is the number of times (AG) is repeated. Similarly, if the nucleotides in a trinucleotide repeat are A, G, and C, then a pure trinucleotide repeat is (AGC)n where n is the number of times (AGC) is repeated. The definition is not intended to be limiting to dinucleotide or trinucleotide repeats and can be extended to tetranucleotide repeats, pentanucleotide repeats, and higher repeating units.
An "impure repeat" is a repeating DNA sequence for which one or more base positions allows for the insertion of a random nucleotide. For example, if one of the nucleotides in an impure dinucleotide repeat is A and the random nucleotide is X where X is either A, C, G, or T, then an impure dinucleotide repeat is 5'-(AX)n-3' where n is the number of times (AX) is repeated. Similarly, if two of the nucleotides in a trinucleotide repeat are defined as A and G and the third random nucleotide is X where X is either A, C, G, or T, then an impure trinucleotide repeat is 5'-(AGX)n-3' where n is the number of times (AGX) is repeated. The definition is not intended to be limiting to dinucleotide or trinucleotide repeats and can be extended to tetranucleotide repeats, pentanucleotide repeats, and higher repeating units.
A "family" of DNA sequences is a group of DNA sequences that are related by virtue of their conforming to the rules defined herein for governing the ability of the polynucleotide sequence to influence binding of a ligand to a ligand binding site located adjacent to the DNA sequence.
A "frame" of a DNA repeat refers to the minimum successively repeating DNA sequence or motif in a DNA sequence. For example, one frame in the sequence 5'-GCGCGC-3' would be GC. In the sequence 5'-GCTGCTGCT-3', one frame would be GCT.
An "initial frame" of a DNA repeat refers to the minimum successively repeating DNA sequence or motif in a DNA sequence beginning with the 5' most nucleotide of the repeating DNA sequence or motif in a DNA sequence. Thus, the initial frame in the sequence 5'-GCGCGC-3' would be GC. Likewise, the initial frame in the sequence 5'-GCTGCTGCT-3' would be GCT.
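By way of illustration only, the pure and impure repeat definitions given above can be expressed in Python. The function names are hypothetical, and the sketch assumes that each random position X is filled independently in every copy of the repeat unit, as would be expected when a synthesizer delivers a mixture of the four nucleotides at that position.

```python
from itertools import product


def pure_repeat(unit, n):
    """Return the pure repeat (unit)n, e.g. unit "AG", n = 3 gives AGAGAG."""
    return unit * n


def impure_repeats(unit, n, wildcard="X"):
    """Yield every sequence matching (unit)n, with each X replaced by A, C, G or T.

    Assumption: every X is filled independently of the others.
    """
    template = unit * n
    slots = [i for i, ch in enumerate(template) if ch == wildcard]
    for fill in product("ACGT", repeat=len(slots)):
        seq = list(template)
        for i, base in zip(slots, fill):
            seq[i] = base
        yield "".join(seq)


if __name__ == "__main__":
    print(pure_repeat("AGC", 3))
    print(list(impure_repeats("AX", 2)))
```

An impure repeat with k random positions in total corresponds to 4^k distinct sequences, so (AX)n describes 4^n sequences.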
A "shifted frame" or "frame shifting" is any of the unique, non-initial frames in a repeating DNA sequence or motif. For the sequence 5'-GCGCGC-3' containing the initial frame GC, a shifted frame is CG. For the sequence 5'-GCTGCTGCT-3' containing the initial frame GCT, a shifted frame is CTG.
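By way of illustration only, the frame definitions above can be computed with a short Python sketch. The function names are hypothetical; the initial frame is taken to be the shortest prefix whose repetition reproduces the whole sequence, and the shifted frames are the other rotations of that prefix.

```python
def initial_frame(seq):
    """Return the shortest unit which, repeated, gives exactly `seq`.

    For a sequence that is not a repeat, the whole sequence is returned.
    """
    n = len(seq)
    for k in range(1, n + 1):
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k]
    return seq  # only reached for the empty string


def shifted_frames(seq):
    """Return the unique non-initial frames, in 5' to 3' order of their start."""
    frame = initial_frame(seq)
    frames = []
    for i in range(1, len(frame)):
        rotated = frame[i:] + frame[:i]
        if rotated != frame and rotated not in frames:
            frames.append(rotated)
    return frames


if __name__ == "__main__":
    for s in ("GCGCGC", "GCTGCTGCT"):
        print(s, initial_frame(s), shifted_frames(s))
```

For the two sequences used in the definitions, the sketch gives the initial frames GC and GCT, with shifted frames CG for the first and CTG and TGC for the second.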
A "regulatory sequence" or "polynucleotide regulatory sequence" is a polynucleotide sequence which when contacted by a ligand regulates the transcription of a biologically functional gene associated with the regulatory sequence.
A "promoter", "polynucleotide promoter sequence" or "promoter sequence" is any DNA sequence which transcription factor(s) and/or RNA polymerase contacts. The promoter determines the polarity of the transcript by specifying which strand will be transcribed. Promoters can be classified according to their "strength"; that is, the relative frequency of transcription initiation (times per minute) at each promoter. Thus, RNA polymerase initiates transcription at a high frequency at strong promoters and at low frequency at weak promoters.
An "enhancer" is any regulatory DNA sequence which a protein or proteins contact, influencing the rate of transcription of a biologically functional gene associated with the enhancer. Contact of the enhancer by the protein or proteins may either stimulate or decrease the rate of transcription of the associated gene. Enhancers may be located at a great distance either 5' or 3' from the transcription start site of the biologically functional gene they control. Enhancers may also regulate transcription of their associated genes when placed in either orientation, i.e., 5' → 3' or 3' → 5' relative to the gene whose transcription they control.
A "proximal-promoter sequence" or "proximal-promoter element" is any regulatory sequence that is located close to (within 200 base pairs of) a promoter and binds a protein or proteins thereby modulating the transcription of the biologically functional gene associated with the promoter. The promoter-proximal sequence can occur 5' or 3' of the transcription start site of the biologically functional gene.
An "operator" is a short polynucleotide DNA sequence in a bacterial or viral genome which contacts a protein or proteins and regulates transcription of an associated biologically functional gene.
A "Long Terminal Repeat" or "LTR" is a regulatory polynucleotide sequence of viral origin (DNA tumor viruses or retroviruses) comprising integration signals for integrating into the host genome, an enhancer, a promoter and a polyadenylation site.
"Sequence identity or homology", as used herein, refers to the sequence similarity between two polypeptide molecules or between two nucleic acid molecules. When a position in both of the two compared sequences is occupied by the same base or amino acid monomer subunit, e.g., if a position in each of two DNA molecules is occupied by adenine, then the molecules are homologous or sequence identical at that position. The percent of homology or sequence identity between two sequences is a function of the number of matching or homologous identical positions shared by the two sequences divided by the number of positions compared x 100. For example, if 6 of 10 of the positions in two sequences are the same then the two sequences are 60% homologous or have 60% sequence identity. By way of example, the DNA sequences ATTGCC and TATGGC share 50% homology or sequence identity. Generally, a comparison is made when two sequences are aligned to give maximum homology. Unless otherwise specified, "loop out regions", e.g., those arising from deletions or insertions in one of the sequences, are counted as mismatches.
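The percent identity calculation described above can be sketched as follows. This is a minimal illustration that assumes the two sequences have already been aligned to the same length; the function name `percent_identity` and the use of "-" as the gap symbol for loop out regions are assumptions made for the example.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Positions holding the gap symbol (loop out regions) count as mismatches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a == b and a != gap
    )
    # Matching positions divided by positions compared, times 100.
    return 100.0 * matches / len(seq_a)


print(percent_identity("ATTGCC", "TATGGC"))  # the example from the text
```

Applied to ATTGCC and TATGGC, three of the six positions match, which gives the 50% figure stated in the definition.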
The comparison of sequences and determination of percent homology between two sequences can be accomplished using a mathematical algorithm. Preferably, the alignment can be performed using the Clustal Method. Multiple alignment parameters include Gap Penalty = 10, Gap Length Penalty = 10. For DNA alignments, the pairwise alignment parameters can be Ktuple = 2, Gap Penalty = 5, Window = 4, and Diagonals Saved = 4. For protein alignments, the pairwise alignment parameters can be Ktuple = 1, Gap Penalty = 3, Window = 5, and Diagonals Saved = 5.
An additional non-limiting example of a mathematical algorithm utilized for the comparison of sequences is the algorithm of Karlin and Altschul (1990) Proc. Natl. Acad. Sci. USA 87:2264-68, modified as in Karlin and Altschul (1993) Proc. Natl. Acad. Sci. USA 90:5873-77. Such an algorithm is incorporated into the BLASTN and BLASTX programs (version 2.0) of Altschul et al. (1990) J. Mol. Biol. 215:403-10. BLAST nucleotide searches can be performed with the BLASTN program, score = 100, wordlength = 12 to obtain nucleotide sequences homologous to nucleic acid molecules of the invention. BLAST protein searches can be performed with the BLASTX program, score = 50, wordlength = 3 to obtain amino acid sequences homologous to protein molecules of the invention. To obtain gapped alignments for comparison purposes, Gapped BLAST can be utilized as described in Altschul et al. (1997) Nucleic Acids Research 25(17):3389-3402. When utilizing BLAST and Gapped BLAST programs, the default parameters of the respective programs (e.g., BLASTX and BLASTN) can be used. See http://www.ncbi.nlm.nih.gov. Another preferred non-limiting example of a mathematical algorithm utilized for the comparison of sequences is the algorithm of Myers and Miller, CABIOS (1989). Such an algorithm is incorporated into the ALIGN program (version 2.0) which is part of the GCG sequence alignment software package. When utilizing the ALIGN program for comparing amino acid sequences, a PAM120 weight residue table, a gap length penalty of 12, and a gap penalty of 4 can be used.
A "DNA expression vehicle" is a DNA sequence which can be transcribed to produce RNA to be translated and is constructed in vivo or in vitro by methods known to those skilled in the art. The sequence may or may not be capable of sustaining its own replication.
A "seed sequence" is a duplex polynucleotide sequence which is isolated and the nucleotide composition of the sequence determined physically by methods well known in the art. The sequence may have the ability to alter the relative binding affinity of a ligand to an adjacent flanking ligand binding site compared to the pre-existing polynucleotide sequence flanking the ligand binding site. The seed sequence can be used to generate a family of polynucleotide sequences that are related by virtue of their conforming to the rules defined herein for governing the ability of the polynucleotide sequence to influence binding of a ligand to a ligand binding site located adjacent to the DNA sequence. The seed sequence may also be a flanking sequence.
Detailed Description of the Invention
A. Method for Selecting Flanking Sequences.
In one aspect, the invention relates to a method for selecting duplex polynucleotide flanking sequences, e.g., duplex polynucleotide sequences which flank (in the 3' and/or 5' direction) a selected polynucleotide sequence (such as a ligand binding site) from a population of molecules of identical structure. The method of the invention is useful for determining flanking polynucleotide sequences which provide desired characteristics, such as Tm, ability of the ligand binding site to bind a ligand, ability of a ligand to react with the polynucleotide sequence, stability of the polynucleotide sequence, and the like.
According to this aspect of the invention, the relative binding affinity of a polynucleotide sequence for a ligand can be modulated by providing appropriate flanking polynucleotide sequences. For example, a flanking polynucleotide sequence(s) can be selected from a mixture of family sequences to identify which sequences increase the ability of a ligand to bind to a ligand binding site; decrease the ability of a ligand to bind to a ligand binding site; increase the mutability of the polynucleotide sequence; decrease the mutability of the polynucleotide sequence; and the like.
In one embodiment, the method includes the steps of providing a polynucleotide sequence which includes a ligand binding site and at least one polynucleotide sequence which flanks the ligand binding site (e.g., in the 3' and/or 5' direction); and determining the ability of the ligand to bind to the ligand binding site. In certain embodiments, a plurality (such as a combinatorial library) of polynucleotide sequences, each including the same ligand binding site and different (e.g., randomly differing) flanking polynucleotide sequences, are provided. In this embodiment, the mixture of polynucleotide sequences can be screened against a limiting concentration of the ligand, and polynucleotide sequences which preferentially bind to the ligand (or do not bind to the ligand) can be selected, and (preferably) are then sequenced to determine an appropriate flanking polynucleotide sequence(s).
Thus, for instance, Example 1, infra, describes the preparation of a plurality of polynucleotide DNA sequences; each sequence includes the binding site for the restriction enzyme BamHI, flanked on each end by any polynucleotide sequence of any defined length, e.g., 20, 30 or 40 base pairs in length. The mixture of polynucleotide sequences was titrated with a known concentration of BamHI under conditions where substantially no cleavage takes place, and those polynucleotide sequences which bound most strongly to BamHI were selected (in this example, by gel shift assay and recovery of the shifted bands). Similarly, the poorest-binding polynucleotide sequences were selected. Certain of the selected polynucleotide sequences were then sequenced to determine the flanking polynucleotide sequences which conferred increased or decreased ligand-binding ability on the ligand binding site. Tables 1-4 list the polynucleotide sequences obtained by the method described above. As discussed above, the ability of a polynucleotide sequence to bind to a ligand is believed to be related solely to the nucleotide composition of the flanking polynucleotide sequence. It is further believed that the ability of a polynucleotide sequence to bind to ligands is at least largely independent of the ligand selected. Thus, a flanking polynucleotide sequence which lowers the relative binding affinity of any one ligand to a ligand binding site adjacent the flanking polynucleotide sequence will also lower the relative binding affinity of other ligands to the ligand binding site. Thus, the particular ligand selected for use according to the methods of the invention, to determine the ability of a flanking polynucleotide sequence to affect ligand binding, is a matter of convenience and design choice which will be routine to one of ordinary skill in the art.
It will be appreciated that flanking polynucleotide sequences which confer particular ligand binding attributes upon a neighboring polynucleotide sequence will have many potential uses. For example, flanking polynucleotide sequences can be selected to promote binding to a ligand, such as an RNA or DNA binding protein, a polymerase, a reverse transcriptase, a telomerase, a helicase, a transcription factor, and the like. Thus, the invention provides methods for selecting flanking polynucleotide sequences which can be used in vivo, e.g., to study the interaction of ligands and nucleic acids, or to provide improved probes or primers for PCR amplification, and the like. The flanking sequences of a polynucleotide can also be provided in a nonrandom manner. For example, a flanking polynucleotide sequence can be provided, e.g., by oligonucleotide chemical or biochemical synthesis, to provide a flanking region of any known sequence. This flanking polynucleotide sequence can then be tested to determine the effect on ligand binding. One particularly preferred practice of the invention involves the construction of a plurality of oligonucleotides, each including a ligand binding site flanked by at least one flanking polynucleotide sequence which has a known nucleotide composition. The effect on the relative binding affinity of an adjacent flanking ligand binding site for its ligand can then be assayed, e.g., as described herein. This embodiment of the invention is useful in constructing sequence reactivity data compilations, e.g., a database, which quantifies the effect on stability of any possible flanking sequence (see infra).
The finding that a particular DNA binding ligand can bind to its binding site differentially in the context of the randomly synthesized flanking polynucleotide sequences permits the identification of polynucleotide sequences which have the ability to influence the binding affinity of the DNA binding ligand to its binding site and ultimately affect the functioning of the ligand. For example, the method described above could be used to identify polynucleotide sequences flanking the binding site in DNA of a transcription factor which can adversely affect or promote the binding of that particular transcription factor to its binding site. In particular, if the method described above for BamHI was carried out utilizing a particular transcription factor such as MyoD (MyoD is a transcription factor which plays a role in muscle development) as the ligand and the sequence 5'-CANNTG-3', where N is any one of the nucleotides A, C, T, or G, as the ligand binding site between two randomly synthesized duplex polynucleotide sequences, it may be possible to identify polynucleotide sequences flanking the MyoD binding site in the genome which will either enhance binding of MyoD to its binding site or result in poor binding of MyoD to its binding site. Thus, duplex polynucleotide molecules comprising randomly synthesized polynucleotide sequences flanking either side of the MyoD binding site which can be isolated as bound complexes with MyoD can be said to confer high binding affinity of MyoD to its binding site, and duplex polynucleotide sequences which are isolated unbound with MyoD can be said to confer low binding affinity of MyoD to its binding site. Further, the randomly synthesized polynucleotide sequences can be ranked in order of their ability to influence binding affinity of the ligand to its binding site.
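As a minimal sketch of how a sequenced molecule from such a selection might be split into its binding site and its two flanking sequences, the following locates 5'-CANNTG-3' sites on one strand and returns the neighboring flanks. The function name `sites_with_flanks`, the fixed flank length, and the decision to skip sites that lack a full flank on both sides are assumptions made for illustration; they are not taken from the specification.

```python
import re

# Lookahead so that overlapping CANNTG sites are all reported.
CANNTG = re.compile(r"(?=(CA[ACGT]{2}TG))")


def sites_with_flanks(seq: str, flank_len: int) -> list[tuple[str, str, str]]:
    """Return (5' flank, site, 3' flank) for each CANNTG site with full flanks."""
    seq = seq.upper()
    results = []
    for match in CANNTG.finditer(seq):
        start = match.start(1)
        end = start + 6  # the site is six nucleotides long
        if start >= flank_len and end + flank_len <= len(seq):
            results.append(
                (seq[start - flank_len:start], seq[start:end], seq[end:end + flank_len])
            )
    return results


print(sites_with_flanks("TTTTCAGCTGAAAA", 4))
```

The flanks recovered in this way could then be tabulated against whether the molecule was isolated bound or unbound.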
B. Method for Creating and thereby Systematically Sampling Polynucleotide Sequences of Any Given Length.
I. Creation of Families of Pure Repeat Sequences
The creation of families of pure repeat polynucleotide sequences is based on two characteristics inherent to the nature of duplex DNA and of repeating sequences: 1) when a DNA duplex is acted upon, for example, by a ligand, both strands of the duplex are simultaneously acted upon; and 2) pure polynucleotide repeats of sufficient length contain multiple, overlapping repeat frames. If the polynucleotide repeats are of sufficient length, this means that the motif is repeated a sufficient number of times so that the difference of two repeated units between one frame and another is measurably insignificant. For the creation of families of pure repeat polynucleotide sequences, all possible sequence motifs of the desired repeat length, i.e., the initial frame, are created, for example, by execution of a computer program, by iterating the four possible base substitutions in each base position of the repeating polynucleotide sequence. In the trivial case of a single-base repeat, with a repeat length of 1, there are four possible polynucleotide sequence motifs: (A)n, (C)n, (G)n, and (T)n. In the case of a dinucleotide repeat, with a repeat length of 2, there are 16 possible polynucleotide sequence motifs: (AA)n, (AC)n, (AG)n, (AT)n, (GA)n, (GC)n, (GG)n, (GT)n, (CA)n, (CG)n, (CC)n, (CT)n, (TA)n, (TC)n, (TG)n, (TT)n. Note that the dinucleotide repeat motifs also include the single-nucleotide repeats. The polynucleotide sequences generated above are examined and grouped together into families according to equivalence by complementarity, for example, by execution of a computer program. Since duplex DNA is assumed, the mono-nucleotide repeat motif (A)n is equivalent to the motif (T)n by complementarity. This reduces the number of mono-nucleotide sequence motifs to two: (A)n=(T)n and (G)n=(C)n. Similarly, the dinucleotide repeat motif (AA)n is equivalent to the motif (TT)n by complementarity. Following this step, the number of dinucleotide sequence motifs is reduced to ten: (AT)n, (CG)n, (GC)n, (TA)n, (AA)n=(TT)n, (GG)n=(CC)n, (AC)n=(GT)n, (CA)n=(TG)n, (TC)n=(GA)n, and (CT)n=(AG)n.
The polynucleotide sequence families grouped together according to equivalence by complementarity are examined and further grouped together according to equivalence by frame shifting, for example, by execution of a computer program. For mono-nucleotide repeat motifs, this step is redundant, as there is only a single unique motif. For dinucleotide repeat motifs, this step reduces the number of families to only six: (AT)n=(TA)n, (GC)n=(CG)n, (AA)n=(TT)n, (GG)n=(CC)n, (AC)n=(GT)n=(CA)n=(TG)n, (TC)n=(GA)n=(CT)n=(AG)n. Note that the steps of grouping polynucleotide sequences into families according to equivalence by complementarity and by frame shifting are mutually exclusive and thus may be performed in either order.
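The enumeration and grouping steps described for pure repeats can be sketched as follows. This is one possible implementation, not the computer program referred to in the specification: the function names are hypothetical, and each family is identified by the alphabetically smallest motif among its members, which is a choice made here only to give every family a single label.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Motif read 5' to 3' on the opposing strand of the duplex."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif: str) -> list[str]:
    """All frames of the repeat, i.e. the initial frame and every shifted frame."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def families(repeat_length: int, use_frame_shift: bool = True) -> dict[str, list[str]]:
    """Group all motifs of the given length by complementarity and, optionally, frame shift."""
    groups: dict[str, list[str]] = {}
    for bases in product("ACGT", repeat=repeat_length):
        motif = "".join(bases)
        if use_frame_shift:
            candidates = rotations(motif) + rotations(reverse_complement(motif))
        else:
            candidates = [motif, reverse_complement(motif)]
        # The smallest equivalent motif serves as the family label.
        groups.setdefault(min(candidates), []).append(motif)
    return groups


print(len(families(1, use_frame_shift=False)))  # mono-nucleotide families
print(len(families(2, use_frame_shift=False)))  # dinucleotide, complementarity only
print(len(families(2)))                         # dinucleotide, both groupings
```

For repeat lengths of 1 and 2 the sketch reproduces the counts given in the text: two mono-nucleotide families, ten dinucleotide families after grouping by complementarity, and six after further grouping by frame shifting.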
II. Creation of Families of Impure Repeat Sequences
Only one property of DNA that lends itself to the creation of families of polynucleotide sequences of pure repeats is shared in the creation of families of impure repeats, i.e., both strands of a polynucleotide duplex are simultaneously acted upon by a DNA binding agent. In this example, however, mono-nucleotide repeat motifs do not lend themselves to the steps that follow. Also, the last base position of each frame with a repeat length of 2 or greater is replaced with a random nucleotide denoted "X." This allows for the creation of an infinite diversity of polynucleotide sequences.
For the creation of families of impure repeat polynucleotide sequences, all possible sequence motifs of the desired repeat length, i.e., the initial frame, are created, for example, by execution of a computer program, by iterating the four possible base substitutions in each base position of the repeating sequence. In the case of a dinucleotide repeat, with a repeat length of 2, there are 16 possible sequence motifs: (AA)n, (AC)n, (AG)n, (AT)n, (GA)n, (GC)n, (GG)n, (GT)n, (CA)n, (CG)n, (CC)n, (CT)n, (TA)n, (TC)n, (TG)n, (TT)n. For each of these sequence motifs, the last base position of each frame is replaced with a nucleotide "X," indicating that the base position may be any base permitted by the DNA synthesis process. For example, in the case of a dinucleotide repeat, the following impure repeats result from this step: (AX)n, (CX)n, (GX)n, (TX)n. In the case of a tri-nucleotide repeat, there are 64 possible sequence motifs: (AAA)n, (AAG)n, (AAC)n, (AAT)n, (ACA)n, (ACC)n, (ACG)n, (ACT)n, (AGA)n, (AGC)n, (AGG)n, (AGT)n, (ATA)n, (ATC)n, (ATG)n, (ATT)n, (GAA)n, (GAC)n, (GAG)n, (GAT)n, (GCA)n, (GCC)n, (GCG)n, (GCT)n, (GGA)n, (GGC)n, (GGG)n, (GGT)n, (GTA)n, (GTC)n, (GTG)n, (GTT)n, (CAA)n, (CAC)n, (CAG)n, (CAT)n, (CGA)n, (CGC)n, (CGG)n, (CGT)n, (CCA)n, (CCC)n, (CCG)n, (CCT)n, (CTA)n, (CTC)n, (CTG)n, (CTT)n, (TAA)n, (TAC)n, (TAG)n, (TAT)n, (TCA)n, (TCC)n, (TCG)n, (TCT)n, (TGA)n, (TGC)n, (TGG)n, (TGT)n, (TTA)n, (TTC)n, (TTG)n, (TTT)n. For each of these sequence motifs, the last base position of each frame is replaced with a nucleotide "X," indicating that the base position may be any base permitted by the DNA synthesis process.
For example, in the case of a tri-nucleotide repeat, the following impure repeats result from this step: (AAX)n, (AAX)n, (AAX)n, (AAX)n, (ACX)n, (ACX)n, (ACX)n, (ACX)n, (AGX)n, (AGX)n, (AGX)n, (AGX)n, (ATX)n, (ATX)n, (ATX)n, (ATX)n, (GAX)n, (GAX)n, (GAX)n, (GAX)n, (GCX)n, (GCX)n, (GCX)n, (GCX)n, (GGX)n, (GGX)n, (GGX)n, (GGX)n, (GTX)n, (GTX)n, (GTX)n, (GTX)n, (CAX)n, (CAX)n, (CAX)n, (CAX)n, (CGX)n, (CGX)n, (CGX)n, (CGX)n, (CCX)n, (CCX)n, (CCX)n, (CCX)n, (CTX)n, (CTX)n, (CTX)n, (CTX)n, (TAX)n, (TAX)n, (TAX)n, (TAX)n, (TCX)n, (TCX)n, (TCX)n, (TCX)n, (TGX)n, (TGX)n, (TGX)n, (TGX)n, (TTX)n, (TTX)n, (TTX)n, (TTX)n. Next, duplicate sequence motifs are compared to one another, and duplicate motifs are eliminated. The 64 possible motifs are now reduced to 16: (AAX)n, (ACX)n, (AGX)n, (ATX)n, (GAX)n, (GCX)n, (GGX)n, (GTX)n, (CAX)n, (CCX)n, (CGX)n, (CTX)n, (TAX)n, (TCX)n, (TGX)n, (TTX)n.
The unique polynucleotide sequences are examined and grouped together into families according to equivalence by complementarity, for example, by execution of a computer program. Since duplex DNA is assumed, the impure dinucleotide repeat motif (AX)n is equivalent to the motif (XT)n by complementarity. This reduces the number of impure dinucleotide sequence motifs to two: (AX)n=(XT)n and (GX)n=(XC)n; and the number of impure tri-nucleotide sequence motifs to 10: (AAX)n=(XTT)n, (ACX)n=(XGT)n, (AGX)n=(XCT)n, (ATX)n=(XAT)n, (GAX)n=(XTC)n, (GCX)n=(XGC)n, (GGX)n=(XCC)n, (GTX)n=(XAC)n, (CGX)n=(XCG)n and (TAX)n=(XTA)n. The resulting impure polynucleotide sequence motifs are synthesized by means known in the art, with the base positions denoted "X" synthesized with the appropriate base nucleotides to generate populations of oligonucleotides representative of all sequences matching the motif, e.g., by permitting all four bases into the synthesis reaction at the steps corresponding to the "X" base positions in the polynucleotide sequence.
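The impure repeat steps (enumerate, replace the last base with "X", eliminate duplicates, group by complementarity) can be sketched in the same manner. The function names are hypothetical. The sketch assumes that "X" is its own complement and that a motif written with a leading "X", such as (XTT)n, denotes the same family member as (TTX)n; that reading is inferred from the family counts given in the text rather than stated in it.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def impure_motifs(repeat_length: int) -> list[str]:
    """Enumerate all motifs, replace the last base with X, and drop duplicates."""
    if repeat_length < 2:
        raise ValueError("impure repeats require a repeat length of 2 or greater")
    return sorted(
        {"".join(bases)[:-1] + "X" for bases in product("ACGT", repeat=repeat_length)}
    )


def impure_families(repeat_length: int) -> dict[str, list[str]]:
    """Group the unique impure motifs according to equivalence by complementarity."""
    groups: dict[str, list[str]] = {}
    for motif in impure_motifs(repeat_length):
        fixed = motif[:-1]  # the defined bases preceding X
        partner = fixed.translate(COMPLEMENT)[::-1]  # same bases on the opposing strand
        # Assumption: label each family by the smaller of the two fixed parts.
        groups.setdefault(min(fixed, partner) + "X", []).append(motif)
    return groups


print(impure_motifs(2))
print(len(impure_motifs(3)), len(impure_families(2)), len(impure_families(3)))
```

Under these assumptions the sketch yields four unique impure dinucleotide motifs in two families, and sixteen unique impure tri-nucleotide motifs in ten families, matching the counts stated above.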
The above examples, exemplified by dinucleotide repeats, illustrate a method for creating six families of polynucleotide sequence motifs which represent an efficient and economical means of synthesizing and characterizing the relative reactivities of polynucleotide sequences from the sixteen individual motifs that would have to be synthesized and characterized if the method of this invention were not employed. For tri-nucleotide repeats, sixty-four original sequence motifs can be reduced to just ten families of polynucleotide sequence motifs. The utility of the above invention is not limited to repeat sequences illustrated above but can be extended to tetranucleotide repeats, pentanucleotide repeats, and higher repeating units.
C. Method for Predicting Mutable Sites
In another aspect, the invention provides a method for determining polynucleotide sequences which are more (or less) prone to mutation. The method comprises the steps of constructing a database of polynucleotide sequences of length n which exert a known binding affinity on a flanking ligand binding site; taking any polynucleotide sequence (query sequence), including a naturally occurring polynucleotide sequence, and dividing it into all possible polynucleotide sequences of length n by position along the length of the query sequence; finding the most homologous polynucleotide sequence of length n in the database of length n on either the 5' side or 3' side of one strand, on both the 5' side and 3' side of one strand, on either the 5' side or 3' side of the opposing strand or on both the 5' side and 3' side of the opposing strand of the position of interest; determining the relative binding affinity of the polynucleotide sequence for a ligand conferred by a flanking polynucleotide sequence of length n at the position of interest based on the relative binding affinity of the most homologous polynucleotide sequence of length n in the database of polynucleotide sequences of length n; reiterating the process along the length of the query sequence and determining the relative binding affinity of a polynucleotide sequence for a ligand conferred by a flanking polynucleotide sequence of length n at every position along the length of the query sequence; comparing all relative binding affinities; and determining the polynucleotide sequence of length n in the query sequence that has the highest relative binding affinity. The above method is further illustrated by the following example. Let's assume that a database of polynucleotide sequences of 40 base pairs in length with known relative binding affinities has been constructed by the method described herein.
Let's also assume that one wishes to predict mutable sites in any gene of interest, the query sequence, that is 100 base pairs long. The method would include dividing the 100 base pair query sequence into all possible sequences of 40 base pairs in length beginning either at the -40 base pair position (at the 5' side) and/or at the +140 base pair position (at the 3' side). Next, the relative binding affinity of a polynucleotide sequence for a ligand conferred by the flanking polynucleotide sequence -39 to +1 (polynucleotide sequence A) at the 5' position and +101 to +140 (polynucleotide sequence Z) at the 3' position of the same strand is determined by searching the database for the most homologous sequence to polynucleotide sequences A and Z that is in the database and reading off the relative binding affinity previously determined for the most homologous polynucleotide sequence. The average relative binding affinity conferred by these flanking polynucleotide sequences is calculated. This process is reiterated along the length of the query sequence at every position. Thus, the next polynucleotide flanking sequence would be from position -38 to +2 (polynucleotide sequence B) at the 5' position and +100 to +139 at the 3' position (polynucleotide sequence Y) of the same strand. As before, the relative binding affinity of a polynucleotide sequence for a ligand conferred by flanking polynucleotide sequences B and Y is determined by searching the database for the most homologous sequence to polynucleotide sequences B and Y that is in the database and reading off the relative binding affinity of the most homologous polynucleotide sequences. The average relative binding affinity conferred by these flanking polynucleotide sequences is again calculated. This method creates a table of relative binding affinities for each position along the query sequence. The position(s) with the lowest value(s) will be the least mutable and the position(s) with the highest value(s) will be the most mutable.
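The database lookup and averaging described above can be sketched as follows. This is a simplified reading of the worked example, not the method as claimed: it scores each position of a single strand by the n bases immediately 5' and the n bases immediately 3' of that position, skips positions that lack a full flank on either side, and takes "most homologous" to mean the highest fraction of identical positions. The function names and the toy database values are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two sequences of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def nearest_affinity(window: str, database: dict[str, float]) -> float:
    """Read off the affinity of the most homologous database sequence."""
    best = max(database, key=lambda known: identity(window, known))
    return database[best]


def affinity_profile(query: str, database: dict[str, float], n: int) -> list[tuple[int, float]]:
    """Table of (position, average affinity conferred by the 5' and 3' flanks)."""
    if any(len(known) != n for known in database):
        raise ValueError("every database sequence must have length n")
    profile = []
    for pos in range(n, len(query) - n):
        five_prime = query[pos - n:pos]
        three_prime = query[pos + 1:pos + 1 + n]
        average = (
            nearest_affinity(five_prime, database)
            + nearest_affinity(three_prime, database)
        ) / 2
        profile.append((pos, average))
    return profile


# Hypothetical 4-base database; a real database would hold measured values.
toy_database = {"AAAA": 1.0, "CCCC": 3.0}
print(affinity_profile("AAAAGCCCC", toy_database, 4))
```

Sorting the resulting table by its second column would place the positions predicted to be least mutable first and most mutable last, following the interpretation given in the text.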
The method of the invention can also be used to determine regions of a polynucleotide sequence, including a gene, which are more (or less) likely to mutate, e.g., in response to a selection pressure on the organism.
The method of the invention is useful, e.g., for determining which portions of a gene are optimal targets for design of probes, e.g., for the detection of the presence of a microorganism in a biological sample. For example, a probe which is complementary to a portion of bacterial polynucleotide sequence can be used to detect the presence of the bacterium in a biological sample, e.g., to detect bacterial infection, as is well known in the art. However, if a mutation occurs in the bacterial genome at the site to which the probe binds, the probe will no longer bind (or will bind with decreased affinity) to the polynucleotide sequence of the mutated bacterium, thus rendering detection of the bacterium more difficult. According to the invention, a probe can be designed to be complementary to a portion of the bacterial polynucleotide sequence which is less prone to mutation and thus the probability that the probe will be rendered useless by subsequent mutation is decreased. The method of the invention is also useful for determining functionally important portions of a protein which is encoded by a polynucleotide sequence. Without wishing to be bound by theory, it is believed that polynucleotide sequences which code for critical residues or regions of the protein will reside in regions of the gene which are relatively resistant to deleterious mutations which would decrease or abolish the desired function of the protein. Thus, by determining the susceptibility to mutation of regions of a gene, e.g., by determining the effect of altering the nucleotide composition of flanking polynucleotide sequences, critical residues of the encoded protein can be identified.
In another embodiment, the method of the invention can be used to determine or predict regions of a protein or polypeptide which are antigenically important. Due to the degeneracy of the genetic code, a plurality of polynucleotide sequences can often code for a single polypeptide. Routine computational methods allow the determination of the relative binding affinity of each polynucleotide sequence which encodes a selected polypeptide. For a given polypeptide, the binding affinity of a naturally-occurring coding polynucleotide sequence can be compared to the binding affinities of all possible polynucleotide sequences which could code for that polypeptide, to determine coding sequences of high or low relative binding affinity.
D. Method for Modulating Mutation Rate of Polynucleotide Sequences
In this embodiment, the invention provides means for altering the susceptibility to mutation of a polynucleotide sequence. Thus, a region of DNA of interest in a host cell or organism can be made less prone to mutation, e.g., to prevent mutation of a region of the DNA by inserting adjacent to the region of DNA a heterologous duplex polynucleotide sequence whose nucleotide composition is known to decrease the binding affinity of a mutagen to the region of DNA; similarly, a region of DNA of interest in a host cell or organism can be made more prone to mutation, e.g. to increase the susceptibility of a region of DNA to mutation, a heterologous duplex polynucleotide sequence whose nucleotide composition is known to increase the binding affinity of a mutagen to the region of DNA is inserted adjacent to the region of DNA whose susceptibility to mutation is to be increased. This method can thereby provide a method for producing non-naturally occurring polynucleotide sequences (and proteins encoded by them); this is a form of "directed evolution" in that a particular gene or portion thereof can be targeted for mutation without increasing the propensity for mutation of other regions of the genome. Such non-naturally occurring proteins can be assayed to determine properties such as binding specificity, binding affinity, rate of catalysis of a reaction, and the like, to identify proteins which have desirable characteristics. The method of the invention can be used to speed the process of preparing and selecting mutant proteins.
In this embodiment, a polynucleotide sequence, and the protein encoded thereby, can be "protected" to prevent mutations, e.g., by altering the nucleotide composition of the polynucleotide sequence, a nearby (e.g., flanking) polynucleotide sequence or a remote polynucleotide sequence to decrease the relative binding affinity of a mutagen for the polynucleotide sequence encoding the protein.
E. Method for Altering the Expression Level of Any Gene
In another aspect, a method of this invention can be used to modify the nucleotide composition of the flanking polynucleotide sequence or a pair of flanking polynucleotide sequences of any regulatory DNA sequence that alters the relative frequency of transcription initiation as compared to the pre-existing flanking polynucleotide sequence or a pair of flanking polynucleotide sequences. Enhancers and proximal-promoter sequences are examples of such sequences. Replacing the native flanking polynucleotide sequence or a pair of flanking polynucleotide sequences adjacent to an enhancer can alter the frequency of transcription initiation of the promoter the enhancer regulates. Several expression vectors are commercially available which contain enhancer-promoter sequences that regulate the expression of a biologically functional DNA cloned into the vector. pcDNA 3.1(+/-) from Invitrogen is one example of such an expression vector. The expression of the desired cloned gene can be further improved by replacing the existing flanking polynucleotide sequences adjacent to the enhancer-promoter sequences with heterologous flanking polynucleotide sequences whose nucleotide composition confers a higher binding affinity of the promoter-enhancer for transcription factor(s) and/or RNA polymerase, thereby increasing the frequency of transcription initiation and in turn increasing cellular output of the desired product.
Thus, the ability to increase the cellular output of a biopharmaceutical by simply altering the nucleotide composition of the polynucleotide sequences flanking the promoter of the desired biologically functional gene is of great economic value. However, the ability to alter the expression of a gene simply by altering the nucleotide composition of polynucleotide sequences flanking its promoter and/or enhancer also offers great promise in the treatment of human pathological conditions. For example, the present invention can be utilized to increase the expression of a gene whose protein product can be increased to treat a disease in vivo. For example, the STATs (Signal Transducers and Activators of Transcription) are a family of latent cytoplasmic proteins that are activated to participate in gene control when cells encounter various extracellular signals. In a recent report by Kumar et al. (1998) the STATs have been shown to be involved in the induction of cell death (apoptosis). Apoptosis is initiated by activation of a cascade of enzymes that cleave cellular proteins, resulting in the efficient termination of the cell. STATs activate genes containing GAS (Gamma-Activated Sequence) elements. Thus, the present invention can be utilized to replace the polynucleotide flanking sequence or a pair of polynucleotide flanking sequences adjacent to the GAS element with a heterologous polynucleotide flanking sequence or a pair of heterologous polynucleotide flanking sequences whose nucleotide composition confers a relatively higher binding affinity for the STATs as compared to the naturally occurring flanking polynucleotide sequences and thereby increase the expression of genes which will induce premature death of unwanted cells or cells that have lost control of their own growth.
Alternatively, this invention can also be utilized to reduce and perhaps even entirely inhibit the expression of undesirable genes whose protein products can lead to the manifestation of pathological conditions. Once the promoter of the disease causing gene is isolated, the naturally occurring adjacent flanking polynucleotide sequence or a pair of naturally occurring adjacent flanking polynucleotide sequences can be excised and replaced by a heterologous flanking polynucleotide sequence or a pair of heterologous polynucleotide flanking sequences whose nucleotide composition confers a relatively lower binding affinity of the transcription factor(s) and/or RNA polymerase for the promoter as compared to the naturally occurring adjacent polynucleotide flanking sequence or a pair of adjacent polynucleotide flanking sequences. The resulting promoter will have a lower relative frequency of transcription initiation as compared to the promoter flanked by the original polynucleotide sequences in the genome and thus reduce or perhaps entirely inhibit the expression of the unwanted gene product.
E. Computer-Implemented Method for Generating Homologous Polynucleotide Sequences into Families Related Solely by Their Nucleotide Composition to the Seed Sequence
In accordance with one embodiment of the present invention, a computer system may be utilized to generate mutant DNA polynucleotide sequences from a seed sequence. Each element in the seed sequence is a base (i.e., adenine, guanine, thymine or cytosine), and each element occupies a given base position within the polynucleotide sequence. The number of elements within the polynucleotide sequence may vary. The computer system generates mutant DNA polynucleotide sequences (which may include International Union of Pure and Applied Chemistry nucleic acid ambiguity codes in place of one or more mutated positions in the seed sequence) that have a desired degree of homology relative to the seed sequence. For example, suppose that a user wishes to view mutant DNA polynucleotide sequences that have an 80% or greater homology relative to a DNA polynucleotide sequence having five base elements. In such a case, each of the mutant DNA polynucleotide sequences differs from the seed sequence by one element, since changing one of five positions leaves four of five (80%) unchanged. The computer system may be adapted to provide mutant DNA polynucleotide sequences with at least a minimum degree of homology requested by the user. For example, a user may request a 95% homology or greater rather than an 80% homology or greater. The number of mutant sequences which are generated, particularly from long seed sequences, may be quite large and potentially cumbersome in terms of demands on computer hardware, etc. As such, the computer program may desirably employ further constraints for reducing the cumbersome nature of the data generated, i.e., the number of mutant sequences generated, while ensuring that those mutant sequences of most interest are generated.
Such constraints include some or all of the following: providing a minimum space value (SV1) of greater than 1 between bases to be mutated in the flanking polynucleotide sequence; providing a ratio of bases G to C (GC1) and/or ratio of bases A to T (AT1) which must be maintained in the mutant sequences; and specifying a region of said flanking sequence which is not to be mutated in generating the mutant sequence(s). The constraint of providing a ratio of bases G to C (GC1) and/or ratio of bases A to T (AT1) which must be maintained in the mutant sequences may be additionally desirable in certain cases, as it has been found that, e.g., a higher GC percentage composition in the polynucleotide sequence confers lower reactivity on the sequence and as such, reactivity may be controlled by employing such a constraint in the computer program. Also, the region of flanking sequence (e.g., from 5 to 20 nucleotides in length) immediately adjacent to the binding site may have a great effect on binding (positive or negative); as such, specifying that this region be held more or less constant in the mutant sequence(s) would reduce the number of mutant sequences generated and offer the additional advantage of not adversely disturbing the properties of this region on binding. Those skilled in the art will appreciate that other constraints consistent with this discussion may be included as well to refine the mutant sequence families.
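The homology threshold and the constraints described above can be sketched as simple checks on a candidate mutant. The following Python fragment is a minimal illustration only, not the program of the invention: the function names, the integer-percent homology argument, and the interpretation of the GC constraint as a tolerance on the G+C fraction are all assumptions introduced here.

```python
def max_mismatches(seed_length: int, min_percent: int) -> int:
    """Largest number of changed positions that keeps homology >= min_percent.

    Integer arithmetic avoids floating-point rounding (e.g. 5 * 0.2).
    """
    return seed_length * (100 - min_percent) // 100


def gc_fraction(seq: str) -> float:
    """Fraction of positions that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def mutated_positions(seed: str, mutant: str) -> list:
    """Zero-based positions at which the mutant differs from the seed."""
    return [i for i, (a, b) in enumerate(zip(seed, mutant)) if a != b]


def satisfies_constraints(seed, mutant, min_percent,
                          min_spacing=1, protected=(), gc_tolerance=None):
    """Return True if the mutant passes every constraint that was supplied.

    min_spacing  - hypothetical stand-in for the minimum space value
    protected    - positions that must not be mutated (assumed zero-based)
    gc_tolerance - assumed reading of the GC constraint: maximum allowed
                   change in G+C fraction relative to the seed
    """
    positions = mutated_positions(seed, mutant)
    if len(positions) > max_mismatches(len(seed), min_percent):
        return False
    if any(b - a < min_spacing for a, b in zip(positions, positions[1:])):
        return False
    if any(p in protected for p in positions):
        return False
    if gc_tolerance is not None:
        if abs(gc_fraction(mutant) - gc_fraction(seed)) > gc_tolerance:
            return False
    return True
```

With a five-base seed and an 80% threshold, max_mismatches(5, 80) evaluates to 1, which matches the one-element difference described above; at 95% it evaluates to 0, so only the seed itself qualifies.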
Figure 4 depicts a block diagram of a computer system A10 that is suitable for practising this aspect of the present invention. The computer system A10 contains a central processing unit (CPU) A12 for executing computer instructions. The computer system A10 also includes a display device A14, such as a video display, and a printer A16 for producing printed output. The computer system A10 may include one or more input devices A18, such as a mouse, a keyboard or a microphone. The computer system A10 includes a primary storage A20 and a secondary storage A22. The primary storage A20 may be implemented using random access memory (RAM) or using other types of appropriate storage devices. The secondary storage A22 may be implemented as a magnetic hard disk drive or other secondary storage device. The secondary storage A22 may facilitate the use of removable computer readable media such as CD-ROMs. The sequence generating facility A28 is stored during execution within the primary storage A20 and executed on the CPU A12. The sequence generating facility A28 may be implemented in computer instructions that constitute one or more programs, libraries or modules. Those skilled in the art will appreciate that the sequence generating facility A28 may be written in any of a number of different computer languages and may take many different formats. The secondary storage A22 may hold data files A30 that may include mutant DNA sequences that have been generated by the sequence generating facility A28 or other data. The computer system A10 may also include the resources for communicating with other remote computing resources. For example, the computer system A10 may include a network adapter A24 for connecting a computer system to a computer network. This computer network may be any of a number of different well-known local area networks (LANs) or wide area networks (WANs). A modem A26 may be provided to facilitate modem communications over cable connections, wireless connections, or traditional analog telephone lines.
The sequence generating facility A28 is largely divisible into two major components: a main program and a Mutate() function. The following pseudo-code identifies the functionality performed by these respective components.
Main Code:
    Obtain DNA sequence
    Print seed sequence
    Set mutenext to start of sequence
    While mutenext < last_mutable_position
        Mutate mutenext
        If there is more than one position to be mutated simultaneously
            Set mute_count = 2
            Set mutenext = mutenext + min_dist
            CALL Mutate(mute_count, mutenext)
        Else
            Print the mutated sequences
        Increment mutenext

Mutate():
    parameters:
        mute_count    integer representing the # simultaneous mutations at this point in the calling structure
        mutenext      next sequence position to mutate
    If mute_count = mutation_limit
        mutate each base position from mutenext to end
        print each mutation
    Else
        mutate mutenext
        set mutenext = mutenext + 1
        increment mute_count
        CALL Mutate(mute_count, mutenext)
    return
Figure 5 depicts a flow chart listing the steps that are performed in the main body of code for the sequence generating facility A28. Those skilled in the art will appreciate that this flow chart is intended to be merely illustrative and not limiting of the present invention. The functionality realized for the sequence generating facility A28 may also be realized by performing other sequences of computing steps.
The sequence generating facility A28 operates to output the seed sequence and mutant DNA polynucleotide sequences produced with the constraints that have been selected. The sequence generating facility A28 identifies which base positions are to be mutated and sequentially mutates those positions in a predefined order. This process continues until all mutant DNA polynucleotide sequences that fulfill the constraints have been generated for the seed sequence.
Initially, the seed sequence is obtained (step B10 in Figure 5; see also Tables 1-4). The seed sequence and its complement may be entered interactively by the user via an input device A18 or may be read from the data file A30 that is stored in secondary storage A22. The seed sequence may be stored as an array of characters. As was mentioned above, the number of elements in the DNA polynucleotide sequence may vary. After the seed sequence is obtained, the seed sequence is output (step B12 in Figure 5). This may entail sending the seed sequence to a printer A16. Each base is represented by its corresponding representative character A, G, T or C, and the sequences are output as strings of characters selected from the DNA alphabet of A, G, T and C. The main program then begins the process of generating the mutant DNA polynucleotide sequences. A pointer, designated as "mutenext", is used to identify the next base position that is to be mutated in the DNA polynucleotide sequence. Initially, "mutenext" is designated to point to the first base position within the sequence (step B14 in Figure 5). The main program proceeds to enter a loop (see B16 in Figure 5) that performs the brunt of the work for generating the mutant DNA polynucleotide sequences from the seed sequence. The loop continues execution until "mutenext" is at the last mutable position within the seed DNA sequence (in the simplest case, this is the last element in the seed sequence). Initially, the base position pointed to by "mutenext" is mutated (step B18 in Figure 5). This entails replacing the base that is currently at the base position pointed to by "mutenext" with the next sequential base from a base table. For purposes of the discussion below, it is assumed that the bases are assigned from a base table where the representations of the bases are stored in the predefined sequence of A, G, C, T. Thus, if the base position currently has a value of "A", it is mutated to have a value of "G". Those skilled in the art will appreciate that alternative base encodings may be used. For instance, the bases may be categorized into pyrimidines and purines rather than into separate bases.
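The base-table step described above can be illustrated with a short Python sketch. It assumes the table order A, G, C, T given in the text and, as an assumption made here, that the table wraps from its last entry back to its first so that every base has three successors; the names BASE_TABLE, next_base, substitutions and base_class are hypothetical.

```python
BASE_TABLE = "AGCT"  # predefined order of bases given in the text
PURINES = "AG"       # A and G are purines; C and T are pyrimidines


def next_base(base: str) -> str:
    """Return the next sequential base from the table (wrap-around assumed)."""
    i = BASE_TABLE.index(base)
    return BASE_TABLE[(i + 1) % len(BASE_TABLE)]


def substitutions(original: str):
    """Yield the replacement bases in table order.

    The cycle stops when it would return to the original base, so each
    position is "fully mutated" after three substitutions.
    """
    current = next_base(original)
    while current != original:
        yield current
        current = next_base(current)


def base_class(base: str) -> str:
    """Alternative two-letter encoding: R for purine, Y for pyrimidine."""
    return "R" if base in PURINES else "Y"
```

For a position holding A, the generator yields G, C and T in that order; for a position holding C it yields T, A and G under the wrap-around assumption.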
As was discussed above, the sequence generating facility A28 is capable of mutating more than one base position simultaneously. Hence, in step B20, the sequence generating facility checks whether there is more than one position to be mutated simultaneously. For illustrative purposes, it is assumed that a maximum of two positions may be simultaneously mutated for the flow chart of Figure 5. Thus, the variable "mute_count" holds a value that identifies the number of base positions that are to be mutated simultaneously. Since there are only two options in the examples shown in Figure 5, it is assumed that "mute_count" equals two in the instance where more than one position is to be mutated simultaneously (see step B22 in Figure 5). Those skilled in the art will appreciate that more than two positions may be simultaneously mutated in practising the present invention. The value of "mutenext" is then updated to point to the next position to be mutated. The next position is the specified minimum distance (i.e., the value of the variable "min_dist") from the old value of "mutenext" (step B24 in Figure 5). As was discussed above, a constraint may be employed to require that the positions being mutated be separated by one or more bases. The extent of the separation is captured in the "min_dist" variable. The program then calls the Mutate() function to perform the required mutation while holding constant the value of the base position that was mutated in step B18 (step B26 in Figure 5). The "mute_count" value and the "mutenext" value are passed as parameters to the Mutate() function, which will be described in more detail below. If there is only one position to be mutated simultaneously (as checked in step B20), the main program prints the mutated polynucleotide sequences that are produced by mutating the base position mutated in step B18 (B28 in Figure 5). After step B28 and after step B26, "mutenext" is incremented to point to the next mutable position. The loop has then reached its end (step B32), and the process is repeated beginning at step B16. The process is repeated for each of the mutable positions until all of the appropriate mutant DNA polynucleotide sequences have been generated.
Figure 6 shows a flow chart of the steps that are performed by the Mutate() function. The variable "mutation_limit" identifies the maximum number of mutable positions that may be simultaneously mutated. In step C10 of Figure 6, the Mutate() function checks whether the "mute_count" parameter is equal to the mutation limit. In other words, a check is made whether the number of positions being simultaneously mutated is the maximum. If the number of positions being mutated equals the mutation limit, each base position is mutated from the base position currently pointed to by "mutenext" to the end of the sequence (step C12 in Figure 6), and each resulting mutated DNA polynucleotide sequence is output (step C14 in Figure 6). If the number of positions being mutated simultaneously is less than the maximum, the position currently pointed to by "mutenext" is mutated (step C16 in Figure 6). "Mutenext" is then incremented, and "mute_count" is incremented (step C20 in Figure 6). The Mutate() function is recursively called, passing "mute_count" and "mutenext" as parameters (step C22 in Figure 6).
An example is helpful to illustrate operation of the sequence generating facility. Suppose that the seed sequence is like sequence D10 shown in Figure 7. This seed sequence D10 is printed as an initial step of the mutant DNA sequence generation process. Further suppose that two base positions may be simultaneously mutated and that there is no minimum distance requirement between the base positions that are mutated. After printing the seed sequence, the sequence generating facility initializes the "mutenext" pointer to point to the first base position within the seed sequence D10. Mutations are performed with the assistance of a base table D20. The base table D20 is simply a table that holds the predefined sequence of bases. As was mentioned above, this predefined sequence in the illustrative case is A, G, C, T. A "started" pointer identifies the base value with which the process begins. In particular, "started" points to the initial value of the base for the base position that is to be mutated. The next base pointer points to the next base to be used in mutating the base position. The first base pointer points to the first base in the base table and the last base pointer points to the last base in the base table.
The mutation begins by initially mutating the base position pointed to by "mutenext". The next base value is assigned to the base position pointed to by "mutenext". For the example shown in Figure 7, the first base in the seed sequence D10 is changed from A to G. Then, since two base positions are to be simultaneously mutated, "mutenext" is incremented by one to point to the second base position within the sequence and this second position is also mutated to have the next base value of G. The resulting mutant DNA polynucleotide sequence D12 is then output.
The mutation of the second base position within the polynucleotide sequence continues while holding the first base position constant. Thus, "mutenext" continues to point to the second base position but the next base pointer is incremented to point to the next base within the base table D20. Hence, the second base position is mutated to have a value of C. The resulting mutant polynucleotide sequence D14 is output. The second base position is then further mutated to have the next base value of T. The resulting mutant DNA polynucleotide sequence D16 is output. At this point, the second base position has been fully mutated. The started pointer and the next base pointer both point to the same position within the base table D20. Hence, "mutenext" is incremented by one to point to the third base position and the next base pointer is incremented to point to G. The process of mutating the third base position while the first base position stays constant is then initiated in a like fashion. All of the remaining base positions are mutated until the last base position has been mutated, at which point the first base position is changed to C, and the process repeats with each of the other base positions being fully mutated.
Figure 8 shows an example of the output mutant DNA sequences that are generated using a seed sequence of AAAAAA and mutating two of the base positions simultaneously. Figure 9 shows a mutation of the same seed sequence when two base positions are simultaneously mutated but the minimum distance between the positions being mutated is two. As can be seen, only 91 mutant DNA polynucleotide sequences are produced in Figure 9 whereas 136 mutant DNA polynucleotide sequences are produced in Figure 8.
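The worked example above can be reproduced with a short enumeration. The Python sketch below is an illustrative reading of the procedure for the two-position case, not the facility A28 itself; the names alternatives, double_mutants and generate_family are hypothetical, and "minimum distance" is taken to mean that the two mutated positions differ in index by at least min_dist.

```python
BASE_TABLE = "AGCT"  # predefined base order from the text


def alternatives(base: str) -> list:
    """Replacement bases in table order, starting after the original base."""
    i = BASE_TABLE.index(base)
    return [BASE_TABLE[(i + k) % 4] for k in range(1, 4)]


def double_mutants(seed: str, min_dist: int = 1):
    """Yield sequences that differ from the seed at exactly two positions.

    The first mutated position is held constant while each later position
    (at least min_dist away) is cycled through its three substitutions.
    """
    for i in range(len(seed)):
        for first in alternatives(seed[i]):
            for j in range(i + min_dist, len(seed)):
                for second in alternatives(seed[j]):
                    mutant = list(seed)
                    mutant[i], mutant[j] = first, second
                    yield "".join(mutant)


def generate_family(seed: str, min_dist: int = 1) -> list:
    """Seed sequence followed by all of its double mutants."""
    return [seed] + list(double_mutants(seed, min_dist))


if __name__ == "__main__":
    print(len(generate_family("AAAAAA")))              # 136
    print(len(generate_family("AAAAAA", min_dist=2)))  # 91
```

For the seed AAAAAA the first three mutants are GGAAAA, GCAAAA and GTAAAA, following the order described for sequences D12, D14 and D16. There are 15 position pairs with 9 substitutions each (135 mutants) with no distance requirement, and 10 pairs (90 mutants) with a minimum distance of two. If the printed seed is counted in the output, these totals agree with the 136 and 91 sequences stated for Figures 8 and 9; that reading is an inference and is not stated in the text.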
While the example algorithm and flow diagram provided here address specific homologies, it should be apparent to one skilled in the art that successively greater homologies would be similarly analyzed to arrive at the complete family of sequences related to the seed sequence.
EXAMPLES
The invention is further illustrated by the following non-limiting examples.
Example 1
Shown in the box at the top is a linear polynucleotide DNA sequence construct, created by synthetic means known in the art, which consists of (from left to right, 5' - 3') a unique PCR primer site followed by a random insert of 20, 40 or 80 bases created by allowing a DNA synthesizer to insert any of the four nucleotide bases A, G, C, and T; a BamHI binding site; and a second 20, 40 or 80 base random insert followed by a second unique PCR primer site. Each primer site includes an EcoRI site.
When synthesized, a population of synthetic polynucleotide molecules is generated with different polynucleotide sequences at the random insert sites which, after PCR amplification using oligonucleotides complementary to the primer sites as PCR primers, are transformed into a population of duplex polynucleotide molecules containing a BamHI recognition site flanked by random polynucleotide sequences. Incubation of these duplex polynucleotide sequences with appropriate (empirically determined) quantities of the endonuclease results in a portion of the duplex polynucleotide sequences being bound by BamHI while some of the duplex polynucleotide sequences are not bound. Those duplex polynucleotide sequences which bind BamHI with a relatively high affinity bind the endonuclease in preference to those duplex polynucleotide sequences which bind BamHI with a relatively low affinity. Since the bound duplex polynucleotide sequences can be separated from the unbound duplex polynucleotide sequences in a gel-shift assay, those duplex polynucleotide sequences bound to the enzyme with higher affinity are represented as "shifted" molecules at relatively lower BamHI concentrations. This assay is depicted in Figure 1 as an inset. Lane 1 shows schematically the migration pattern of the unbound duplex polynucleotide sequence population. Lane 2 represents the migration pattern obtained at relatively low BamHI concentrations. Lanes 3 and 4 show how the migration pattern varies as the concentration of BamHI is increased still further. The populations of bound and unbound sequences are eluted separately from the gel and subsequently cloned into a vector, propagated and isolated. Sequencing of these clones reveals polynucleotide motifs that confer higher or lower affinity for the endonuclease.
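The reason higher-affinity duplexes appear in the shifted band at lower enzyme concentrations can be illustrated with the standard single-site equilibrium binding expression, fraction bound = [E] / ([E] + Kd). The Python sketch below is an idealized model introduced here for illustration; it assumes the enzyme is in large excess over the DNA, and the dissociation constants and concentrations are hypothetical values, not measurements from this work.

```python
def fraction_bound(enzyme_conc: float, kd: float) -> float:
    """Equilibrium fraction of duplex bound for a single binding site.

    enzyme_conc and kd must share the same concentration units.
    Assumes free enzyme is approximately equal to total enzyme.
    """
    return enzyme_conc / (enzyme_conc + kd)


if __name__ == "__main__":
    # Hypothetical dissociation constants (arbitrary concentration units).
    high_affinity_kd = 0.1
    low_affinity_kd = 10.0
    for conc in (0.1, 1.0, 10.0, 100.0):
        hi = fraction_bound(conc, high_affinity_kd)
        lo = fraction_bound(conc, low_affinity_kd)
        print(f"[E]={conc:>6}: high-affinity {hi:.2f}, low-affinity {lo:.2f}")
```

Under these assumed values, the high-affinity duplex is half bound at an enzyme concentration one hundred times lower than the low-affinity duplex, which is the behaviour the lanes of the gel-shift assay are meant to resolve.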
Examples 2 and 3
Pure and impure polynucleotide sequences can be used to probe flanking sequence reactivity. In these examples, polynucleotide flanking sequences surrounding a BamHI binding site (Figure 2) or a Sau3AI binding site (Figure 3) include pure or impure polynucleotide repeat units. Synthesis of a population of duplex polynucleotide sequences having pure dinucleotide repeats, pure trinucleotide repeats, impure dinucleotide repeats, impure trinucleotide repeats, and the like, can be performed according to standard methods, e.g., on a DNA synthesizer. Populations (mixtures) of the duplex polynucleotide sequences can then be incubated with varying quantities of a binding ligand such as BamHI or Sau3AI. By determining the relative binding affinity of pure and impure duplex polynucleotide sequences for the binding ligand, the ability of a polynucleotide flanking sequence to affect the reactivity of a ligand binding site can be systematically explored, and the results can be used to create a database of reactivity values and/or a predictive algorithm, for use in predicting or identifying the ligand binding characteristics which will be conferred upon a ligand binding site by any polynucleotide flanking sequence.
As depicted in Figures 2 and 3, a binding site for a ligand is provided. The ligand can be, e.g., a protein which binds to a nucleic acid, e.g., an enzyme, e.g., a restriction enzyme. As shown in Figure 2, the ligand can be the endonuclease BamHI; alternatively, as shown in Figure 3, the ligand can be Sau3AI. BamHI binds to the polynucleotide sequence 5'-GGATCC-3'; Sau3AI binds to the polynucleotide sequence 5'-GATC-3'; binding sites for other ligands can be employed as is known in the art. The binding site is flanked in both directions by A (to desymmetrize the construct), and then a 40-base long random insert is provided in both the 5' and 3' directions (longer or shorter random polynucleotide sequences can be employed if desired, e.g., to study the effect of remote polynucleotide flanking sequences on binding site reactivity). At both the 5' and 3' ends of the construct, PCR primer sites are provided to permit amplification of the construct, if desired. Each PCR primer site includes an EcoRI restriction site. The constructs used in this example are synthesized on an automated DNA synthesizer (although other synthesis methods can be used). During the synthesis, the synthesizer is programmed to provide a mixture of each of the four nucleotides A, G, C, and T at each position of the 40-base random sequences; thus, a population of constructs is created as a statistical mixture differing at the random portions of the construct. (It will be appreciated that only a subpopulation of the 4^40 possible random sequences can be obtained due to practical limitations on the amount of DNA synthesized.)
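The scale of the practical limitation noted above can be made concrete with a short calculation. The Python sketch below counts the possible sequences for a single 40-base random insert and converts that count to moles, assuming (for illustration only) that each sequence would need to be present as at least one molecule; the function names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def library_size(insert_length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** insert_length


def moles_for_full_coverage(insert_length: int) -> float:
    """Moles of DNA needed to hold one molecule of every possible sequence."""
    return library_size(insert_length) / AVOGADRO


if __name__ == "__main__":
    n = library_size(40)
    print(n)                                       # 1208925819614629174706176
    print(f"{moles_for_full_coverage(40):.2f} mol")  # about 2.01 mol
```

A single 40-base insert therefore has about 1.2 x 10^24 possible sequences, roughly two moles of molecules at one copy each, whereas the incubation described for this example uses 100 ng of duplex. The synthesized population can only be a sparse sample of the sequence space, and a construct with two such inserts has 4^80 possibilities.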
The population of constructs is amplified with PCR under standard conditions and purified by polyacrylamide gel electrophoresis (PAGE), followed by elution into buffer including 50 mM NaCl and 50 mM Tris-HCl (pH 8.0). The result is a population of polynucleotide duplexes (the PCR reaction provides the complementary strand). It can be determined through appropriate control experiments that the synthesis and PCR amplification of the duplexes results in correct binding sites for BamHI or Sau3AI. Aliquots of these duplexes are then incubated with appropriate quantities of BamHI or Sau3AI under conditions such that BamHI or Sau3AI will bind to its binding site on the duplex but will not cleave the site (100 ng of the duplexes containing 0.01 pmol of 32P end-labeled duplex polynucleotide is incubated with varying concentrations of BamHI or Sau3AI (shown in the legends of Figure 2 and Figure 3, respectively) in a total volume of 30 microliters of 50 mM Tris-HCl, pH 8; 50 mM EDTA; 50 mM NaCl; 1 mM dithiothreitol; 1 hour, 37°C).
After BamHI or Sau3AI incubation, the aliquots are subjected to PAGE analysis on an 8% native polyacrylamide gel and visualized. Duplexes to which the enzyme bound should show a retarded mobility on the gel compared to unbound duplexes, and low mobility bands are in fact seen. Shifted bands (low mobility; the sub-population with higher conferred binding affinity) and unshifted bands (mobility similar to duplex in the absence of BamHI or Sau3AI; the sub-population with lower conferred binding affinity) are excised from the gel and eluted overnight in 1 mL of 50 mM Tris-HCl, 50 mM NaCl, pH 8.0 buffer. The sample is concentrated into 50 microliters of the same buffer and amplified by PCR. The concentrated shifted and unshifted bands can be subjected to further rounds of band shifting, elution and amplification by PCR as shown in Figure 2.
Samples of shifted and unshifted duplexes, taken at each round of band shifting, elution and amplification by PCR or after the final round, are digested with 200 units of EcoRI per microgram of duplex polynucleotide. The polynucleotides are cleaved at the EcoRI recognition sites, purified by PAGE, and ligated into Lambda ZAP vector predigested with EcoRI and treated with CIAP, at a 1:1 insert:vector ratio in the presence of 2 U of T4 ligase in 5 microliters of T4 ligase buffer at 40°C overnight.
The ligated samples are then packaged using Gigapack II Gold packaging extract (Stratagene) and cloned into E. coli XL1-Blue host strain and subjected to blue/white selection. The recombinant (white) clones are selected and eluted in 500 microliters SM buffer (100 mM NaCl, 8 mM MgSO4, 50 mM Tris-HCl, pH 7.5, 0.01% gelatin, 0.04% chloroform). Ten microliters of the eluate is amplified by PCR using T3/T7 primers, purified by Qiagen PCR purification kit and sequenced.
In addition, methods such as the methods described above can be used to generate compilations of data for the prediction of the reactivity of a potential binding site for a ligand based upon sequences which flank the ligand binding site. Thus, as increasing numbers of sequences which confer high (or low) ligand binding or reactivity upon a neighbouring site are identified, the ability to predict the characteristics of a previously unknown flanking sequence will be improved, without the requirement of performing a binding experiment to determine such characteristics.
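One simple way such a compilation could be organized is to compare how often short sequence words occur in flanking sequences from the shifted and unshifted populations. The Python sketch below is only an illustration of that idea under stated assumptions: the word length, the pseudocount smoothing, the function names, and the toy input sequences are all hypothetical and are not taken from the tables or data of this document.

```python
from collections import Counter


def kmer_counts(sequences, k: int) -> Counter:
    """Count every overlapping word of length k across the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_enrichment(shifted, unshifted, k: int = 2, pseudocount: float = 1.0):
    """Ratio of smoothed word frequencies, shifted over unshifted.

    Values above 1 mark words more common in the shifted (bound) set.
    The pseudocount avoids division by zero for words seen in one set only.
    """
    s, u = kmer_counts(shifted, k), kmer_counts(unshifted, k)
    words = set(s) | set(u)
    s_total = sum(s.values()) + pseudocount * len(words)
    u_total = sum(u.values()) + pseudocount * len(words)
    return {
        w: ((s[w] + pseudocount) / s_total) / ((u[w] + pseudocount) / u_total)
        for w in words
    }


if __name__ == "__main__":
    # Toy inputs for demonstration only; not experimental sequences.
    ratios = kmer_enrichment(["GCGCGC"], ["ATATAT"], k=2)
    for word, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
        print(word, round(ratio, 2))
```

A table of this kind would grow more informative as more sequenced clones are added, which is the sense in which prediction improves with the number of characterized flanking sequences.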
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, numerous equivalents to the specific procedures described herein. Such equivalents are considered to be within the scope of this invention and are covered by the following claims.
The contents of all publications cited herein are hereby incorporated by reference.
TABLE 1: RANDOM 1/2X
SEQ ID NO:
1 TGGAGTCCACGCTGTGCCATCCCCTCTATGCGGTCGCGGC
2 AATTAAACGGCCACGGCTGCAGGGGCCAGGGATGGGCATT
3 CAGCGGGGGGGGCCGGTAACCTATGGTTATTGCGTTCTTG
4 GGTCGCCGGTCCTCATGGCACTGGTGTGCGCTCCTCGTGG
5 AGGAAGTAGGAGGGCTGCCCTGCTGGTGCTCTTGTGGTCG
6 GGTTGAGTTACGGGTAGACAGCCGAGCTGCGGCCCAGGAG
7 ACGCGGGCTGGAATGAGGATGTTGGTCGACCCGTAGCGTG
8 GACGTATGGGGTTTAGGACTCCCACGCGGCAGCCGTAGGG
9 GCCGCGGCTTAGGCGTCCGCTCGTGAACCTTAGCGGGGTG
10 TTGGGTGCTCCTTCCACCCGGGGGCTCCCGGCGGACATCT
11 CTCAGTGTTAGGGGTTGTGCAGAACATTTAGTAGTCCGGT
12 CTGTGGGTTCTGCGGGCGTGCGGTCCACCTGAGTTTGTTG
13 TAGCACGACCCCCCATGTGGGGGTAAAATCAGCTCGTAGG
14 GACTGGGGGCTCCCGGCATACCGGGGGATTGGGGGGGGTG
15 TGCGGTGCTGGAGGCGTTGTAAATGCAATGGGTGTTCACA
16 TCCTTAGCCGGGCAGGCGCCGCTTAATGCTCGGGTGATCG
17 GATTGGACATAAGCCGTGTTGGTAGTGCAGTCGTACAGGC
18 CTCAATACGAGGTGTGTGGTCCTTAATCATTTATAGTTGG
19 ACTTGCCGTCGGCGTGTTGTCCATGTGGCGTAAGTGCTGC
20 TGTAGCCGTGCGGTGATTGTTATTCTGGGAATATAGTTCG
21 CGGTTGAAGGACGGACCCGCCCGGGGAGGTGTGAGGTTTC
22 CAGGTGGCGTTTGAGTCGAGGTTGAAGGAGTGGAGGCGGG
23 CTGACGGTACGCAGTAGTTCGTGCCGCTTTTTGGATTTGA
24 ACGGCGTGAAGAGTAGTTATATTATGGCTCCGCTAGTTCG
25 GCGCGCGCCGGCGAGACGGACCGGTTATGTGATTCGGGTG
26 TGTGCCAGGAAGGTGTGTCCAGATGGTACTGCGGGGGTGC
27 TTTCGACGGTGACCGGGGATACACACTGGATGGACCGGGG
28 GGGCGTAGAGAGAATCGATTGAGGGATTGATTGTGTAGGT
29 GTCGTAGGGTACCAAATCCTCTTGCGGTGACTAGCGGGCA
30 AAGGGGCCCTGGCAGAGGACGAGTGGGGGTTGGTGCTGTG
31 GAGGTCGACAGCATGGGTTCGGGTGCCAACGTAGCTGCGT
32 CATAGGTAAGGGCGTTGCATTTGATTTGCGGGCTCTGTGG
33 GATAGCATGGTTCGGTATGGATGCGGCCTACCCTTAGGGC
34 AAGTCCACCCGGGTGATATGCGGGCACCATCCTGTGCCGC
35 GCCCGCTATCCTGCAGTCTCTGGCCAGTGCGAGGGTGGTG
36 GTATTGCGAACAGGGGGCGGCGCGGGGGGGGGTGGCGCGC
37 GTGTGCGCGGCGCTTGGTCGCTGCTCATGTAATAGGTCTG
38 AATGCCAGCTTACGCCATAGAGTGCGTTGGCAGTTGAGTC
39 GATGACTGGCGATACAGACATGGGGTAAAGCTCGGTGGGG
40 TTGCCGATCAACGTTAATAGGGGAAGCGCGTCTCGGGAGG
41 CGGGCCGTCGCCGGGCGATTACAATTGGGTGGTCTGGCCG
42 ACTGGAGCGAGGGGGCCGCGGAAACCATGCGCGGAATGAG
43 TGAATAGGGCGCGCACCTCGGGTTCTTGGGCGTCAGGATG
44 AGGCACAGTTGGGCTCAGCAACTGAGCATCATGGCGATAC
45 TTGGACGTGGACCGCGTGGCGTCGGCGTGACCCCTGTGCG
46 GCCCGTAGGCGGGCGGCCCCGTGGGACGATGGTAATGAAT
47 CCGTTCGTTATTTCCCACGGTTGGGTCGGGTGTCACGGGG
48 GGTTGCGACGTCGTGTTGTGATTGGGGCTAGGAGTGGCTC
49 GGCGCATGTGGGGGGGTACATATCCAGGAGGATAGGGTCG
50 GCAAGGGGGCTAGTTGCGGGTTTGTTGAGGTCTGAACTTG
51 TATTAGGTGGAAAAGCGCGCTAGGGTGGAGATTGTGTGTG
52 GGTGAGGGTGGCGCGGGCTAAAGAGCGGGCAGGGTTTATG
53 ATTTGCCGCTCATCTGGTTTGGATTGCGGCCAGCCACCGC
54 CTGGCGCCAAGGATGATCGGCTACCAATTAGGGACCCTGG
55 GAGTTAAGTTGTTCGAAATGTGCTGTGTGGTGGCCGGCGG
56 GTCCATGGAAAGTGCACGCGGGTGGAATATTTGTGTGTCG
57 GGCGATCTCTCATAGCCCCGGTGCGCTCGTCGGGCGGAGG
58 GTATACGTAGTGGGCCTTGTAGGGGAGCGTGCGGGGAGGT
59 GGGGCGCTAGACGTGCCTACTATCGATTTTGTTGAATGCG
60 AGTGCAAGAGAATCGGAAAAAGTCGCGCGAGTTTCCGGTG
61 TAGGGGCATTCTGAAATCCCTGTACTGCTTTGGGCCGTGG
62 ATGGGGTGCCGGGAGCGCTATGGTGAGCGCGGGGTCTCTG
63 GCGGCTGCGGACGATATAGGATAGGCTCCAGTGACCTGCG
64 AACCGTGATAGGACAACTGATCGGTGCCGGGCGAGACGGC
65 GGGTTAGGGTCTCGTTGACGTTGTGGAACGGCTTGTCACG
66 TGACGGGGACAGGTGGGCGGAACGGGTCTAGGTTTTCCGG
67 GACGCTCACTGAGTAGGGGGCGCCACCGGGTCGAGCTGGG
68 TGTCGAGTTTTTCGGGAGTCCGGGCGGCTTGCTTAATGTG
69 GCAGGGCTTTTAGCGGGGCTTTGGTGGGCGTGAGTCCCGT
70 GCTTGGGTGTATAATGACGGTAAGGGGAACGGCTGGGGGG
71 TCAAGTCGAGCGTCTGGCACCCTGGAACCTGGTCTCTGTC
72 CGAGCTGACGATTGTGATGGTGCTTGCCCTGGGCGCGCTG
73 GGGGTGTTAGCGCCAGAACCCATGCACTACTTCTTTAGGG
74 GACAGTAATGCGGTCTTCTGATGTCGCCCCCTGGTTTTGG
75 TTCTGCATGGCTTTACTTGGGGTATGTGGACGGTGGCCCC
76 GCGGGAAGGGCGGCAAGTAATGGAGTCATGCCTAAGGGCT
77 CCGTACGCTGTTGCGCTAAGCAGCGCCATGAGCCGTGGAG
78 GTTGCCGCACGAGCGGACCGGGTGCCTTTAGTCATCAGCG
79 TGGGCTGGGATTCCTGGTGGCGGTCAGCACTTGGTGTTGT
80 TGGTGGGAGATAGGCGGTCCGCTTTAGGGTTCCCGGGCCG
81 CCTCTGTTTGCCCCCTCGAGCTGTCGCGGCATAACCTCGC
82 CGGTGCTGACTGCATCTTACGAGTGCGGCGGTTGGTTCGC
83 CAGTATCACGCGAGCCGTGCAGTCGGTGGTATTCGGCGGT
84 GGCTTCTGTCTACATTGTGGCCGGGTCCGGCAGAGTGGGA
85 TTTCTACCTCCGGGTTCGGCTTGTCACGGGTAGACGGGGG
86 CGAGAATGACAAGGGCGCGTGGGCGAGTTATTGCTGGCCT
87 GCTTTAGGTCAACAGTTGGTCGCGCACACGATATGGGACG
88 GACGCGAACTTGGAGTGTCGGGCGGAAGGATTGATACTAG
89 GCGCGATGAGGCGATGGACCGTCGTGCAATCGGAGGGCGG
90 GAGCTGCGAGCGAAAGTTGTGTGGTTATTTGTGTGTAGAT
91 GGAGAGCTCGCAACTCTTGATGGCTAGAACGTAGCTGGCG
92 GGTGAGCGTTACAAGCTCGCAACCCTGTCGATAAACGGTT
93 GCTGGCACCTGGGGCTTGGACAGGGGGGCTGTACTTGTTG
94 ACGTGGGCACGCAGCAAGGCATGGCAGGTCCCTGCGGATT
95 GCCTGTATGGCCGGGCGCTACTAAAAGACCTGGTTCTGTA
96 GTGACAACGTACGATGTTTCTCACACCTGACAGCGAGAAC
97 GTTCACGGAAATTGGGGTCGGCACTGAAACATCGTGGGGG
98 GGCAATGGTCGCGGCAAGCCTTTCGGCACAAAGTGGAACC
99 TGACAGCGTCGTGGATGTGGCGGGATTGGTCATGCCGGGG
100 AGCTTTAGCGAACCGGCTCGTGAGATTCGCACACACGGGT
101 TTCCGTTTGCGGGGCCAAGAGTGCGCATCAATGGCCGGGG
102 ACCGGGCGAGAAGAGCGTTGGCGGTCCTTGGGATGTCACA
103 GCAATCACTACACTTGTGTAATCGTACGAGGTTGGGCGTG
104 ATGGCGCAGCGCGCGCGGGACCTACATTGGTGGCAGGGTG
105 CTAGGTCTAGCTCGTGGTGGGCGGGGGCGTCAGTAGTGTG
106 GCATGGGTTCACGGGCTAGCTGAGTACCTGCTGGTGGGTC
107 CATCCCAAGATGGGACCAATGGCGTGAGTACGTACGGTGG
108 GAGTTGGAGTGCGGCGCCGGTCGTTAGAATGGAGGTTAGC
109 GTCTCTGGATTATGGTACTGTGGGCTGAGTAGTGGTAGCA
110 TCGCGTGGCGGGTTTGGTCCTCTCGTTTTTTCTTGTTGGG
111 TGTGGACGCTGCAGAAACCGAGTTCGGCCTCAGTCCAGGC
112 TGCATGGAGGTCGGTGCGACGCGTGCGAGCGGGGACGCTG
113 AATGGACTGGTGGTGAGAACGGTCCAACCAGCTCTGTGAT
114 GGATGGCGTATGCGTCGGAGCCGCGCGCGCGGGAAAGGCT
115 ACATTTGGCAAGCCCTCATCGTGGGGTTTAAGTCCGGTTG
116 GATCGGTTAGTGGTTCGGCGGTTCGTCTGGCGGGTTAGGG
117 TGCCGCGGTCCGTCTAATCAGGGCGACTTGCCGGTTGGGG
TABLE 2: RANDOM 2X
SEQ ID NO:
118 GTCCGAGCTGGCAGGTGCTATGGCTGGGTGGTGTCTCTGG
119 TGGGGGTGGGGTTCGGGCGACGATACGTAACGCGGGGTGC
120 GGATTATGTCTCATAGATTGCCCACCTGTGGGAAGTTGGG
121 CACATTACGGATCGGGGTAACGGAGGCGCGCGGGGTGGAT
122 GTACCAGAGAGCGGCGGCTTGGGTTGATGCAGTTCATGGG
123 TTTGTGGACGGTGGGGCGTGTGAGCTTACCTTGTTGAGTG
124 GCAGGTTGACTTTAGTGGGTAGTGGAGTTTCGATGAGAAT
125 TTCATGATGGTATTCGTGTATGCTTTCATCTTCGTGGTTG
126 TTTATGGATCGTCCGAGTTTTCAGAATGACGCATTATTAT
127 GCAGATTAGATGTTTTTGGCGGTCACAGATTATTGGGGCG
128 ACGGGTGGTAGTTGTGTCCGCGCGGTCTGATGTGGCAGGT
129 TGGTGACAGTCAAACGAGGTGAGGTGTACACGTACCACGG
130 GGCGGTTGACGTCTGGGTGGTTTATTGCGGGCCGATATTG
131 ATCCTACTTGTCGTTGTAAGTGGCTTGTGGCACGGGGGTA
132 GGAGCGGATTACGCCGTGCGTCATAAGGGGGGGAGTTAGT
133 GGGCTCGGTTTTGGACCGGGGTACTAATGGCTTGGGCGGG
134 AATGTGTCGATGATGGGCCCAAATAGCCGATGTCTTCTTG
135 AGGCGTGGCGTGTGCTTATGTGGGAGTGTGGTCGACGTCT
136 TCTGTCGAGGCTACGCATGTTAGGGTAGGGGTGTGAGGTG
137 CAGTTTGGTCGTGGTGCAAGGCATTGGGTGAGGAGTGTGG
138 ATTGTGTTAGAAGTGGGCTATTTTTGAATCACTGCAAAAC
139 TCCAGGGTTGTATCCCGGGTATGGCTGATAGGAGTTGCTG
140 TACTTATTTGGGGCTGACGGAGTCAGGGTTGGGAAAGGAT
141 TGGACTATATTGGTTTTTTTGCTGGTTCATGCTAATTGCG
142 GGGACGGCGATCGGGTCTGCCAAGACTATTCCTCCATAGA
143 GTGGTGGCCTCACAGGCACTGTAGGCGGGCTTCATTCGAA
144 GGCGTTTCAGGGGTGGCTGCGCTGTACGTGGTGTGTGACG
145 CTGAGACCGCGCGGGTTGAGATGTTATGGGGTGGGCGGTT
146 AGTTGGCGTTGACCACTAACAAATTACTGCGGTTGAGGCG
147 GGTTCTCAGCCGCGGCTAGGAGTAGGCAACAAGGCTCAGG
148 CCGGCGGTTGCGTACCGTGGTTTTGGGCACGGGTACCTAT
149 GGGGTTGTGCCGGGCTGGGTCGTCGTTGGTGGGTGCCTTG
150 GGCACAGCGGCACTAAGCTCCATAGTTGTATCTGCCCCGG
151 CTAATCAATTTCCCGCAGTGTTGGCAGCCTCATTGTGATA
152 GGACGGCTGGGGTGGGAGTCTTAGATACGTAGTTGGACTG
153 TTGAGGCGGTGAACCTTTTATGTACGGGAGCCAGCACACG
154 GCGAAGTGTCTGGATGCACCGGGTACTGGTTACGGGCGTG
155 GTCTGAGCATCGGTGATATCTGGGCCGGGGGTGTGGGTCC
156 ATAGGTACGAGGCCGCTCGTGAGTACTGAGGGACCCGCGT
157 TTGGTAGGGTGAATTTTGAATTACCCAGGTGGTCATGTGG
158 CTCAGCGCGGCATGGTGGAAGGGCGAGAGATTCCAGCGGT
159 TTACGGGAACCAGTGCGTGAGCCATGTGTTTGGTTGCGTT
160 GGTAGGGTGTGAGACTGGTTGCTCGCGGTGAAGGGTCGTT
161 ACTTGTACGAGAGAATTGTGGGTGGGGGAAGACATTTGGC
162 GCCTCCGATTTGGGAGCCCTCACTTTAAGGGGCGGTCGGT
163 GGGACGTTAGCGGTCGCATCTGCTTAGTTCTGATGTCGCA
164 CTAAGACGGGCCTGGGGCTGGGAACATACAGCGTCAGGCT
165 CGTATGGCGCGTGGGTTTATGCAATCTGTCTGTAACTCCG
166 CCATTATTAGCGGGTCAGGTACATGGCTGCGTGCACGGGC
167 ATGGTGGACCCATGGGGAGTTGTCAGCGCCACTGTAACGG
168 TTGGATGGATCATTAATGATGGGTGATGGCCTTGATCGGT
169 AGTGCCTCTGATTGGCCGCGGCGATGGCATAAGCTGTCTG
170 TGGCGCGAGGATCTCATGCGTCTTGTTTGCGGAGGGGGGC
171 TGCGCCGGCGCTCTGCAGATGGGGACCGTCGCTGCCCGGG
172 ATTGCGCTACCTCAGGTGGTGGTAGATACGAGAGCCCTAT
173 TGTGCGGACCGGCTGTACTCGTGAAGTCGTCGTGGTTGGG
174 GTCGGGGTGGATGCTGTTGGGGGGGGCGCTGCAGACGTCG
175 GGGGTTGTTGTTGTAGTAGTCGCGGGGGACGATGCCGGGG
176 CTGGCCTCTTGGTTAGTCAGTGGTGCCTGACATTGGCATG
177 TAGTAGCTGCCCGGTTGGATGTACTGGAGTGCAAGCGGCT
178 GTTGCTCCAACGGGTTGCGCATAGGCTATGGGTTCACGTG
179 GCATTCCGTGGCCGGGGGAGTATTGATTATGTTCGGGTCG
180 GTGTACGGCCACAACGCGTGGCGGTTAACAGAAGGGGGCG
181 CCTTTGGTTTCTAGGCTGGGGTTTCGACGATGGCGGGGTC
182 GTCTTCCGATGCCCGCGGGTCTCCCGCTTCTTCTTCAGCG
183 GCACAGTGATGTAGGCGGAGCCTCTCCATGAGTACTTGGT
184 GAGCTCTCCGCATTTGAGTCCGATGATTTGAATGGGAAGC
185 GCCAGGGATTTGCTTTTGCGGAGGGGGGATGTGAGCCGTG
186 TCGGAGCGACGGGCGGGGCGTAGGGGTTCGGGAATCTCTC
187 GCGGATCCGTGGCGTTCAGGGGGGTCAAAAATGCACTCAG
188 TGCGGGATAGAGTCTAGGGAGTGGGTGGACGTACAGTTGA
189 CCTTCTCCGTGTGCGAGTGTAGCAGGTATAGAAATCTGCG
190 TGGGGGTCCCCGGGTCGTGTGGGTTTCTATGGCGATCTGG
191 AGCCGGGTTCAATTCGGATTGGCGGTGACTAGCGGCCGGC
192 GGGTGGGACGGGGAAGGAGCGCTTCGGGGGTCGGTGGGTG
193 AAAGTCGGGATTGCAAGCAGCGGCGCTAAACGGCGCTGTG
194 TCAGGGTGGACCGGAAATGTCAGGACTGTAAGCTGGAGTG
195 AGCACGTTGTTCTGTGTGTACCTATCGGGCAGGTAATGAC
196 CCCTGCTGGTAGGTGGTGGGAGGTTGGTTGTTTAGGGGCG
197 GATGACCCGCGGGTTTCGGCGTGGAGCAGCATCTTGCGCT
198 CAGTGAGGCCGATAACGGGGGAGTGCCTTTCGCCATGTTG
199 CCCACCTAGCGCTCGGAACAACGTGCGGCGGCACTTTGTT
200 GCCCGCGGCTTGGTTGCGTTGGCTTCGTCTCCAAGTGCGC
201 CTGGATTGCCGGTAGATTTGGGTGGCGCAGCCTGCGTCCG
202 GTGGGATTGGCTAGGACTGCAGTTAGGGTGGGGGTTTCGG
203 GTACCGATTGCAGGGGGAAAGCCAGTGGGGGCGGGAGAGT
204 GGTAGGGACTCGCACGGTCTTCGAGGGCGACGGGGTGTGG
205 TGGGGTAAATGTCAGCGGGGCTGAGCCCCACAAGCTGGGT
206 ATTATTGGGGGTGTTTATAGTGTTGGTTGTCAGGTGGCTG
207 CAAAGCTGATCTGTGTCGTACGATCGAAGAGAGAGGAGCG
208 TGAGTTCATGTACCGCGTCGGGATGGGGTTGATGACAACT
209 GAAGTCCGGCGTTCCCTGATCACTTGTTGTGCGCTGTTGG
210 GATGTGGAGCGTATGAAGCATAATTGGTGGGATTTGAAGT
211 GCGTGGGGCGGATTTACATGTGCTGGGATGCAGTAGCGGG
212 CGCGGGGTAGGTCAGGGGGTCGTTGGGTTCGCGATCGTGG
213 GGGGCACGGTGGGAAGTCGTGAACTTAGGAGGGGCGGAGC
214 TGGCTCGTGCTATGAGGGGGTTTTATTTTGTTTGGGGGGG
215 CAGGGGGCTCAGTTAACGGTCGAATCTTAGGAAATTAACA
216 GTATGGCCCATGGTGGCATCGGATCTTGGGAGGTGAGGCG
217 CTTCAGCGGGCAGATGGGGGATGCGACGCTTGACTAAAAC
218 AGCGAGGGTGAGCTCGGCTCTGGTTTGCAGTTGTGTCTGG
219 GGCGGGATCGAGCGCATCGATCTACCCATTGTTATAGTGT
220 CGGGGGCTCTCGTTCGCGTGGCAAGCTGGTGTGTTGCGAG
221 TAGGTTCCGTCTTCCCAATACGTAATGTTATTCACACGTG
222 ATCCGCTGGCCGTTGGGACTCTATGCCTAGTCGTAGCCAC
223 CCCCCACCAGTCATGCACAATACTTTGTTTAGAGGTGCGC
224 ATCATGGGTCGGTGGGCGGACTGCTGGTATGGCGTCGGTA
225 TCGAACGGAATGGTCGAGCTGAGAGGACCGCGACGTGGCG
226 ACCTGGCATTTTACGAACTACAAGGATTTAGTAGCGCGTG
227 TCATAGTAGCGGACTTGTTTATGTCCGTCCTGATGTATAC
228 GTTGTTAGAGCGGCTCTGATTCGTGTGTTTGAGCGGTCTG
229 GTGGGTACCGGGCGTTCAGATTTAGGTCGGGGCTCGGACT
230 GGGCGGCATGGGGAGGGACGAGGGTGAGAGGGGGCGGCGG
231 GAACATGCGCAAATGGATTGTTGGTGTGCCGTTTGCCGGC
232 GCCGAGGGGTGGTGTTCGCTTTGCGGGGTGCTTGGGGGTC
233 GGTTCTCAGCCGCGGCTAGGAGTAGGCAACAAGGCTCAGG
234 CGTTTTTGCATGCTCAGCGGTGTAGTGGTTGTAGTTGCGC
235 AGCTAGGGTGCAGCCATTGCGGTCGTCCACGGGGGGGGGG
236 AGTCTGGGGGTTGGACGCGTCAGTTCGTACGGTGATCGTG
237 GGGGGCATAGGTTCCGTGCTGCGGCGGTGCGGCGGACGGC
238 TGCGGGGCGAATTGAGCTGGGGAGCACATTAGGTCATTCT
239 TTGTGGGTGGCGGATCTTGGCCAAGGTGGAACTCGGTGGG
240 CGATCTTTAGTTGGACAGGGTGGAAATGTGCAGCGGCGAT
241 CGGCATGGAGGTCTACGAGCTGGGGATAGTTTGCACTCTG
242 TTGATGGATGTCGGGAGTGCAGGCGGAGTGTGGGCATTGG
243 GTGGTCTGGGTGGTCGCCTTGGCAAGAAGACATCCCTTGT
244 CTAGAAACAGTTAGGAGGAGTTGTGGAACCCTGTGAGGTC
245 CGAGGGGTCTGAGAGGGAACTGACTTGGTTATTGCATATT
246 GCTATGTTCTGGACGTTGCCCCGCTCTGTTGCTTTTTGTG
247 GCGCAGGTATGGAGCTGTGGTGTAGCTCGTGTGTATTGCC
248 GGAGTGGTGAGTGGTAGGCGAGTTCATGCGAGTTGTGTGG
249 CCCGGGCGGGGGGTCGGGTTTAGGGTTTTGTGGAGATAGT
250 GTGGGGCGCACAGCTGGAGGGGCAGGTTATCCGTCGGTGG
251 TCCTTACCTGAGCAAGTTTCGTGTGGGGCTCCTTAACACT
252 CGGTGGTACTGGTGTGCAGTGTGGGTGTAGGGCTTGGTGG
253 GAGGTGGTCGTATGCGGGGGGCGTGTTCGGGTTTGTATCT
254 GACAAATGGCCTTTTATAAGCTACTTTTGGCCGCACGTAG
255 CTTTAGGATCGCCGAGATAGGGCTAGCTTCAAAGGGTGGT
256 TGGTCAGGTGCCTCAAGGCGGGGAAAGAGTGGTCTTGCTG
257 GTGGGCGAACCGGCAGAGATTACATCGTTGTTGCCGTGGT
258 TCGCCGCCAGGGCTCGGCGACCACCGGTACTGTTTCTGGG
259 ATGCTGGTTGAATGATCTTACATTGCTCCACTCAGGCACT
260 TGCCGGAGGTCGTCTCTATAGTGTGGTTGTTGGGGCACGG
TABLE 3: GXX 1/2 X
SEQ ID NO:
261 GATGAGGTCGGGGGAGGAGAGGGCGTGGCGGATGCGGCGG
262 GCCGAGGCTGAGGTTGCGGTCGGAGATGCGGTAGTAGCGG
263 GGTGTAGAAGAAGATGCTGTTGGCGCAGTGGATGAGGGTG
264 GAAGGAGGTGGCGGTGGGGATGCGGCTGAGGACGTAGCGG
265 GGGGTGGGTGGGGTCGTGGTGGTTGAAGGGGGAGTGGTGG
266 GAAGGGGATGCGGTGGTCGGCGTAGTGGTTGCTGTCGTGG
267 GCCGCTGGCGCGGGGGGAGTTGGTGCAGCGGACGTCGACG
268 GAGGTCGGCGCAGGAGCGGGCGGTGAGGCGGATGGGGGTG
269 GATGCAGGTGCTGGTGGCGCTGCTGCTGCCGATGTTGGCG
270 GAAGAGGTGGCTGGGGGGGGCGGGGGAGAAGTAGACGTCG
271 GTGGGTGTGGCCGGTGCAGTAGGAGCCGACGCGGTCGGGG
272 GTCGATGGCGCGGTTGATGGAGTTGTGGTTGGCGACGCAG
273 GCCGTTGTAGAAGTGGGCGTGGGAGGAGGCGTTGCAGGGG
274 GCTGTAGCAGAAGAAGTGGCCGGCGTCGTCGTCGGTGGCG
275 GCAGGCGGGGCTGCAGCGGGGGTGGAGGTAGTGGGTGGCG
276 GAGGGTGTGGCTGACGAGGATGTCGGGGTAGATGCAGTAG
277 GTCGCGGCAGGTGGAGTGGGAGTCGTCGGAGATGGGGTGG
278 GCCGTGGTGGAGGAAGCTGGCGTTGGAGTCGCGGCTGTTG
279 GTGGGGGCTGCTGGGGGTGGCGTGGGCGCGGATGTCGGTG
280 GCTGTAGTTGCCGCAGCAGCGGGTGAGGAGGTAGCGGGGG
281 GCAGTTGAGGAGGCCGAGGGTGTAGGGGTGGTGGTGGGTG
282 AGAAGCCGCGGCGGGGCGGGGGGCGGGGCAGGCGGAGTAG
283 GGGGTGGTGGGTGTTGATGAAGATGTTGGAGTGGATGCTG
284 GCGGGAGGAGTGGTGGTCGCTGGTGCGGGAGTCGGGGGTG
285 GAAGTTGTGGCAGGCGTGGACGCCGCCGGGGGGGTGGATG
286 GGTGTAGCCGCAGGAGAGGTCGCGGCTGTGGTGGAGGGGG
287 GCGGAAGGGGCCGCTGGAGGTGCGGCCGGAGGGGAGGGCG
288 GTTGGTGGAGGTGCGGGTGGTGTCGATGCGGTTGTGGTGG
289 GCGGCTGAAGAAGGCGGTGGGGGTGGCGTGGGTGGAGAAG
290 GTCGAAGGTGGGAGAGGACGGTGTCGCCGGGGTGCTGAAG
291 GCAGGGGGCGTTGACGATGATGATGCGGATGGTGTTGGGG
292 GGAGTGGTGGAGGGGGAGGACGCTGCGGGGGTTGGTGGGG
293 GCGGTAGCTGGGGTCGCCGTCGCTGTCGGGGACGATGCGG
294 GGGGTAGCAGGTGTTGTTGCGGCCGCTGAGGGTGTTGTAG
295 GTAGGGGTAGAGGACGCTGACGGTGGGGCTGGGGATGGCG
296 GATGGCGCTGGAGACGTAGATGCCGCGGCGGCGGAAGTAG
297 GGAGGTGGTGCCGAAGTGGTTGGGGGGGGCGACGTGGGTG
298 GTCGACGTGGGTGGCGATGCTGTGGAGGGAGTTGCTGATG
299 GGTGGGGGGGAGGCGGCCGGGGGCGCGGCAGTGGTTGGGG
300 GGAGGGGAGGCTGGCGCAGGGGCTGGCGAGGTTGAGGTTG
301 GGAGGTGTTGGTGTGGTGGGGGTAGGTGCTGCCGGGGTGG
302 GGTGTTGGCGTTGTCGTTGTGGAGGGTGGCGATGTTGTGG
303 GCTGAAGGAGTACGTCGTCGGCGCGGCGCAGGTGGGGGAG
304 GTAGCAGATGTTGCAGAGGCCGTAGCCGCTGCTGCTGGGG
305 GTTGTTGACGTTGGCGCTGATGTTGTTGATGGTGGCGGTG
306 GTGGTCGTGGGTGTAGCAGCAGTAGAGGCGGGCGCTGTGG
307 GACGTAGCGGGAGATGCTGCGGTCGACGACGCCGGCGGCG
308 GGGGGTGGCGAGGCTGGTGGGGTGGCCGAAGTGGCGGCTG
309 GCAGCACCAGTAGTAGGGGACGATGTTGTGGCTGGTGTAG
310 GCGGGGGGCGATGGCGGAGTTGGCGTGGTTGATGGTGATG
311 GACGGTGCAGTAGCAGAGGACGGGGTGGGTGCCGGAGGGG
312 GTCGGTGGTGCGGGAGTGGGGGTTGTAGAAGCAGGTGGTG
313 AGTGGGATGGACGCGGAGGCGGCTGAAGGGGTGGGGGAGG
314 GAAGGAGCGGAGGCGGTCGATGGAGATGAGGATGGTGGGG
315 GCAGTGGCCGCAGATGTAGTGGAAGGTGTGGTAGGTGTGG
316 GTGGATGCCGAAGGCGCTGTCGGGGGAGTAGTGGGGGCAG
317 GCTGTGGACGGCGATGCCGAGGACGCTGAGGTAGCGGTAG
318 GGTGTGGGTGCCGTGGCTGGCGTAGTGGCCGTCGTAGTGG
319 GGTGCAGTAGTCGAGGGTGGCGGTGCTGGAGGAGACGTAG
320 GAGGTTGACGTGGTCGGCGCGGCAGTTGGAGGGGAAGTGG
321 GGGGGGGGCGTCGAAGCGGGTGCTGGGGGAGAGGACGGTG
322 GCCGCCGGAGTTGCGGTCGTCGGCGATGTCGAGGGCGCCG
323 GGGGGAGTAGATGCGGTTGTTGGTGGAGGTGTGGTCGTTG
324 GGGGCAGAGGACGAGGACGAAGGGGATGTCGTTGGCGGCG
325 GCAGGCGTGGCTGCGGGTGGAGGGGCTGACGTGGTTGTGG
326 GAGGACGGTGCAGACGACGATGTGGTGGTAGGCGCTGTTG
327 GCTGGCGTGGACGGCGATGCGGGGGACGGGGAAGTAGGGG
328 GTGGCTGCGGCTGCTGCGGCTGTGGTAGCCGGTGGAGAGG
329 GTAGACGTGGTTGTTGCGGGTGGTGGTGTTGTCGCGGTGG
330 GGGGGCGCCGTAGGAGTTGTCGATGTTGGGGGCGAGGTGG
331 GTCGTGGCCGACGATGCTGCTGTTGTGGCTGCTGCGGTGG
332 GCCGCAGGTGTGGCGGATGAGGTGGGAGGTGCCGATGCTG
333 GTTGATGACGGGGAGGGCGGAGCGGGGGGGGGCGTGGCGG
334 GATGATGTCGGCGACGCGGAGGAAGTGGTTGCCGGGGTGG
335 GGAGGGGGGGGTGGTGTAGAGGTGGCTGTTGTGGTTGCCG
336 GCTGTGGCGGGGGTAGAGGGGGAGGGTGACGGGGCTGTCG
TABLE 4: GXX 2X
SEQ ID NO:
337 GATGCCGGAGTAGGTGCGGGGGGCGCGGATGATGAAGCTG
338 GTTGTGGTCGTCGTGGGGGAAGTTGGAGCAGGAGTTGCTG
339 GGGGAGGGGGAAGTTGGCGCGGTTGAAGTGGGTGTGGGTG
340 GACGAGGTGGAAGGGGTGGCTGGCGCGGGTGTCGGGGTAG
341 GTTGTTGGGGTAGAAGGCGAGGAGGTAGACGGTGCTGAGG
342 GCTGTAGGTGGAGCGGGTGGAGTGGAAGGAGCCGGTGTGG
343 GTAGCTGAAGCGGCTGGCGTCGGAGTAGTTGTGGACGTTG
344 GATGCTGGCGGTGGTGCGGCGGACGCAGAAGGTGGGGGTG
345 GATGGGGTCGATGGCGTAGCCGTCGCGGGTGGTGCTGGGG
346 GCGGTCGAGGTAGTTGTAGAGGCTGTGGACGTGGTTGGTG
347 GTGGTGGTCGTAGCTGTTGGAGCCGGATTCGTGGCCGAGG
348 GCCGCTGAGGGAGGGGAAGTCGCCGGAGCTGCAGAGGCAG
349 GGAGATGGGGAGGTTGGCGCTGGTGTCGCTGCGGGCGTCG
350 GTCGACGGCGCCGAGGTTGGGGGAGAAGGGGATGGGGTTG
351 GAAGGTGGGGACGTAGACGCGGGAGGAGAGGCCGTGGTAG
352 GGCGTGGGAGGGGGAGTGGTGGTATAGGCCGGGGTGGCTG
353 GTAGGGGACGTGGCGGGAGCTGTGGGTGTGGGTGTGGATG
354 GACGTGGCCGGGGGCGCCGCCGGCGTGGATGATGGGGGGG
355 GCAGGAGTTGAAGGAGCTGTTGCTGGGGGGGGTGACGGTG
356 GGGGCCGCCGGCGTGGGCGCGGTGGAGGGCGGAGAGGCGG
357 GCTGTTGCTGACGGAGCGGGGGACGGGGGGGGAGTTGTGG
358 GGGGTCGATGGGGGGGACGTGGGAGCCGGGGTCGCGGTCG
359 GTAGTGGCTGTTGAAGGCGGCGACGTCGGGGAAGGGGAGG
360 GGTGGCGTCGGTGGCGAGGGTGCGGGAGCGGTTGTTGGTG
361 GGGGTAGTCGGTGTGGCAGCTGATGACGAAGAGGTAGGCG
362 GGGGGGGTCGCGGGTGTTGCTGTAGAAGCGGTAGGTGTTG
363 GATGGGGTAGTTGACGCGGTCGCTGGAGCGGCGGTCGATG
364 GGAGACGCGGGGGAAGGTGCTGGTGGGGGGGTCGATGGGG
365 GATGTGGATGCTGGGGACGCGGGAGGCGGGGTTGGCGCCG
366 GTAGGAGGAGCCGTGGTTGGAGATGCAGGAGGGGTGGATG
367 GGAGTAGGGGAGGTTGTGGCCGGTGCTGATGCGGCCGGGG
368 GTCGTGGATGGGGTGGTAGACGGGGAGGGTGAGGTTGTCG
369 GTCGCGGTAGAGGTAGTTGTGGGGGCAGTCGCGGGGGCAG
370 GATGCGGTGGAAGAAGAGGGGGTGGGCGGCGGAGCGGTTG
371 GCGGTAGAAGGCGGAGGGGCTGATGTTGTTGGTGTTGTGG
372 GTGGACGATGCGGTCGTGGCAGGAGCCGACGCCGGTGTAG
373 GTCGTGGTTGACGGTGCGGTAGATGTAGGGGTTGACGTGG
374 GTGGTGGTTGACGTTGATGTCGGGGTGGGAGTCGGTGGGG
375 GTTGTAGTGGCTGGCGATGCAGATGTCGAGGCTGGTGCCG
376 GGAGTCGGCGCTGCGGTTATTGGTGGGGGAGGAGGCGATG
377 GCAGCGGCAGTGGCTGTGGCAGACGGTGTCGTCGTAGAGG
378 GCTGCCGGGGTCGGTGATGTGGCAGAGGGTGCTGTGGTTG
379 GCGGATGGCGGGGTCGAGGATGGGGTGGTAGGGGAGGTTG
380 GAAGCCGCGGGTGGAGGCGGCGCCGAAGGGGAAGTCGTTG
381 GGAGCGGAAGAGGGGGAGGGAGACGCAGGCGCTGGTGTGG
382 GGGGTAGGCGCGGGGGAAGAGGTTGGTGTGGTCGGCGTTG
383 GGTGTGGGAGCCGAGGCGGGGGAGGTTGGCGGAGCTGTCG
384 GCCGAGGCAGGGGGAGTTGAGGCAGAGGGGGTCGTTGTGG
385 GCTGTGGTTGAGGAAGACGCTGTTGCTGCAGTAGGGGGTG
386 GATGATGTGGAAGCCGGGGGGGATGTAGTTGTCGGTGCGG
387 GCGGCGGGGGACGTCGCAGGCGCAGTAGTTGCGGTCGCGG
388 GTGGTCGTTGAGGGGGTTGGCGGTGTCGCCGGTGAGGGGG
389 GGTGTGGGGGTTGACGTCGTGGGCGGGGGTGAGGTAGGGG
390 GCAGCTGTAGGTGGTGCTGTTGCTGCTGGTGCCGCGGACG
391 GTTGCGGGGGTAGCGGATGTAGATGTAGCGGGAGGCGTGG
392 GCTGCTGTCGTCGCGGCGGTAGGAGGCGTCGTGGCCGTAG
393 GGCGCAGAAGGCGAGGCAGTCGGGGGTGTGGGTGTTGTGG
394 GTCGGCGCCGGGGTCGCCGCTGAAGTTGGGGTAGGTGGTG
395 GGAGGTGCGGCGGGAGCGGTTGGAGCGGTAGGGGCGGGTG
396 GGGGCGGTAGCAGGGGTTGTTGGGGTTGGTGAGGGGGCCG
397 GGAGTCGTTGCTGTTGACGTCGTTGAGGTCGACGGTGCTG
398 GATGTTGAGGCGGTCGATGTAGCAGTTGTGGGCGGCGTCG
399 GAAGCCGCTGGGGCAGTCGGGGGTGGTGAGGGGGTCGTGG
400 GGCGAGGTAGTTGACGGTGGGGGCGCCGCTGGCGACGCGG
401 GACGTAGTGGCAGTTGGCGCCGTGGCGGTGGTCGGGGAAG
402 GGCGCCGTCGGGGTCGGCGTAGTAGCGGCAGGAGTTGTGG
403 GGGGCTGATGTGGGCGGCGCTGGGGGTGTTGGTGTTGGGG
404 GGCGGTGCGGCGGAAGCCGAGGCAGGCGGCGAGGATGCTG
405 GCTGTCGCTGCTGGTGTTGCGGCTGTTGTAGACGGGGGGG
406 GCGGTAGCTGGTGGAGTAGGGGGGGTTGGTGTTGAAGTTG
407 GAGGTCGCCGTGGCCGGGGTTGTCGGGGGTGAGGTTGTGG
408 GGGGTTGTAGGAGCAGGCGCCGTAGCTGTTGTTGGGGACA
409 GCGGGGGCTGAGGATGTAGTAGATGGTGGGGGGGTTGTTG
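
The "GXX" label on the two preceding tables suggests that a guanine sits at every third position of those flanking sequences. The following is a minimal sketch, not part of the application, of how that pattern could be checked for any listed sequence. The function name `has_fixed_every_nth` and its `offset` parameter are hypothetical and introduced here only for illustration.

```python
def has_fixed_every_nth(seq, n=3, base="G", offset=0):
    """Return True if `base` occupies every n-th position of `seq`.

    Positions are counted from `offset` (0-based), so the defaults test
    positions 0, 3, 6, ... for a G, i.e. the G-X-X repeat.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    seq = seq.upper()
    return all(seq[i] == base for i in range(offset, len(seq), n))


# SEQ ID NO: 261 and SEQ ID NO: 282, copied from the GXX 1/2 X table.
SEQ_261 = "GATGAGGTCGGGGGAGGAGAGGGCGTGGCGGATGCGGCGG"
SEQ_282 = "AGAAGCCGCGGCGGGGCGGGGGGCGGGGCAGGCGGAGTAG"

if __name__ == "__main__":
    for name, seq in (("261", SEQ_261), ("282", SEQ_282)):
        print(name, has_fixed_every_nth(seq))
```

Not every listed entry need pass such a check; SEQ ID NO: 282, for example, begins with A rather than G, so a check of this kind is better treated as a screening aid than as a definition of the table.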

Claims

What is claimed is:
1. A method of determining the nucleotide composition of flanking duplex polynucleotide sequences which confer relatively high binding affinity to an adjacent polynucleotide binding site for a given ligand from a plurality of duplex polynucleotide molecules, comprising:
(a) providing a plurality of different duplex polynucleotides wherein each of the duplex polynucleotide molecules has the same polynucleotide ligand binding site and a randomly synthesized polynucleotide sequence flanking the binding site;
(b) exposing the duplex polynucleotide molecules to a ligand selective for the binding site;
(c) isolating duplex polynucleotide molecules which bind to the ligand;
(d) determining the nucleotide composition of the flanking duplex polynucleotide sequence by sequencing the flanking duplex polynucleotide sequence adjacent the ligand binding site.
2. A method of determining the nucleotide composition of flanking duplex polynucleotide sequences which confer relatively low binding affinity to an adjacent polynucleotide binding site for a given ligand from a plurality of duplex polynucleotide molecules, comprising:
(a) providing a plurality of different duplex polynucleotides wherein each of the duplex polynucleotide molecules has the same polynucleotide ligand binding site and a randomly synthesized polynucleotide sequence flanking the binding site;
(b) exposing the duplex polynucleotide molecules to a ligand selective for the binding site;
(c) isolating duplex polynucleotide molecules which do not bind to the ligand;
(d) determining the nucleotide composition of the flanking duplex polynucleotide sequence by sequencing the flanking duplex polynucleotide sequence adjacent the ligand binding site.
3. An isolated polynucleotide sequence which is a member of a mutant sequence family, which family can be described by application of steps (a) - (d) below to a seed sequence, said steps comprising:
(a) providing a minimum number of base positions to be mutated simultaneously (MS1);
(b) providing a maximum number of base positions to be mutated (MX1);
(c) reading said seed sequence; and
(d) executing a computer program which, using the values of MS1 and MX1, generates said mutant sequences comprised in said mutant sequence family from said seed sequence.
4. The isolated polynucleotide sequence of claim 3 wherein said seed sequence is a flanking sequence conferring a relative binding affinity to an adjacent polynucleotide binding site for a given ligand, which relative binding affinity can be described by application of the method of claims 1 or 2.
5. The isolated polynucleotide sequence of claim 3 wherein said steps further comprise the step of providing a minimum spacing value (SV1) of greater than 1 between bases to be mutated in said flanking polynucleotide sequence and said computer program further uses said value SV1 to generate said mutant sequence library from said seed sequence.
6. The isolated polynucleotide sequence of claim 3 wherein said steps further comprise the step of providing a ratio of bases G to C (GC1) which must be maintained in said mutant sequences and said computer program further uses the value GC1 to generate said mutant sequence library from said seed sequence.
7. The isolated polynucleotide sequence of claim 3 wherein said steps further comprise the step of providing a ratio of bases A to T (AT1) which must be maintained in said mutant sequences and said computer program further uses the value AT1 to generate said mutant sequence library from said seed sequence.
8. The isolated polynucleotide sequence of claim 3 wherein said steps further comprise the step of specifying a region of said flanking sequence which is not to be mutated as an instruction to said computer program and said computer program further uses this instruction to generate said mutant sequence library from said seed sequence.
9. The isolated polynucleotide sequence of claim 4, wherein said isolated polynucleotide sequence:
(a) is at least 20 base pairs long;
(b) is at least 30% homologous to said seed sequence; and
(c) has a relative binding constant of +5% to said seed sequence.
10. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 40% homologous to said seed sequence.
11. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 60% homologous to said seed sequence.
12. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 80% homologous to said seed sequence.
13. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 30 base pairs long.
14. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 40 base pairs long.
15. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 50 base pairs long.
16. The isolated polynucleotide sequence of claim 9 wherein said isolated polynucleotide sequence is at least 60 base pairs long.
17. A chromosome having inserted therein the isolated polynucleotide sequence of claim 4.
18. A chromosome of claim 17 wherein said isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
19. A DNA construct having inserted therein the isolated polynucleotide sequence of claim 4.
20. A DNA construct of claim 19 wherein said isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
21. An expression vehicle having inserted therein the isolated polynucleotide sequence of claim 4.
22. The expression vehicle of claim 21 wherein said isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
23. A living cell comprising the DNA construct of claim 19 wherein isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
24. The living cell of claim 23 wherein said isolated polynucleotide sequence confers relatively high binding affinity to said polynucleotide binding site.
25. The living cell of claim 24 wherein said isolated polynucleotide sequence confers relatively low binding affinity to said polynucleotide binding site.
26. A flanking polynucleotide sequence for a BamHI binding site, wherein said flanking sequence is at least 40 base pairs in length and said flanking sequence confers a relative binding affinity of greater than or equal to 2x, with respect to a random flanking sequence population when said flanking polynucleotide sequence is adjacent to said BamHI binding site.
27. A flanking polynucleotide sequence for a BamHI binding site, wherein said flanking sequence is at least 40 base pairs in length and said flanking sequence confers a relative binding affinity of less than or equal to 1/2x, with respect to a random flanking sequence population when said flanking polynucleotide sequence is adjacent to said BamHI binding site.
28. A flanking polynucleotide sequence for a BamHI binding site, wherein said flanking sequence is selected from the group consisting of the sequences listed in Tables 1-4.
29. The isolated polynucleotide sequence of claim 3, wherein said seed sequence is selected from the group consisting of the sequences listed in Tables 1-4.
30. A chromosome having inserted therein the isolated polynucleotide sequence of claim 29.
31. The chromosome of claim 30, wherein said isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
32. A DNA construct having inserted therein the isolated polynucleotide sequence of claim 29.
33. The DNA construct of claim 32, wherein the isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
34. An expression vehicle having inserted therein the isolated polynucleotide sequence of claim 29.
35. The expression vehicle of claim 34, wherein said isolated polynucleotide sequence is operationally linked to said polynucleotide binding site.
36. A living cell comprising the DNA construct of claim 35 wherein said isolated polynucleotide sequence is operationally coupled to said polynucleotide binding site.
37. The living cell of claim 29, wherein said isolated polynucleotide sequence confers relatively high binding affinity to said polynucleotide binding site.
38. The living cell of claim 37, wherein said isolated polynucleotide sequence confers relatively low binding affinity to said polynucleotide binding site.
39. The seed sequence of claim 3, wherein the seed sequence is a non-naturally occurring polynucleotide sequence which has the ability to confer a relatively high or low binding affinity to an adjacent polynucleotide binding site for a given ligand.
40. The flanking polynucleotide sequence of claim 26, wherein every nth nucleotide in said sequence is fixed.
41. The flanking polynucleotide sequence of claim 40, wherein n is an integer from 1 to 40.
42. The flanking polynucleotide sequence of claim 40, wherein n is 3.
43. The flanking polynucleotide sequence of claim 40 wherein said fixed nucleotide is guanine (G).
44. The flanking polynucleotide sequence of claim 42 wherein said fixed nucleotide is guanine (G).
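
Claims 3, 5 and 8 describe a computer program that derives a mutant sequence family from a seed sequence using a minimum (MS1) and maximum (MX1) number of mutated positions, an optional minimum spacing (SV1), and an optional protected region. The claims do not disclose the program itself, so the following is only a minimal sketch under stated assumptions: it enumerates substitutions exhaustively, it reads SV1 as the minimum difference between the indices of mutated positions, and it treats the "homologous" percentage of claim 9 as simple position-wise identity. The names `mutant_family` and `percent_identity` are hypothetical.

```python
from itertools import combinations, product

BASES = "ACGT"


def mutant_family(seed, ms1, mx1, sv1=1, protected=frozenset()):
    """Yield each sequence differing from `seed` at k positions, ms1 <= k <= mx1.

    sv1       -- assumed meaning: minimum index distance between mutated bases
    protected -- 0-based positions that must not be mutated (cf. claim 8)
    """
    seed = seed.upper()
    allowed = [i for i in range(len(seed)) if i not in protected]
    for k in range(ms1, mx1 + 1):
        for positions in combinations(allowed, k):
            # combinations() yields sorted positions, so checking neighbours
            # is enough to enforce the minimum spacing.
            if any(b - a < sv1 for a, b in zip(positions, positions[1:])):
                continue
            # At each chosen position, substitute every base except the original.
            choices = [[b for b in BASES if b != seed[p]] for p in positions]
            for subs in product(*choices):
                mutant = list(seed)
                for p, b in zip(positions, subs):
                    mutant[p] = b
                yield "".join(mutant)


def percent_identity(a, b):
    """Position-wise identity of two equal-length sequences, in percent."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    seed = "GATC"  # toy seed; a real seed would be a 40-mer from Tables 1-4
    family = list(mutant_family(seed, ms1=1, mx1=2, sv1=2, protected={0}))
    print(len(family), "mutants; first few:", family[:3])
```

Exhaustive enumeration grows combinatorially with MX1, so for 40-base seeds a practical version would likely sample positions rather than enumerate them. The GC1 and AT1 ratio constraints of claims 6 and 7 are omitted here and could be applied as a filter on the yielded sequences.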
PCT/US1999/012516 1998-06-04 1999-06-04 Compositions of nucleic acid which alter ligand-binding characteristics and related methods and products WO1999063077A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US8790598P 1998-06-04 1998-06-04
US60/087,905 1998-06-04
US32467299A 1999-06-03 1999-06-03
US09/324,672 1999-06-03

Publications (2)

Publication Number Publication Date
WO1999063077A2 true WO1999063077A2 (en) 1999-12-09
WO1999063077A3 WO1999063077A3 (en) 2000-06-29

Family

ID=26777506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/012516 WO1999063077A2 (en) 1998-06-04 1999-06-04 Compositions of nucleic acid which alter ligand-binding characteristics and related methods and products

Country Status (1)

Country Link
WO (1) WO1999063077A2 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995000665A1 (en) * 1993-06-17 1995-01-05 The Research Foundation Of State University Of New York Thermodynamics, design, and use of nucleic acid sequences
WO1997008183A1 (en) * 1995-08-25 1997-03-06 Lane Michael J Nucleic acid capture moieties
US5726014A (en) * 1991-06-27 1998-03-10 Genelabs Technologies, Inc. Screening assay for the detection of DNA-binding molecules
WO1999032664A1 (en) * 1997-12-23 1999-07-01 Tm Technologies, Inc. Method of selecting flanking sequences which convey relative binding affinities to a ligand binding site
WO1999063117A1 (en) * 1998-06-04 1999-12-09 Tm Technologies, Inc. Method of creating flanking polynucleotide sequences which convey relative binding affinities to a ligand binding site

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A.S. BENIGHT ET AL.: "Sequence context and DNA reactivity: Application to sequence-specific cleavage of DNA." ADVANCES IN BIOPHYSICAL CHEMISTRY, vol. 5, 1995, pages 1-55, XP002127590 *
BLACKWELL T K ET AL: "DIFFERENCES AND SIMILARITIES IN DNA-BINDING PREFERENCES OF MyoD AND E2A PROTEIN COMPLEXES REVEALED BY BINDING SITE SELECTION" SCIENCE,US,AMERICAN ASSOCIATION FOR THE ADVANCEMENT OF SCIENCE,, vol. 250, no. 4984, page 1104-1110 XP000371689 ISSN: 0036-8075 *
M.J. LANE ET AL.: "Non-contacted sequences flanking a ligand binding site can modulate the affinity of ligand binding: implications of such modulation in single strand to duplex reaction design." BIOPHYSICAL JOURNAL, vol. 74, no. 2, part 2, 22 - 26 February 1998, page A5 XP002116814 *
OLIPHANT A R ET AL: "DEFINING THE SEQUENCE SPECIFICITY OF DNA-BINDING PROTEINS BY SELECTING BINDING SITES FROM RANDOM-SEQUENCE OLIGONUCLEOTIDES: ANALYSIS OF YEAST GCN4 PROTEIN" MOLECULAR AND CELLULAR BIOLOGY,US,AMERICAN SOCIETY FOR MICROBIOLOGY, WASHINGTON, vol. 9, no. 7, page 2944-2949 XP000673592 ISSN: 0270-7306 *
P.L. NORBY ET AL.: "Determination of recognition-sequences for DNA-binding proteins by a polymerase chain reaction assisted binding site selection method (BSS) using nitrocellulose immobilized DNA binding protein." NUCLEIC ACIDS RESEARCH, vol. 20, no. 23, 1992, pages 6317-6321, XP002116813 *
P.V. RICCELLI ET AL.: "Investigation of DNA context effects: Influences of flanking sequence stability on site specific binding of BamHI restriction enzyme to duplex DNA oligomers." BIOPHYSICAL JOURNAL, vol. 72, no. 2, part 2, 1997, page a95 XP000870137 *
PIERROU S ET AL: "SELECTION OF HIGH-AFFINITY BINDING SITES FOR SEQUENCE-SPECIFIC, DNA BINDING PROTEINS FROM RANDOM SEQUENCE OLIGONUCLEOTIDES" ANALYTICAL BIOCHEMISTRY,US,ACADEMIC PRESS, SAN DIEGO, CA, vol. 229, no. 1, page 99-105 XP000524716 ISSN: 0003-2697 *
THIESEN H J ET AL: "TARGET DETECTION ASSAY (TDA): A VERSATILE PROCEDURE TO DETERMINE DNA BINDING SITES AS DEMONSTRATED ON SP1 PROTEIN" NUCLEIC ACIDS RESEARCH,GB,OXFORD UNIVERSITY PRESS, SURREY, vol. 18, no. 11, page 3203-3209 XP000132496 ISSN: 0305-1048 *
Y. LIN ET AL.: "Peptide conjugation to an in vitro-selected DNA ligand improves enzyme inhibition." PROC. NATL. ACAD. SCI. USA, vol. 92, November 1995 (1995-11), pages 11044-11048, XP002127589 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1134292A2 (en) * 2000-03-17 2001-09-19 Tosoh Corporation Oligonucleotides for detection of vibrio parahaemolyticus and detection method for vibrio parahaemolyticus using the same oligonucleotides
EP1134292A3 (en) * 2000-03-17 2003-12-10 Tosoh Corporation Oligonucleotides for detection of vibrio parahaemolyticus and detection method for vibrio parahaemolyticus using the same oligonucleotides
WO2001092523A2 (en) * 2000-05-30 2001-12-06 Curagen Corporation Human polynucleotides and polypeptides encoded thereby
WO2001092523A3 (en) * 2000-05-30 2002-09-06 Curagen Corp Human polynucleotides and polypeptides encoded thereby
CN104975352A (en) * 2015-05-06 2015-10-14 重庆大学 Oligonucleotide fragment library used for regulation of gene expression level and application thereof

Also Published As

Publication number Publication date
WO1999063077A3 (en) 2000-06-29

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

122 Ep: pct application non-entry in european phase