WO2005121419A2 - Rapid and efficient cDNA library screening by self-ligation of inverse PCR products

Publication number
WO2005121419A2
Authority
WO
WIPO (PCT)
Prior art keywords
gene
pcr
clones
cdna
genes
Prior art date
Application number
PCT/US2005/016765
Other languages
French (fr)
Other versions
WO2005121419A3 (en)
Inventor
Roger A. Hoskins
Mark T. Stapleton
Joseph W. Carlson
Susan E. Celniker
Reed A. George
Original Assignee
The Regents Of The University Of California
Application filed by The Regents Of The University Of California
Publication of WO2005121419A2
Publication of WO2005121419A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1096 Processes for the isolation, preparation or purification of DNA or RNA; cDNA Synthesis; Subtracted cDNA library construction, e.g. RT, RT-PCR

Abstract

The invention provides a method for screening, isolation and recovery of clones using self-ligation of inverse PCR products. The recovery of full-length, intact clones representing genes and alternatively spliced transcripts of interest is described. We demonstrate the utility of the method by recovering full-length cDNA clones for genes and alternatively spliced transcripts, including genes that are not represented in available EST collections. The method is applicable to any plasmid library, including genomic libraries.

Description

RAPID AND EFFICIENT CDNA LIBRARY SCREENING BY SELF-LIGATION OF INVERSE PCR PRODUCTS
Inventors: Roger A. Hoskins, Mark T. Stapleton, Joseph W. Carlson, Susan E. Celniker and Reed A. George
CROSS-REFERENCE TO RELATED APPLICATIONS
[001] This application claims priority to U.S. Provisional Patent Application No. 60/570,582, filed on May 12, 2004, which is hereby incorporated by reference in its entirety.
STATEMENT OF GOVERNMENT SUPPORT
[002] This invention was made during work partially supported by NIH grant HG002673 and the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. The government has certain rights in this invention.
BACKGROUND OF THE INVENTION
FIELD OF THE INVENTION
[003] The invention relates to the use of screening methods for isolation, identification or recovery of cDNAs representing RNA transcripts using PCR and ligation.
DESCRIPTION OF THE RELATED ART
[004] Full-length cDNAs are defined as cDNAs representing the entire mRNA transcript including the 5' and 3' untranslated regions (UTRs) which flank the protein-coding sequence. The traditional approach to recovering full-length cDNA clones representing a majority of the genes in a genome is to sequence random cDNA clones from a clone library from their 5' ends (expressed sequence tag (EST) sequencing) and analyze the sequences to determine which cDNA clones are full-length. This approach has been applied very effectively in the fruit fly, Drosophila melanogaster, for example.
[005] However, the representation of genes in a cDNA library is biased by differences in mRNA abundance. Abundant mRNA transcripts are represented by more clones in a cDNA library than rare transcripts, so abundant transcripts will be represented by multiple sequence reads in an EST project. Rare transcripts which exist in low copy number in a cDNA library will not be easily identified using the EST sequencing method because it samples a library at random.
[006] The use of normalized cDNA libraries improves the efficiency of gene discovery by EST sequencing, but even the best methods result in very incomplete normalization, so this approach still requires significant sampling [Stapleton, M., et al., A Drosophila full-length cDNA resource. Genome Biology, 2002. 3: p. research0080; Okazaki, Y., Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 2002. 420(6915): p. 563-73; Imanishi, T., Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol, 2004. 2(6): p. e162]. In addition, because ESTs are derived from cDNA ends, they do not efficiently interrogate alternative splicing in the central regions of transcripts. Thus, new high-throughput methods are needed to selectively recover cDNAs.
[007] Many methods for finding and sequencing rare transcripts have been developed, but these are often arduous and tedious to perform on a large scale, or do not result in recovery of full-length cDNA clones. Full-length cDNA clones are most useful to the biologist for further characterization and study of identified genes. Such methods include hybridization-based library screening approaches. The traditional method for screening a cDNA library for clones representing a gene of interest is hybridization of labeled gene-specific DNA probes to colonies or plaques lifted onto a nylon filter [see Maniatis, T., E.F. Fritsch, and J. Sambrook, Molecular Cloning: A Laboratory Manual. 2nd ed. 1989, Plainview, NY: Cold Spring Harbor Laboratory Press]. This method is labor- and time-intensive, especially when the desired clones are rare in the library. It is not an efficient approach for screening libraries on a large scale.
[008] One method has been described for screening arrayed cDNA libraries by PCR of pooled clones in a combinatorial scheme [Munroe, D.J., R. Loebbert, E. Brie, T. Whitton, D. Prawitt, D. Vu, A. Buckler, A. Winterpacht, B. Zabel, and D.E. Housman, Systematic screening of an arrayed cDNA library by PCR. Proc Natl Acad Sci U S A, 1995. 92(6): p. 2209-13]. This approach requires arraying individual clones into microtiter wells and is therefore practical only for abundantly expressed transcripts.
[009] Several PCR-based methods have been developed to recover cDNA sequences for specific genes, such as reverse transcriptase PCR (RT-PCR), but they result in recovery of the protein-coding portion of the transcribed sequence without the 5' and 3' untranslated regions (UTRs) and do not generally recover many of the alternatively spliced transcripts. In RT-PCR [Mocharla, H., R. Mocharla, and M.E. Hodes, Coupled reverse transcription-polymerase chain reaction (RT-PCR) as a sensitive and rapid method for isozyme genotyping. Gene, 1990. 93(2): p. 271-5], first-strand cDNA is used as a template in a PCR reaction with a pair of gene-specific primers at the 5' and 3' ends of the desired transcript. This procedure can generate cDNAs that are as complete as the starting gene model. However, it depends on an accurate and complete model of the transcript of interest, and most gene-finding algorithms predict only the open reading frame. In particular, 5' and 3' UTRs are very difficult to predict and are therefore not usually captured by RT-PCR. The related Rapid Amplification of cDNA Ends (RACE) method [Frohman, M.A., M.K. Dush, and G.R. Martin, Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci U S A, 1988. 85(23): p. 8998-9002] addresses the problem of identifying UTRs, but it produces PCR products representing incomplete transcripts that contain either the 5' or 3' UTR and part of the coding sequence, rather than full-length cDNA clones. In addition, because only one of the primers in a RACE PCR reaction is gene-specific, successful amplification often requires sequential rounds of PCR with nested primers.
[010] An alternative method for obtaining both the 5' and 3' ends of a transcript has been described in which double-stranded cDNA is self-ligated in dilute solution to produce circular molecules rather than cloned into a vector [Huang, S.H., S.H. Chen, and A.Y. Jong, Use of inverse PCR to clone cDNA ends. Methods Mol Biol, 2003. 221: p. 51-8]. The circularized cDNA is used as a template for an inverse PCR reaction using gene-specific primers oriented away from one another in the transcript sequence. The resulting PCR products include both the 5' and 3' ends of the transcript, which are joined together in inverted orientation at the point of ligation. This approach to characterizing the 5' and 3' ends of transcripts ensures that the two ends within a PCR product are derived from the same transcript isoform. The products can be cloned and characterized, but they are rearranged relative to the intact transcript. Thus, the method does not lead directly to intact, full-length cDNA clones.
[011] Two related methods for recovering complete and intact cDNA clones from plasmid libraries, MACH-1 and MACH-2 [Haerry, T.E. and M.B. O'Connor, Isolation of Drosophila activin and follistatin cDNAs using novel MACH amplification protocols. Gene, 2002. 291(1-2): p. 85-93], are based on site-directed mutagenesis. MACH-1, based on the Stratagene QuikChange™ PCR-Based Site-Directed Mutagenesis Kit [PCR-Based Site-Directed Mutagenesis. 2005], uses a pair of overlapping, oppositely directed, gene-specific primers to amplify cDNA sequences from a plasmid library in a linear amplification reaction. The methylation-sensitive restriction enzyme Dpn I is then used to digest the methylated plasmid library template, leaving the un-methylated amplification products intact. The products self-anneal to form nicked circles that are repaired upon transformation into bacteria. Because MACH-1 is a linear amplification method, it is not suitable for recovery of rare cDNAs. MACH-2, based on [Jones, D.H., Howard, B.H., 1990. A rapid method for site-specific mutagenesis and directional subcloning by using the polymerase chain reaction to generate recombinant circles. Biotechniques 8: p. 178-183.], uses two separate PCR reactions with different pairs of gene-specific primers to amplify cDNA sequences from a library. Following Dpn I digestion to degrade the library template, the two linear DNA products are size-selected and purified by agarose gel electrophoresis, mixed together, and melted and re-annealed to form hybrid molecules, which are transformed into bacteria. MACH-2 appears to be effective and suitable for recovery of rare cDNAs. However, because it requires two PCR reactions per target and includes a gel purification step, it is too inefficient for large-scale screens.
[012] Herein is described a similar but simpler method for efficient inverse-PCR-based cDNA library screening, which is alternatively called Self-Ligation of Inverse PCR products (SLIP). SLIP is similar to the Stratagene ExSite™ site-directed mutagenesis protocol [Stratagene ExSite™ PCR-Based Site-Directed Mutagenesis Kit. Catalog #200502, Revision #073006f, found at URL http://www.stratagene.com/manuals/200502.pdf] but is used to screen a plasmid cDNA library to obtain clones for a gene of interest, and not to introduce a mutation into a clone.
[013] An important goal of the human genome project and of model organism genome projects is to develop cDNA collections representing most, or all major spliced forms of, most or all genes. EST sequencing has been a very efficient method for obtaining clones representing a large fraction of the genes. However, EST sequencing is a random or "shotgun" approach and therefore is not a practical method for recovery of clones representing rare transcripts for completion of a cDNA collection.
[014] In the fruit fly Drosophila melanogaster, for example, the sequencing of approximately 259,000 5' ESTs from libraries representing a variety of tissues and developmental stages led to the recovery of a collection of full-length cDNA clones called the Drosophila Gene Collection (DGC). The DGC currently represents approximately 10,000 of the predicted 14,000 protein-coding genes in Drosophila, and 200 polyadenylated non-protein-coding genes. Full-insert sequencing of this collection has led to the definition of precise gene structures and a high-quality annotation of protein-coding transcripts, including 5' and 3' UTRs, encoded in the genomic sequence. The complete ORFs represented in the DGC are also a valuable resource for molecular genetics, genomics, and proteomics research. A primary goal of the Berkeley Drosophila Genome Project is to complete the DGC by identifying and sequencing full-length cDNA clones representing both the remaining genes and, for all genes, the major alternatively spliced forms affecting protein-coding potential. Further EST sequencing is not an efficient or practical method to achieve these goals. Traditional hybridization-based library screening methods are also not efficient enough to be practical on a large scale, and methods such as RT-PCR and RACE do not recover full-length cDNA clones.
[015] The development of non-redundant cDNA collections is one of the objectives of the human and model organism genome projects [Strausberg, R.L., E.A. Feingold, R.D. Klausner, and F.S. Collins, The mammalian gene collection. Science, 1999. 286(5439): p. 455-7; Gerhard, D.S., et al., The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res, 2004. 14(10B): p. 2121-7] for at least two reasons. First, sequencing of full-length cDNA clones is the most accurate and reliable way to delineate gene structures, including exons and introns, polyadenylation sites, and 5' and 3' UTRs [Haas, B.J., N. Volfovsky, C.D. Town, M. Troukhan, N. Alexandrov, K.A. Feldmann, R.B. Flavell, O. White, and S.L. Salzberg, Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol, 2002. 3(6): p. RESEARCH0029; Misra, S., et al., Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biology, 2002. 3(12): p. research0083]. Second, complete cDNA collections are a necessary starting point for many large-scale studies of gene and protein function, including spotted DNA microarray analysis [Schena, M., D. Shalon, R.W. Davis, and P.O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 1995. 270(5235): p. 467-70], yeast two-hybrid protein interaction screening [Uetz, P., et al., A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 2000. 403(6770): p. 623-7; Ito, T., et al., A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A, 2001. 98(8): p. 4569-74], and high-throughput x-ray crystallography [Hui, R. and A. Edwards, High-throughput protein crystallization. J Struct Biol, 2003. 142(1): p. 154-61; Busso, D., R. Kim, and S.H. Kim, Expression of soluble recombinant proteins in a cell-free system using a 96-well format. J Biochem Biophys Methods, 2003. 55(3): p. 233-40; Yakunin, A.F., A.A. Yee, A. Savchenko, A.M. Edwards, and C.H. Arrowsmith, Structural proteomics: a tool for genome annotation. Curr Opin Chem Biol, 2004. 8(1): p. 42-8].
[016] The new approach for recovery of specific cDNA clones of interest from libraries, which we describe here, will be useful for expanding full-length cDNA clone collections in Drosophila, human, mouse and any other species. It can also be used to recover cDNA clones representing genes and alternatively spliced transcripts of interest from any plasmid library. Finally, it can be generally applied to the recovery of specific clones from any plasmid library, including genomic libraries.
SUMMARY OF THE INVENTION
[017] In a first aspect, the invention provides a practical, efficient, and effective approach to the isolation and identification of plasmid cDNA clones representing RNA transcripts of interest, including rare transcripts and alternative transcripts, relative to available methods. In a preferred embodiment, the method is PCR-based and, when used to isolate cDNA clones from a plasmid library, results in the isolation of full-length and alternatively spliced cDNA clones. The resulting cDNA clones are intact and in a form that is readily usable for other applications including sequencing and transcript analysis, and cloning of portions of the cDNA insert into other vectors, such as expression vectors.
[018] Thus, the invention provides for a method for recovery of clones from a plasmid library comprising the steps of: (a) providing a pair of oppositely directed inverse PCR primers that abut each other at their 5' ends, wherein said inverse PCR primers are sequence-specific to a gene or transcript of interest and phosphorylated at their 5' ends; (b) mixing said phosphorylated inverse PCR primers and at least one template comprising said plasmid of interest; (c) amplifying said plasmid of interest by inverse PCR to generate a linear PCR product; (d) digesting any unamplified template with a methylation-sensitive restriction enzyme; (e) ligating said linear PCR product to circularize the PCR product; and (f) isolating said circularized PCR product by molecular cloning.
[019] The method can further comprise the step of polishing the ends of the PCR products with a DNA polymerase before ligation to ensure blunt ends were created during PCR amplification by the polymerase.
[020] In one aspect of the method, the template is comprised of a plasmid library containing a cDNA of interest and the amplification is performed by a high-fidelity thermostable polymerase. In one embodiment, the template is a high-quality plasmid cDNA library representing organisms, tissues, and developmental stages of interest. Digestion of any unamplified template can be carried out using a methylation-sensitive DNA restriction enzyme such as DpnI. Ligation can be performed using a ligase enzyme such as T4 DNA ligase. Finally, the circularized PCR product should be the intact plasmid of interest, including full-length insert and plasmid vector, wherein isolation of the circularized PCR product is by transformation of the circularized PCR product into a host and the subsequent isolation of transformant clones for selection of a full-length, intact plasmid of interest.
[021] In one aspect the method is useful for selecting and isolating genes missing from gene collections of model organism species, because the transcripts of interest are not represented in available EST collections. For example, this method can be useful in isolating cDNAs representing rare and/or alternatively spliced RNA transcripts.
[022] In another aspect the method promotes the recovery of clones representing a gene of interest. For example, an individual researcher interested in a specific gene which is found in one organism may use this method to isolate cDNAs representing the corresponding gene in another organism. Here, primers might be designed against just one or two genes of interest. Any individual with access to a cDNA library from the species of interest can isolate cDNA clones representing any gene of interest, provided some specific sequence of the gene is available.
[023] In another aspect, high-throughput recovery of many cDNAs is enabled by this method, such that high-throughput users like genome centers and biotechnology companies can recover large numbers of cDNAs representing large sets of genes of interest in a directed way. Yet another application would be the ability to isolate gene families within a given organism. Here, primers are designed in such a way that in a single pass many similar genes of interest can be isolated. For example, one can design a pair of primers and using the method, isolate most homeodomain-containing transcription factor genes expressed in a specific tissue at a specific developmental stage.
[024] In another aspect, the present invention provides for a kit comprising a set of instructions directing a user to carry out the described method and reaction volumes of enzymes, buffers and other components useful for carrying out the method.
BRIEF DESCRIPTION OF THE DRAWINGS
[025] Figure 1 is a schematic showing the basic method whereby clones of interest can be recovered from any plasmid library.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[026] The term "inverse PCR" is meant to include the technique of amplifying known or unknown sequences in DNA adjacent to known sequence in DNA using the Polymerase Chain Reaction (PCR) and PCR primers directed away from one another in the sequence, rather than toward one another as for standard PCR. The general technique has been applied in the art for some site-directed mutagenesis methods and is described by Ochman H, Gerber AS, Hartl DL, 1988, Genetic applications of an inverse polymerase chain reaction, Genetics. 120(3): p. 621-3, which is hereby incorporated by reference. The term "template" herein refers to a substrate for a PCR reaction. In general, this can be any collection of DNA molecules from any source. In the preferred embodiment described here, "template" refers to a plasmid library from which a plasmid clone of interest is to be recovered. A plasmid library is constructed by cloning of a mixture of cDNA or genomic DNA sequences into a plasmid cloning vector.
[027] The term "primer" herein refers to an oligonucleotide used to direct the site of initiation of DNA polymerization and is required for the PCR amplification process. Primers should be of sufficient length to hybridize stably to the template and must represent unique sequence in the template. It is well known in the art that primers are usually 15-30 bases in length, although longer primers can be used for greater specificity.
[028] In this specification, the singular forms "a," "an," and "the" include plural reference unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which this invention belongs.
[029] The present invention describes a method for recovery of clones from a plasmid library comprising the steps of: (1) providing a pair of oppositely directed inverse PCR primers that abut each other at their 5' ends, wherein said inverse PCR primers are sequence-specific to a gene or transcript of interest and phosphorylated at their 5' ends; (2) mixing said phosphorylated inverse PCR primers and at least one template comprising a plasmid library containing said gene or transcript of interest; (3) amplifying said gene or transcript of interest by inverse PCR to generate linear PCR product; (4) digesting any unamplified template; (5) ligating said linear PCR product to circularize the PCR product; and (6) isolating said PCR product by molecular cloning.
[030] In the general embodiment, a pair of inverse PCR primers is designed directly adjacent to each other, meaning they abut each other at their 5' ends, and pointing in opposing directions when hybridized to the gene or transcript of interest. The inverse PCR primers should be sequence-specific to a gene or transcript of interest. By sequence-specific, it is meant that the primers should be designed so that they are substantially homologous to a specific sequence in the gene or transcript of interest so as not to amplify unintended genes or transcripts. In a preferred embodiment, the primers hybridize under stringent conditions to a specific sequence in the gene or transcript of interest, having 100% identity to the gene or transcript of interest and are exon-specific.
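The primer geometry described in paragraph [030] can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the example exon sequence, and the fixed primer length are hypothetical and are not taken from the specification, and 5' phosphorylation (a chemical modification) is not modeled.

```python
# Illustrative sketch; names and the example sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def abutting_inverse_primers(exon: str, junction: int, length: int = 20):
    """Return (forward, reverse) primers whose 5' ends abut at `junction`.

    The forward primer copies the top strand starting at `junction`.
    The reverse primer is the reverse complement of the bases immediately
    upstream, so the two primers point away from each other with no overlap.
    """
    if junction < length or junction + length > len(exon):
        raise ValueError("junction too close to the end of the sequence")
    forward = exon[junction:junction + length]
    reverse = reverse_complement(exon[junction - length:junction])
    return forward, reverse


# Made-up 50-base exon used only to exercise the function.
exon = "ATGGCTAGCTTGACCGATCGTACGGATCCATTGCAGGCTAACGTTAGCCA"
fwd, rev = abutting_inverse_primers(exon, junction=25, length=20)
print("forward 5'->3':", fwd)
print("reverse 5'->3':", rev)
```

Reading the reverse primer back onto the top strand and appending the forward primer reproduces one uninterrupted stretch of the exon, which is what "abutting at their 5' ends" means here.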
[031] As used herein, a polynucleotide or fragment thereof is "substantially homologous" (or "substantially similar") to another if, when optimally aligned (with appropriate nucleotide insertions or deletions) with the other polynucleotide (or its complementary strand), using BLASTN (Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410), there is nucleotide sequence identity in at least about 80%, preferably at least about 90%, and more preferably at least about 95-98% of the nucleotide bases, and most preferably 100% identity in the nucleotide sequence. To determine homology between two different polynucleotides, the percent homology is to be determined using the BLASTN program "BLAST 2 sequences". This program is available for public use from the National Center for Biotechnology Information (NCBI) over the Internet (Tatiana A. Tatusova, Thomas L. Madden (1999), "Blast 2 sequences - a new tool for comparing protein and nucleotide sequences", FEMS Microbiol Lett. 174:247-250). The parameters to be used are whatever combination yields the highest calculated percent homology (as calculated below) with the default parameters shown in parentheses:
Program: blastn
Reward for a match: 0 or 1 (1)
Penalty for a mismatch: 0, -1, -2 or -3 (-2)
Open gap penalty: 0, 1, 2, 3, 4 or 5 (5)
Extension gap penalty: 0 or 1 (1)
Gap x_dropoff: 0 or 50 (50)
Expect: 10
Word size: 11
Filter: low complexity
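As a rough illustration of how the identity thresholds in paragraph [031] might be applied, the sketch below computes percent identity from an alignment that has already been produced and maps it to the tiers named in the text. It does not run or reimplement BLASTN; counting identical columns over the full alignment length (gaps included) is an assumption made here for simplicity, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns with identical bases ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def homology_tier(pct: float) -> str:
    """Map a percent identity onto the tiers listed in the text."""
    if pct >= 100.0:
        return "most preferred (100%)"
    if pct >= 95.0:
        return "more preferred (at least about 95%)"
    if pct >= 90.0:
        return "preferred (at least about 90%)"
    if pct >= 80.0:
        return "substantially homologous (at least about 80%)"
    return "not substantially homologous"


pct = percent_identity("ACGTACGTAC", "ACGTTCGTAC")  # one mismatch in ten columns
print(pct, homology_tier(pct))
```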
[032] The term "stringent conditions" refers to conditions under which a primer will hybridize to its target subsequence, but to no other sequences. Stringent conditions are sequence-dependent and will be different in different circumstances. Longer sequences hybridize specifically at higher temperatures. Generally, stringent conditions are selected to be about 15°C lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength, pH, and nucleic acid concentration) at which 50% of the probes complementary to the target sequence hybridize to the target sequence at equilibrium. (As the target sequences are generally present in excess, at Tm, 50% of the probes are occupied at equilibrium).
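The relationship between Tm and a stringent hybridization temperature can be illustrated numerically. The specification does not state how Tm is to be calculated; the sketch below assumes the simple rule of thumb Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a coarse approximation for short oligonucleotides, and then applies the offset of about 15°C given in paragraph [032]. The function names are hypothetical.

```python
def rule_of_thumb_tm(primer: str) -> float:
    """Approximate Tm in degrees C using 2*(A+T) + 4*(G+C).

    Assumption: this rule of thumb stands in for whatever Tm model a
    practitioner actually uses; it ignores salt and primer concentration.
    """
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2.0 * at + 4.0 * gc


def stringent_temperature(tm: float, offset: float = 15.0) -> float:
    """Temperature about `offset` degrees below Tm, as described in the text."""
    return tm - offset


primer = "ATGCATGCATGCATGCATGC"  # made-up 20-mer with 50% GC content
tm = rule_of_thumb_tm(primer)
print(f"Tm ~ {tm:.0f} C, stringent hybridization at ~ {stringent_temperature(tm):.0f} C")
```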
[033] The present invention contemplates using inverse PCR primers designed by techniques known to those of skill in the art, including optimization for annealing temperatures, the specificity of the primers to the template, and length of the primer. The design of the primers can be done using primer prediction software such as Oligo 6 (Molecular Biology Insights, Inc., Cascade, CO). Custom scripts and software for primer design can also be used, such as those referred to in Example 2. In a preferred embodiment, the inverse PCR primers are designed according to the methods described in Example 2 using Equation 1 in order to allow Primer3 to design primer pairs using the parameters set forth in Table 2.
[034] In one embodiment for the isolation of full-length cDNA clones, a primer pair is designed near the 5' end of a gene of interest, to select against truncated and incomplete cDNA clones. The cDNA clones are often produced by initiating reverse transcription of mRNA into a first-strand cDNA with a primer that hybridizes to the 3' poly A tail present on most mRNA transcripts. By designing primers near the 5' end of the gene of interest, this ensures that the largest or full-length clone is selected and not a clone representing a shorter fragment of the transcript missing the 5' end.
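The selection effect described in paragraph [034] can be shown with a toy model. The sketch assumes that every clone in the library was primed from the poly A tail and therefore contains the 3' end of the transcript, and that a clone differs from the others only in how far it extends toward the 5' end. The coordinates and names are hypothetical.

```python
def amplifiable(clone_start: int, junction: int, primer_len: int = 20) -> bool:
    """True if the clone contains the whole primer-binding region.

    `clone_start` is the transcript coordinate where the cloned insert begins
    (0 means full-length). The primer pair covers the interval
    [junction - primer_len, junction + primer_len), so the clone must start
    at or before the upstream edge of that interval.
    """
    return clone_start <= junction - primer_len


# Made-up 5' start positions for five clones of a 2,000-base transcript.
clone_starts = [0, 15, 300, 900, 1500]

near_5_prime = [s for s in clone_starts if amplifiable(s, junction=40)]
near_3_prime = [s for s in clone_starts if amplifiable(s, junction=1800)]

print("primers near the 5' end recover clones starting at:", near_5_prime)
print("primers near the 3' end recover clones starting at:", near_3_prime)
```

With the primer pair near the 5' end, only the clones that reach that region can be amplified; placed near the 3' end, the same primers would amplify truncated clones as well.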
[035] The phosphorylation of primers at their 5' ends may be carried out using techniques known in the art and described in Current Protocols in Molecular Biology, (Eds. F.M. Ausubel, et al., John Wiley and Sons, Inc., Edison, NJ, pp 10.4.4-5), which is hereby incorporated by reference in its entirety. At least one of the primers in the primer pair is required to be phosphorylated in order for the subsequent step of ligation of linear PCR products to occur.
[036] In a preferred embodiment, the template is comprised of a complex library pool or a plasmid library containing the plasmid of interest. Genome centers and biotechnology companies have made such plasmid libraries commercially and publicly available. For example, the plasmid library template used in the examples is a mixture of Drosophila melanogaster cDNA libraries made from embryo, adult head, larvae and pupal tissues.
[037] In a preferred embodiment, the inverse polymerase chain reaction (inverse PCR or iPCR) is carried out with a high-fidelity thermostable polymerase for a sufficient number of cycles to amplify the plasmid of interest and generate the PCR products. Examples of suitable polymerases include PHUSION (Finnzymes, Espoo, Finland) and Pfu polymerases. The number of PCR cycles is preferably great enough to ensure that sufficient amounts of PCR products are generated.
[038] In some embodiments, the steps in the method of ligation and digestion can be reordered.
[039] In a preferred embodiment, digestion of the template DNA is preferably performed by a DNA-methylation sensitive restriction enzyme such as DpnI, to remove all methylated unamplified template. This allows for specific selection of the PCR-amplified plasmid of interest.
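The selectivity of the digestion step can be sketched as follows. DpnI cuts within the sequence GATC only when the adenine is methylated, as it is in plasmid DNA prepared from dam-positive E. coli, whereas DNA synthesized in vitro by PCR is unmethylated and is left intact. For simplicity the sketch treats each molecule as a linear, fully methylated or fully unmethylated single-strand string; the function name and example sequence are hypothetical.

```python
def dpni_digest(seq: str, dam_methylated: bool) -> list:
    """Return the fragments left after DpnI treatment of a linear sequence.

    Methylated DNA is cut between the A and the T of every GATC site
    (blunt ends); unmethylated DNA is returned as a single intact molecule.
    """
    if not dam_methylated:
        return [seq]
    fragments = []
    start = 0
    pos = seq.find("GATC")
    while pos != -1:
        cut = pos + 2  # cut falls between GA and TC
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find("GATC", pos + 1)
    fragments.append(seq[start:])
    return fragments


site_rich = "AAGATCTTTGATCGG"  # made-up sequence with two GATC sites
print("library template (methylated):", dpni_digest(site_rich, dam_methylated=True))
print("PCR product (unmethylated):   ", dpni_digest(site_rich, dam_methylated=False))
```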
[040] Ligation methods are preferably carried out using a ligase enzyme such as T4 DNA ligase.
[041] In another embodiment, the method can further comprise a step during digestion where the ends of the PCR products are polished with a DNA polymerase such as Pfu polymerase before ligation, to ensure blunt ends. See Wang, K., Koop, B.F. and Hood, L. (1994) BioTechniques, 17, 236-238. For example, some DNA polymerases produce PCR products with overhanging 3' A residues, and such products are not suitable, thus requiring the ends to be polished. In another embodiment the method can also optionally include purification of the PCR product before ligation, if required.
[042] Referring to Figure 1, in a preferred embodiment, a pair of oppositely directed PCR primers is designed within an exon of a target gene. A target gene is shown in Fig. 1A as the double-stranded sequence with the two inverse PCR primers shown hybridized above and below the target gene sequence. The primers abut at their 5' ends with no overlap, and the 5' ends are phosphorylated. The primers are used to amplify specific clones from a template, shown in Figure 1 as a plasmid cDNA library. The corresponding positions of the primers (shown as two short arrows on either side of the insert) within a target cDNA are shown in Fig. 1B, with the vector indicated in white and the cloned cDNA insert indicated in black and white cross-hatch. The result of PCR amplification of the cloned cDNA inserts using the primers is the set of linear amplification products in Fig. 1C. The linear amplification products are complete sequences of target clones, including the intact vector and the entire insert, which is split into two halves at the position of the PCR primers. Fig. 1D shows self-ligation of the linear PCR products into circular products that regenerate the original target cDNA clones, and digestion of the starting template, leaving the self-ligated amplification products intact. In a preferred embodiment, the methylation-sensitive restriction enzyme DpnI is used to digest the unamplified template DNA. These products are cloned, sequenced, and analyzed as described herein to identify bona fide target-specific cDNAs (Fig. 1E).
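The amplification and self-ligation steps shown in Figs. 1C and 1D can be modeled on strings. In this sketch a circular plasmid is represented by one strand written as a linear string, inverse PCR with abutting primers is modeled as opening the circle at the primer junction, and self-ligation is modeled as closing it again. The sequences and function names are made up for illustration.

```python
def inverse_pcr_product(plasmid: str, junction: int) -> str:
    """Linear product of inverse PCR: the circle opened at the primer junction."""
    j = junction % len(plasmid)
    return plasmid[j:] + plasmid[:j]


def same_circle(a: str, b: str) -> bool:
    """True if two strings are rotations of one another (same circular molecule)."""
    return len(a) == len(b) and a in (b + b)


vector = "GGCCTTAAGGCC"      # made-up vector backbone
insert = "ATGCCGTTAGCATAA"   # made-up cDNA insert
plasmid = vector + insert     # circular clone, written starting at the vector

# Primers abut six bases into the insert.
linear = inverse_pcr_product(plasmid, junction=len(vector) + 6)

print("linear product:", linear)
print("regenerates original clone after self-ligation:", same_circle(linear, plasmid))
```

The linear product begins with the downstream part of the insert, runs through the intact vector, and ends with the upstream part of the insert, matching the description of an insert "split into two halves" that is rejoined on ligation.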
[043] In another embodiment the isolation of the PCR products in the method of the present invention is carried out by transformation of an E. coli host with the ligated reaction products of interest, overnight growth, and DNA sequencing. The invention contemplates the use of known sequencing methods using various sequencing reactions and sequencing devices such as the Perkin-Elmer ABI 3730 Sequencer (Perkin-Elmer, Wellesley, MA) to generate DNA sequence reads. The generated sequence can then be analyzed by conventional sequence analysis methods of comparison and alignment against known sequences from various organisms which are available through public and proprietary databases. The analysis should also look specifically for the presence of the inverse PCR primer sequences and the integrity of the ligation junction. [044] The success of the present method is not correlated with named genes (a common surrogate for the confidence of the target gene annotation) nor with the length of the target nor with the presence of EST evidence from other cDNA libraries. Several aspects of the methods can be optimized or modified to increase the success rate of the present method. [045] In a preferred embodiment, the template is comprised of cDNA library or pool of cDNAs with high complexity and contains transcripts from several tissues and developmental stages of the organism of interest. If screening failures are observed, they may be due to the absence of cDNAs for the target gene from the plasmid library template since recovery of a cDNA clone for a particular gene of interest depends primarily on the presence of a clone in the library template. The absolute complexity of the library pool used may be difficult to estimate. If the library pool is diluted 500-fold for the first round of screening experiments, a second screen for can be performed using a higher concentration of library pool. 
In a preferred embodiment, the second screen should be performed using a 2-, 3-, 5-, 10-, 20-, or 50-fold higher concentration of library pool, to yield additional specific clones.
[046] In addition to the necessity that target clones be present in the library template, the quality of the annotation of predicted genes may have a direct effect on the success of the screen. If PCR primers are designed based on annotations for predicted genes, including some for which no molecular evidence currently exists, success in recovering clones then also depends upon the accuracy of the gene predictions. Further examination of the PCR primer sequences and the curated gene models they were designed to target may suggest other ways of improving the screening success rate. While one potential cause for non-target clones may be incomplete DpnI digestion of the template plasmid library, in some cases the non-target clones may in fact share some sequence complementary to one or both of the PCR primers. Therefore, in a preferred embodiment, the primers are designed based upon the latest and best-quality annotation for predicted regions to prevent mis-priming and mis-targeting of genes of interest.
[047] In another embodiment, one of the parameters that may be adjusted is the number of isolates selected for end-sequencing. In one embodiment, picking four isolates instead of three isolates increased screening success rates by approximately 12%. Similarly, four isolates yield approximately 32% more screening successes than two isolates, and 88% more screening successes than a single isolate. It is estimated that picking more than four isolates will give an incremental increase (5%) in the number of successes; however, this increase in success rates must be balanced against the increase in costs of high-throughput clone processing and end-sequencing.
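The relative gains quoted above can be converted into the gain obtained from each additional isolate. The following is a minimal arithmetic sketch and not part of the described method; the function and variable names are illustrative assumptions, and only the 12%, 32% and 88% figures are taken from the text.

```python
# Extra screening successes obtained with four isolates relative to fewer
# isolates, as quoted in the text (4 vs 3: +12%, 4 vs 2: +32%, 4 vs 1: +88%).
# All names below are illustrative assumptions.
GAIN_OF_FOUR_OVER = {3: 0.12, 2: 0.32, 1: 0.88}


def relative_success(n_isolates):
    """Successes with n isolates as a fraction of the four-isolate result."""
    if n_isolates == 4:
        return 1.0
    return 1.0 / (1.0 + GAIN_OF_FOUR_OVER[n_isolates])


def stepwise_gain(n_from, n_to):
    """Fractional increase in successes when going from n_from to n_to isolates."""
    return relative_success(n_to) / relative_success(n_from) - 1.0


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} -> {n + 1} isolates: +{stepwise_gain(n, n + 1):.0%}")
```

Under these figures the gain per added isolate falls from roughly 42% (one to two isolates) to 18% (two to three) to 12% (three to four), which is in keeping with the small gain estimated for going beyond four isolates.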
[048] Another parameter that may be adjusted is the number of isolates selected for full-insert sequencing. In the Examples, the selected clones were identical in most cases; however, there were isolated cases where differences were found. In a preferred embodiment, to acquire the most data for a given target, all four isolates should be sequenced to full length.
[049] Another area of optimization is in any automated analysis of the end sequence to determine which clones are considered for sequence finishing. Analysis of the finished sequence from this experiment, largely gained through manual examination of the clone sequences, suggests that a criterion of a minimum of 50% sequence identity of the clone with its target annotation over half of the sequence length generated indicates that a clone is worth finishing.
[050] The success of this library screening further raises the question of when a cDNA collection project should switch from an EST-based approach to a directed library screening approach. At the end of the BDGP EST sequencing project, the last 10,000 EST sequences identified cDNAs representing 96 (1%) new genes not previously represented in the collection. It was at that point decided that additional EST sequencing was not warranted. However, with the present screening method as an available alternative, current and future cDNA collection projects can switch from an EST sequencing approach to directed screening at an earlier stage. To find the optimal time to switch, a complete cost analysis may be required to determine feasibility, balancing costs against efficiency and success.
[051] In summary, the present method is an effective method for increasing the representation of genes and transcripts in comprehensive cDNA collections, such as those currently under construction by the NIH Mammalian Gene Collection project for several model organisms and the human. As described in the Examples, it has been successfully used to efficiently screen for 104 of 153 target genes and recover full length cDNA clones for 84 of the genes. Our results also demonstrate that the present method can be used to screen for cDNAs representing alternatively-spliced transcripts. By designing PCR primers in predicted isoform-specific exonic sequence, cDNAs containing the alternatively spliced sequences can be captured. Finally, the utility of the present method is not limited to genome projects. The method is simple and should be useful in any project requiring the isolation of cDNA clones.
[052] In another embodiment, the present invention provides for a kit comprising reaction volumes of enzymes and buffers useful for carrying out the method and a set of instructions directing a user to carry out the described method. In another embodiment, the kit would comprise sufficient volumes of enzymes and buffers useful for carrying out the method at least 5, 10, 20, or 100 times.
[053] In a preferred embodiment, the kit would be comprised of components to carry out phosphorylation of the inverse PCR primers, including aliquots of buffer, and polynucleotide kinase, and components to carry out the inverse PCR reaction including aliquots of buffer, nucleotides, and polymerase, and control primers and template. In another embodiment, the kit would be further comprised of components to carry out the digestion and ligation steps including buffer, a DNA restriction enzyme and a ligase; and components to carry out the transformation and isolation step including plasmids and competent host cells.
EXAMPLE 1 High-throughput Screening of a cDNA Library Pool for 153 Drosophila Transcription Factor Genes
[054] The present method (also referred to as Self-Ligation of Inverse PCR Products (SLIP) screening) was tested by screening a pool of cDNA libraries for clones representing 153 Drosophila transcription factor genes. These are curated genes in the Release 3.1 D. melanogaster genome sequence annotation that have been assigned the function attribute "transcription factor" in the Gene Ontology database [Harris, M.A., The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, 2004. 32 Database issue: p. D258-61] and that were not yet represented by cDNA clones in the DGC. Twenty-four of the targeted genes are represented by one or more ESTs, but the cDNAs that had been previously selected for full-insert sequencing were found to be defective in some way, and so replacement cDNAs were needed since we did not have a candidate full-length replacement EST. The remaining 129 target genes are not represented by ESTs in the collection. A complete list of the 153 target genes and predicted size is presented in Table 1. The sequences of the primer pairs used in screening are not shown.
Table 1. 153 Target genes and predicted size.
Figure imgf000018_0001
Figure imgf000019_0001
Figure imgf000020_0001
Figure imgf000021_0001
* CGs with EST evidence
[055] Aliquots of four plasmid cDNA libraries, all of which were previously used in our EST sequencing projects, were pooled and diluted to produce a template for SLIP screening. The library pool (1 μg/μl) was diluted (1/500) to provide sufficient template for at least 3,000 PCR reactions. Aliquots of the GH (adult head, 1.23 μg/μl), LD (embryo, 1 μg/μl), LP (larva and pupae, 1.16 μg/μl), and SD (S2 cell line, 0.66 μg/μl) plasmid pOT2, chloramphenicol resistant, cDNA libraries described in Rubin, G.M., et al., A Drosophila complementary DNA resource. Science, 2000. 287(5461): p. 2222-4, were pooled to make a mixed library stock. Each library was available as a singly amplified stock, and 10 μl aliquots of each were combined to generate the pool. The complexity of the mixed stock was estimated to be approximately 2 x 10^6 independent clones. The mixed stock was diluted 1:500 to produce a working stock for use as a template in PCR reactions.
[056] Custom scripts were developed to automate primer design for SLIP and are described in Examples below. The scripts enable the use of the primer selection software Primer3 to design appropriately oriented, abutting but non-overlapping, inverse PCR primers. To improve the likelihood of recovering full-length cDNAs, primer design was restricted to the 5'-most 500 bases of each curated gene model. Target genes were amplified from the library pool in a standard PCR reaction using the PHUSION thermostable DNA polymerase (Finnzymes, Espoo, Finland) according to the manufacturer's instructions. For each targeted gene, both the forward and reverse PCR primers (8 μM each) were phosphorylated in a single 15 μl reaction with T4 polynucleotide kinase (N units) at 37° for 1 hour, followed by an incubation at 65° for 20 minutes to inactivate the kinase.
[057] Each 15 μl PCR reaction included 1.5 μl of working library stock, 1 μM of each 5'-phosphorylated primer (1.85 μl of kinase reaction), 200 μM dNTPs, and 0.3 units of PHUSION DNA polymerase. Reactions were heated to 98° for 30 seconds, followed by 35 PCR cycles including denaturation at 98° for 10 seconds, annealing for 30 seconds, and extension at 72° for 2 minutes, 45 seconds. The annealing temperature for the first five cycles was ramped down linearly from 72° to 68° (touchdown PCR). In the subsequent 30 cycles, the annealing temperature was 68°. After cycling, the reactions were incubated for an additional 5 minutes at 72° to finish the final extension. A 3 μl aliquot of each PCR reaction was examined by agarose gel electrophoresis, but little correlation was observed between the qualitative appearance of samples on agarose gels and subsequent success in obtaining gene-specific clones, so this step was eliminated.
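The reaction set-up described in paragraphs [055] to [057] rests on a few dilution calculations and a touchdown annealing ramp. The sketch below reproduces that arithmetic under stated assumptions: the pool is taken to be the mean of the four library concentrations because equal aliquots were combined, the ramp is assumed to use one annealing temperature per cycle and to include both endpoints, and all function names are hypothetical.

```python
# Library concentrations in ug/ul, as listed in the text.
LIBRARY_UG_PER_UL = {"GH": 1.23, "LD": 1.00, "LP": 1.16, "SD": 0.66}


def pooled_concentration(concs):
    """Concentration of a pool made from equal-volume aliquots (the mean)."""
    return sum(concs.values()) / len(concs)


def template_ng_per_reaction(pool_ug_per_ul, dilution=500, volume_ul=1.5):
    """Nanograms of library DNA added to one PCR reaction."""
    working_ng_per_ul = pool_ug_per_ul * 1000.0 / dilution
    return working_ng_per_ul * volume_ul


def primer_final_um(kinase_um=8.0, kinase_ul=1.85, reaction_ul=15.0):
    """Final primer concentration after adding kinase reaction to the PCR."""
    return kinase_um * kinase_ul / reaction_ul


def touchdown_schedule(start=72.0, end=68.0, ramp_cycles=5, hold_cycles=30):
    """Annealing temperature for every cycle: linear ramp, then constant.

    Assumption: the ramp includes both endpoints, one temperature per cycle.
    """
    step = (end - start) / (ramp_cycles - 1)
    ramp = [start + i * step for i in range(ramp_cycles)]
    return ramp + [end] * hold_cycles


if __name__ == "__main__":
    pool = pooled_concentration(LIBRARY_UG_PER_UL)
    print(f"pool: {pool:.2f} ug/ul")
    print(f"template per reaction: {template_ng_per_reaction(pool):.1f} ng")
    print(f"primer: {primer_final_um():.2f} uM")
    print(touchdown_schedule()[:6])
```

With the stated values the pooled stock comes to about 1.01 μg/μl, in line with the 1 μg/μl quoted; each reaction receives roughly 3 ng of library DNA; and 1.85 μl of kinase reaction gives about 0.99 μM of each primer.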
[058] To exchange the buffer and reduce the concentration of unincorporated dNTPs and primers, the remaining 12 μl of each PCR product was subjected to gel filtration through 300 μl SEPHAROSE G-50 columns (Amersham Bioscience, Piscataway, NJ), equilibrated with dH2O, in 96-well format by standard methods. The recovered filtrate samples were diluted to 30 μl with dH2O.
[059] The linear inverse PCR products were circularized using T4 DNA ligase and standard reaction conditions. Each filtered sample (15 μl) was treated with T4 DNA ligase (5 units) according to standard methods in a 100 μl overnight reaction at 16° to recircularize the products into plasmids by self-ligation. The ligation products were then digested with DpnI to degrade the methylated library template, leaving the un-methylated and circularized PCR products intact. DpnI (5 units) was added to each sample, and the reactions were incubated at 37° for 2 hours to digest unamplified plasmid library DNA and then at 80° for 20 minutes to inactivate the restriction enzyme.
[060] The digested products were transformed into bacterial host cells and plated to produce individual colonies. An aliquot (3 μl) of each ligated and digested sample was transformed into TAM1 competent E. coli host cells (Active Motif) in 96-well format according to the manufacturer's instructions. The entire volume of transformed cells was plated on LB plates containing chloramphenicol (50 μg/ml) and incubated overnight at 37°. Four clones per target were grown overnight in 2X YT medium containing chloramphenicol (50 μg/ml). An aliquot of each culture was used to produce an archival frozen stock, and the remainder was used to prepare plasmid DNA by a standard alkaline lysis protocol.
[061] For each targeted gene, four clones were analyzed by DNA sequencing. Three sequencing reactions were performed on each of the four clones, using two sequencing primers flanking the cloning sites in the vector, plus the sense-strand target-specific PCR primer. The sequence data were analyzed automatically and reviewed manually. A consensus sequence for each clone was assembled from the three sequencing reads and compared to all annotated transcripts of the corresponding target gene. Of the 153 target genes, 92 targets (60%) yielded one or more candidate gene-specific clones in this initial screen. Sequencing reactions were performed with BIGDYE V3 dye-terminator chemistry (Perkin-Elmer, Wellesley, MA) at 1/16th the manufacturer's recommended reaction scale. Sequence data were collected on a PERKIN-ELMER 3730XL capillary device. All templates were sequenced with the primers PM002 (5' end), PM001 (3' end) [Stapleton, M., et al., The Drosophila Gene Collection: Identification of Putative Full-Length cDNAs for 70% of D. melanogaster Genes. Genome Res, 2002. 12(8): p. 1294-300], and the sense target-specific PCR primer. Following data analysis, selected cDNAs were sequenced to completion using additional custom primers.
[062] Sequence Analysis. The results of this initial library screen were examined to look for correlations that might predict whether the screen would successfully recover clones for a particular target gene. Named genes are more likely to have been studied and validated at the molecular level, and so they might be more likely to yield specific clones in a library screen. Of the 153 target genes, 66 (43%) are named genes in the Release 3.1 annotation, and 87 (57%) are un-named genes designated only by a CG (Curated Gene) number. The 66 named genes had 35 (53%) cases in which one or more isolates were successfully recovered and 31 (47%) cases with no successes. Of the 87 un-named genes, 57 (66%) were successful and 30 (34%) failed. Because the library screening method is PCR-based, it was examined whether the rate of successful recovery of gene-specific clones was higher for target genes with shorter transcripts. The median size of the largest Release 3.1 annotated transcripts of the target genes is 1,116 bp for the 61 targets that failed in the screen and 1,654 bp for the 92 targets for which gene-specific clones were recovered.
[063] Sequence trace files were processed using phred and crossmatch to produce vector-masked sequence files with basecalls and associated quality scores [Ewing, B., L. Hillier, M.C. Wendl, and P. Green, Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res, 1998. 8(3): p. 175-85; Ewing, B. and P. Green, Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 1998. 8: p. 186-94]. The set of three sequence reads from each template was assembled using a customized version of phrap [available at URL<http://www.phrap.org/>] in which every trace is included in the assembly. Each sequence assembly was evaluated by a custom script in a series of tests to select clones for full-insert sequencing. Test 1, if the translation of the longest ORF in the contig containing the 5' end read matches the predicted protein sequence of a transcript of the targeted gene, then the IPCR processing of the clone was declared 'done' and the clone was entered into our cDNA processing pipeline for finishing. Our standard cDNA processing pipeline requires quality standards higher than can routinely be achieved with three traces (phrap estimated error rate less than 1/50,000, individual base quality better than q25), so a further round of primer sequencing is performed if needed. This work is designed either manually or by autofinish [Gordon, D., C. Desmarais, and P. Green, Automated finishing with autofinish. Genome Res, 2001. 11(4): p. 614-25]. Test 2, if the sequence assembly produces contigs with only a partial match to the predicted transcript or CDS nucleotides because of either low quality, gaps, or errors in the prediction, the clone was retained for possible further full-length sequencing. A partial match is defined as alignment of at least 50% of the length of the isolate contig containing the 5' read. The percent identity of the match is also reported.
If the contig sequence did not meet this criterion, all contigs were concatenated together and compared to the annotated transcript using sim4. A cutoff of 50% of the length of the clone sequence or 100 bp and a percent identity of 50% over the aligned region was required for inclusion into the cDNA processing pipeline for finishing. Test 3, if the assembly did not show any significant alignment to the target or was derived from the E. coli genome, it was discarded.
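The three tests above amount to a small decision procedure. The following is a minimal sketch of that logic, not the custom script itself: it assumes the alignment statistics have already been computed elsewhere (for example from sim4 output), it reads the concatenated-contig cutoff as "(at least 50% of the clone length or at least 100 bp aligned) and at least 50% identity", and the function name, argument names and the "retain" and "discard" labels are hypothetical.

```python
def classify_assembly(orf_matches_target,
                      contig_aligned_fraction,
                      concat_aligned_fraction,
                      concat_aligned_bp,
                      concat_identity):
    """Classify one clone assembly as 'done', 'retain' or 'discard'.

    orf_matches_target      -- longest ORF of the 5'-read contig translates to
                               the predicted protein of a target transcript
    contig_aligned_fraction -- fraction of the 5'-read contig aligned to target
    concat_aligned_fraction -- fraction of the concatenated contigs aligned
    concat_aligned_bp       -- aligned length of the concatenated contigs (bp)
    concat_identity         -- identity over the aligned region (0.0 to 1.0)
    """
    # Test 1: the ORF matches, so initial processing is declared 'done'.
    if orf_matches_target:
        return "done"

    # Test 2: partial match of the contig containing the 5' read.
    if contig_aligned_fraction >= 0.5:
        return "retain"

    # Test 2, fallback: concatenated contigs compared to the transcript.
    # Assumed reading of the cutoff: (length OR 100 bp) AND identity.
    long_enough = concat_aligned_fraction >= 0.5 or concat_aligned_bp >= 100
    if long_enough and concat_identity >= 0.5:
        return "retain"

    # Test 3: no significant alignment to the target. Assemblies derived from
    # the E. coli genome are expected to end up here as well.
    return "discard"


if __name__ == "__main__":
    print(classify_assembly(False, 0.2, 0.3, 150, 0.8))
```

Returning a label rather than acting on the clone keeps the rule separate from the pipeline bookkeeping, so the thresholds can be changed without touching the rest of the processing.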
[064] After all isolates for a particular target have been evaluated, isolates are selected from the set according to the rules:
(1) if one or more isolates are 'done', the isolate from the set that contains a poly(A) tail, includes the longest 5' UTR, and has the highest sequence quality is selected. If no isolate has a poly(A) tail, proceed with finishing, since the entire targeted CDS is captured, and choose the isolate with the longest 5' UTR. All other isolates are removed from the processing queues. This one isolate is entered into the cDNA processing pipeline for quality assurance, automated annotation, and submission to GenBank.
(2) if there are one or more isolates passing Test 2 from above and no isolates passing Test 1, all candidate isolates are selected for one round of primer walking using primers designed using the target gene sequence. Only one of the isolates is entered into our cDNA processing pipeline for finishing.
[065] For 61 target genes, the initial library screen did not identify a clone. Five of these attempts yielded a clone that corresponded to a gene other than the one being targeted but that lacked a representative cDNA, and these clones were selected for inclusion in the DGC. For an additional 24 target genes, the initial screen yielded gene-specific clones that were compromised in various ways, all previously observed in cDNA libraries, such that they did not represent full-length cDNAs (see below). A second screen was performed on 56 of the 61 target genes for which the initial screen did not identify a clone, and on 13 of the 24 target genes for which the initial screen yielded compromised clones. In this second screen, a ten-fold higher concentration of the cDNA library pool was used as the template, in an attempt to recover very rare clones. Otherwise, the same procedures used in the initial screen were followed.
Analysis of the sequence data from the second screen showed that 23 of the 69 targeted genes yielded one or more gene-specific clones, including 12 of the 56 targets that failed in the first screen, and 11 of the 13 targets for which compromised gene-specific clones were recovered in the initial screen.
[066] Twenty-four targeted genes had EST evidence (Table 1), although not necessarily from the libraries used in this screen. None of the ESTs were suitable for selection to the DGC because of other problems, and these genes needed screening to find a replacement. For eight gene targets no target-specific clones were recovered, while 16 gene targets were successfully recovered.
[067] Overall, the two rounds of library screening produced one or more target-specific clones for 104 (68%) of the 153 target genes shown in Table 1. Further characterization of these clones is described below. The 49 library screening experiments that did not yield target-specific clones fall into three classes. For two target genes, none of the clones yielded sequence data. For 16 target genes, all isolates mapped to genes that were not the intended targets and did not include a complete copy of one or both of the PCR primer sequences. For 31 target genes, all isolates had sequences that included at least one copy of one or both of the PCR primer sequences, but were otherwise unrelated to the target gene.
EXAMPLE 2 Methods for Primer Design for the Screening Method
[068] A single transcript model was selected from the Release 3.1 annotation [Misra, S., et al., Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biology, 2002. 3(12): p. research0083] for each curated gene in the list of targets. For genes with multiple curated transcript models, the first ("RA") model was arbitrarily selected. Primer3 (URL<http://frodo.wi.mit.edu/primer3/primer3_code.html>) designs standard PCR primer pairs and can be used to design primers for multiple sequence targets automatically, but it has no explicit inverse PCR primer design feature, so software was written to manipulate the sequence.
[069] Primer3 was developed for the purpose of designing primers for PCR amplification of DNA with one primer on each strand, flanking the region to be amplified. Since the present process requires the primers to be designed so that the two primers abut at the 5' ends with no overlap and in opposing directions on the template, it was necessary to computationally rearrange our template sequences to mimic the format that Primer3 was designed to operate on. A separate template sequence was constructed at each base location 26-500 in the template sequence by the following method. First, a series of 4 "N"s was added to the 3' terminus of the transcript sequence. Next, the 5' sequence of the transcript (from base 1 to the current base location 26-500) was removed from the 5' end and attached to the 3' end of the "N"s. This generated a linear representation of the circular plasmid, with the intended primer locations at the ends and flanking the sequence to be amplified. This procedure resulted in 475 templates for primer design from each transcript sequence that was at least 500 bp in length. The procedure started at base 26 so that sufficient sequence would be available at the 3' end of the template for primer design.
[070] Next, each template sequence was run through Primer3 to design a PCR primer pair, with constraints imposed using the adjustable parameters. Table 2 shows the parameter settings that were used for Primer3.
Table 2. Primer3 parameter settings.
Figure imgf000026_0001
* for score calculation methods see Rozen S, Skaletsky H (2000) Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp 365-386, which is hereby incorporated by reference.
[071] A critical constraint was to fix the PCR product length equal to the overall template length, forcing the program to design a pair of PCR primers that included the 5' and the 3' termini of the template sequence as required by the screening method. A mis-priming library was also employed to prevent designing primers that were complementary to the cDNA vectors pBluescript SK(+) or pOT2. A primer pair design was produced for each template that had sequences at the ends that allowed primers that met the Primer3 criteria. Primer3 produces an output file that describes attributes of each primer and pair.
[072] All acceptable primers in the Primer3 output file were compared to a database of all curated transcripts in the Release 3.1 annotation using blastn (wublast-2.0 with parameters S=50 Q=200). The blastn output files were parsed to check that the targeted transcript had the highest blastn score and was perfectly aligned over the length of the primer sequence. Next, alignments to other transcripts were analyzed. If there were any gaps in the alignment to non-target transcripts, the primer was not disqualified. If the alignment was shorter than 16 bp, the primer was accepted. If the non-target alignment was equal to or longer than 16 bp, then the alignment was further analyzed in the 18 3'-most bases of the sequence. If fewer than 16 bases aligned in the 18 3'-most bases, the primer was accepted. If greater than or equal to 16 bases aligned in the 18 3'-most bases, the primer sequence was checked to see if the two most 3' bases aligned. If so, the primer was rejected. If not, the primer was accepted.
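The template rearrangement of paragraph [069] and the non-target alignment rules of paragraph [072] can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the scripts actually used: the split is taken to be inclusive of the current base, the alignment statistics are assumed to have been parsed from the blastn output beforehand, and both function names and all argument names are hypothetical. Running Primer3 itself is not shown.

```python
def build_inverse_templates(transcript, first=26, last=500, spacer="NNNN"):
    """Return {split position: rearranged template} for one transcript.

    For each position, bases 1..position are cut from the 5' end and
    re-attached after a run of four Ns at the 3' end, giving a linear
    representation of the circular plasmid. Asking a primer design program
    for a product as long as the whole template then places the two primers
    at the template ends, i.e. abutting at the split position.
    """
    templates = {}
    for pos in range(first, min(last, len(transcript)) + 1):
        head = transcript[:pos]   # bases 1..pos, moved to the 3' end
        tail = transcript[pos:]   # remaining sequence, now at the 5' end
        templates[pos] = tail + spacer + head
    return templates


def accept_against_nontarget(aln_len, has_gaps,
                             aligned_in_3prime_18, last_two_3prime_aligned):
    """Apply the stated rules to one primer/non-target alignment.

    aln_len                 -- length of the alignment in bp
    has_gaps                -- True if the alignment contains gaps
    aligned_in_3prime_18    -- aligned bases within the 18 3'-most primer bases
    last_two_3prime_aligned -- True if the two 3'-most primer bases align
    """
    if has_gaps:                      # gapped alignments do not disqualify
        return True
    if aln_len < 16:                  # short alignments are tolerated
        return True
    if aligned_in_3prime_18 < 16:     # weak match near the 3' end
        return True
    return not last_two_3prime_aligned


if __name__ == "__main__":
    demo = "ACGT" * 150               # 600-base placeholder sequence
    print(len(build_inverse_templates(demo)))
```

For a transcript of at least 500 bases this yields one template per position from 26 to 500, which is the 475 templates stated in the text.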
This process resulted in a reduced set of primer pairs from which to select the optimum pair for each transcript.
[073] To select one primer pair for each targeted transcript, from among the set of all acceptable primers, an objective function was calculated for each primer pair according to the equation:
$$W_{tm}\,\lvert Tm_{avg} - Tm_{opt}\rvert + W_{gc}\,\lvert GC_{avg} - GC_{opt}\rvert + W_{blast}\cdot BlastLength + W_{\Delta tm}\,\lvert \Delta Tm\rvert \qquad \text{(Equation 1)}$$
where $W_{tm}$ is the weight assigned to Tm (0.3), $Tm_{avg}$ is the average Tm of the two primers, $Tm_{opt}$ is the optimum Tm, $W_{gc}$ is the weight assigned to GC content (0.1), $GC_{avg}$ is the average percent GC content of the primers, $GC_{opt}$ is the optimum GC content, $W_{blast}$ is the weight assigned to the BLASTN alignment (0.3), $BlastLength$ is the length of the longest blastn alignment to non-target curated genes, $W_{\Delta tm}$ is the weight assigned to the difference in Tm between the primers (0.3), and $\Delta Tm$ is the difference in Tm between the primers. For each targeted gene, the primer pair with the lowest objective function score was selected.
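Equation 1 and the lowest-score selection rule can be expressed in a few lines. The sketch below uses the four weights given in the text; the optimum Tm and optimum GC content come from the Primer3 settings in Table 2, which are not reproduced here, so they are left as required arguments. The `PrimerPair` record, its field names and the function names are hypothetical.

```python
from typing import NamedTuple

# Weights stated in the text for Equation 1.
W_TM, W_GC, W_BLAST, W_DTM = 0.3, 0.1, 0.3, 0.3


class PrimerPair(NamedTuple):
    """Illustrative record for one candidate primer pair (names assumed)."""
    name: str
    tm_fwd: float        # melting temperature of the forward primer
    tm_rev: float        # melting temperature of the reverse primer
    gc_fwd: float        # percent GC of the forward primer
    gc_rev: float        # percent GC of the reverse primer
    blast_length: int    # longest blastn alignment to a non-target gene


def objective(pair, tm_opt, gc_opt):
    """Score a primer pair with Equation 1; lower is better."""
    tm_avg = (pair.tm_fwd + pair.tm_rev) / 2.0
    gc_avg = (pair.gc_fwd + pair.gc_rev) / 2.0
    delta_tm = pair.tm_fwd - pair.tm_rev
    return (W_TM * abs(tm_avg - tm_opt)
            + W_GC * abs(gc_avg - gc_opt)
            + W_BLAST * pair.blast_length
            + W_DTM * abs(delta_tm))


def select_best(pairs, tm_opt, gc_opt):
    """Return the candidate with the lowest objective function score."""
    return min(pairs, key=lambda p: objective(p, tm_opt, gc_opt))
```

Because the BLASTN term is weighted per aligned base, a few extra bases of non-target alignment can outweigh a one-degree deviation in Tm, so the score favors specificity once the Primer3 constraints are already met.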
EXAMPLE 3 Optimization of the SLIP Screening Method
[074] Performing the present screening on a significant number of genes has provided data for optimizing the process.
[075] Library Complexity. The complexity of the library seemed likely to have been a limiting factor, because the pool was diluted 500-fold for the first round of screening experiments. To test this, a second screen for 69 target genes, including the 56 targets that failed to yield specific clones in the first round of screening, was performed using a ten-fold higher concentration of library pool (50-fold dilution). An additional twelve genes yielded specific clones in this second screen.
[076] Statistical analysis of the results indicates that the additional successes are consistent with the expected increase from selection of more isolates from the screening with the underlying screening success rate identical for both library pools (data not shown). The effect of library concentration suggests that most of the total complexity of the library pool was represented in each sample in the initial round of screening. Note that these libraries had already been extensively sampled by EST sequencing, and this had not yielded clones for 129 of the 153 genes targeted in the described screens. To use this method to recover the missing transcription factor cDNAs it will be necessary to construct cDNA libraries with higher complexity and from additional tissues and developmental stages.
[077] In addition to the necessity that target clones be present in the library template, the quality of the annotation of predicted genes has a direct effect on the success of the screen. PCR primers were designed based on annotations for predicted genes, including some for which no molecular evidence currently exists. Therefore, success in recovering clones also depends upon the accuracy of the gene predictions. In 28 cases, the clones recovered in the screen provide evidence for modifying the corresponding gene annotations. Further examination of the PCR primer sequences and the curated gene models they were designed to target may suggest other ways of improving the screening success rate.
[078] The 49 failed targets are likely a result of mis-priming in the PCR reaction, which almost certainly arose because the target clone was not present in the library pool. Another potential cause for non-target clones is incomplete DpnI digestion of the template plasmid library. However, in many cases the non-target clones do share some sequence complementary to one or both of the PCR primers.
[079] Performing the SLIP screening on a significant number of genes has provided data for optimizing the process. One of the parameters that may be adjusted is the number of isolates selected for end-sequencing. Based on a retrospective analysis of this data, we estimate that by picking four isolates instead of three isolates we have increased our screening success rate by approximately 12%. Similarly, four isolates yield approximately 32% more screening successes than two isolates, and 88% more screening successes than a single isolate. We estimate that picking more than four isolates will give an incremental increase (5%) in the number of successes, which needs to be balanced against the increase in costs of clone processing and end-sequencing. Another parameter that may be adjusted is the number of isolates selected for full-insert sequencing. While in most cases all of the selected clones were identical, there were isolated cases where differences were found. Another area of optimization is in the automated analysis of the end sequence to determine which clones are considered for sequence finishing. Analysis of the finished sequence from this experiment, largely gained through manual examination of the clone sequences, suggests that a criterion of a minimum of 50% sequence identity of the clone with its target annotation over half of the sequence length generated indicates that a clone is worth finishing.
[080] The success of this library screening experiment raises the question of when a cDNA collection project should switch from an EST-based approach to a directed library screening approach. At the end of our EST sequencing project, the last 10,000 EST sequences identified cDNAs representing 96 (1%) new genes not previously represented in the collection. At that point, it was decided that additional EST sequencing was not warranted. If the SLIP screening method had been an available alternative, we might have switched from EST sequencing to directed screening at an earlier stage in the DGC project. To find the optimal time to switch, a complete cost analysis would be required.
[081] In summary, SLIP is an effective method for increasing the representation of genes and transcripts in comprehensive cDNA collections, such as those currently under construction by the NIH Mammalian Gene Collection project for several model organisms and the human. We have successfully used it to efficiently screen for 104 of 153 target genes and recover full length cDNA clones for 84 of the genes.
EXAMPLE 4 Full-Insert Sequencing and Characterization of cDNAs
[082] The sequence data from the 104 candidate target-specific clones were further analyzed to identify potential full-length cDNAs. cDNAs for which the initial three sequencing reads did not produce a complete, high-quality sequence of the cloned insert were selected for sequence finishing. Finishing reads were produced using custom primers designed from the sequence assembly and the Release 3.1 annotated transcript model (see Example 2). During the clone selection and sequence finishing process, if a cDNA was determined to be compromised, a high-quality sequence of the complete insert was not necessarily produced.
[083] The predicted protein sequence encoded by each clone was translated from the longest open reading frame (ORF) in each sequence assembly and compared to the predicted protein annotation for the target gene in the recently available Release 4.1 genome sequence annotation (http://flybase.bio.indiana.edu/). For 49 (47%) of the target genes, the selected cDNA contains a complete ORF that encodes a protein identical to that of the Release 4.1 gene model. For an additional 28 (27%) genes, the cDNA represents a full-length transcript with a complete ORF that is not identical to that of an annotated transcript. These cases provide evidence for modifying Release 4.1 transcript models. Of these 28 genes, cDNA clones for 16 are classified as "5' extension", "3' extension" or "5' short with upstream in-frame stop codon", which modify the amino or carboxy terminus of the gene model. cDNA clones for 10 genes are classified as encoding "exon variants" and consist of three cDNAs that encode alternate amino termini, four cDNAs that encode other amino acid differences when compared to the annotated transcripts, and three cDNAs that encode proteins that are smaller than the annotated predictions.
Lastly, one selected cDNA represents a dicistronic transcript containing both CG17197 and CG17198, while another selected cDNA merges three annotated genes (CG15781, CG15782 and CG15783) into one gene with a single, long ORF.
[084] For the remaining 27 (26%) genes, one or more clones are gene-specific, but all are compromised in various ways. All but four of these contain artifacts known to occur in cDNA libraries, including nucleotide discrepancies, truncations of the 5' and/or 3' ends, genomic clone contaminants, chimeras and retained introns. These artifacts are an unavoidable issue in cDNA library screening. The clones recovered for the other four target genes are compromised by artifacts of the screening procedure itself. One clone contains just one of the two PCR primer sequences, two clones contain multiple concatenated copies of both primer sequences, and a fourth has a 2-bp deletion at the point of ligation where the 5' ends of the two primers abut. This last clone is in all other respects a full-length cDNA. One clone corresponds to an antisense transcript, an unexpected but interesting product. Table 3 summarizes the analysis of the screen to recover cDNA clones for 153 transcription factors.
Table 3. Summary of cDNA Clones Recovered
                                                                      Total screen
Genes targeted                                                            153
Targets for which a gene-specific clone was recovered                     104
Clones that encode complete ORFs:
  ORFs identical to the Release 4.1 predicted proteins³                    46
  ORF identical to the Release 4.1 predicted proteins but co-ligated⁴       3
  Subtotal                                                                 49
Clones that alter Release 4.1 annotations⁵:
  5' extension                                                             11
  3' extension                                                              3
  5' short with upstream in-frame stop codon                                2
  3' truncated with downstream in-frame stop                                0
  exon variants                                                            10
  dicistronic                                                               1
  gene merges                                                               1
  Subtotal                                                                 28
Clones that are compromised:
  nucleotide discrepancies⁶                                                 4
  5' short⁷                                                                 1
  3' truncated⁵                                                             5
  5' and 3' short                                                           1
  co-ligated inserts⁸                                                       4
  antisense transcripts⁹                                                    1
  genomic contaminant¹⁰                                                     4
  retained intron¹¹                                                         3
  screening artifacts¹²                                                     4
  Subtotal                                                                 27
Note 2: The sixty-nine attempts include 61 that failed to identify a correct target clone and eight that produced compromised clones.
Note 3: The ORF predicted from the cDNA sequence is identical to the corresponding Release 4.1 predicted protein; the clones are from the LD, GH, HL, LP and SD cDNA libraries.
Note 4: The ORF predicted from the cDNA sequence is identical to the corresponding Release 4.1 predicted protein but co-ligated with an ORF of another gene, an artifact commonly observed in cDNA libraries.
Note 5: These clones have structures that are inconsistent with the corresponding Release 4.1 predicted gene. The 5'-short and 3'-truncated clones may reflect alternative splice products. Those clones referred to as putative exon variants are cases in which the cDNA clone contains additional nucleotides that are a multiple of 3, relative to the Release 4.1 predicted mRNA, and maintains the open reading frame. These cases will be resolved by modifying the Release 4.1 gene model.
Note 6: These clones align well to the Release 4.1 predicted transcript, but have nucleotide differences that are most likely the result of errors generated by reverse transcriptase (RT) during library construction. These include missense and frameshift (+/-1 or +/-2 nt difference) changes in the predicted ORF relative to the Release 4.1 predicted protein.
Note 7: The ORF predicted from the cDNA sequence is missing the N-terminal portion of the Release 4.1 predicted protein for the 5'-short class, or missing the C-terminal portion for the 3'-truncated class.
Note 8: These clones carry two unrelated ORFs and are almost certainly the result of two cDNA molecules being cloned into the same plasmid vector during library construction.
Note 9: These clones overlap Release 4.1 predicted genes but are transcribed from the strand opposite to that of the mRNA encoding the Release 4.1 predicted protein; a number of such cases were documented in the reannotation of the genome (ref), and the existence of such antisense transcripts has been reported in many organisms (ref).
Note 10: These clones contain short, intron-less fragments of genomic DNA.
Note 11: These clones contain poly-adenylated cDNA inserts that include an unprocessed intron.
Note 12: These clones represent artifacts of the screening method. One clone contains a deletion of 2 bp at the ligation junction between the two PCR primers; this is otherwise a good cDNA clone. Another clone contains three PCR primer sequences and is rearranged with respect to the genomic sequence. A third clone contains only one of the PCR primer sequences and represents a fragment of 3' UTR.
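The counts in Table 3 can be checked with a short tally. The sketch below, in Python, is only a consistency check on the figures reported above: it uses the row counts as printed for the two categories whose rows are all legible, takes the subtotal of 49 for the complete-ORF category as given, and recomputes the rounded percentages. The variable names are illustrative.

```python
GENES_TARGETED = 153
GENE_SPECIFIC_CLONES = 104
COMPLETE_ORF_SUBTOTAL = 49  # subtotal as printed in Table 3

alters_annotation = {
    "5' extension": 11,
    "3' extension": 3,
    "5' short with upstream in-frame stop codon": 2,
    "3' truncated with downstream in-frame stop": 0,
    "exon variants": 10,
    "dicistronic": 1,
    "gene merges": 1,
}

compromised = {
    "nucleotide discrepancies": 4,
    "5' short": 1,
    "3' truncated": 5,
    "5' and 3' short": 1,
    "co-ligated inserts": 4,
    "antisense transcripts": 1,
    "genomic contaminant": 4,
    "retained intron": 3,
    "screening artifacts": 4,
}


def percent(part, whole):
    """Percentage rounded to the nearest integer, as reported in the text."""
    return round(100 * part / whole)


subtotals = {
    "complete ORF": COMPLETE_ORF_SUBTOTAL,
    "alters annotation": sum(alters_annotation.values()),
    "compromised": sum(compromised.values()),
}

# Clones with a complete ORF, whether or not the annotation needs revising.
full_length = subtotals["complete ORF"] + subtotals["alters annotation"]

for name, count in subtotals.items():
    print(f"{name}: {count} ({percent(count, GENE_SPECIFIC_CLONES)}% of recovered)")
print(f"full-length: {full_length} ({percent(full_length, GENES_TARGETED)}% of targets)")
```

The three subtotals add up to the 104 targets for which a gene-specific clone was recovered, and the two complete-ORF categories together account for 77 of the 153 targeted genes.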
[085] For the 16 targeted genes that had previous EST evidence but required a replacement clone, we recovered eight cDNA clones that represent full-length transcripts with complete ORFs identical to that of an annotated transcript, two cDNAs that provide evidence for modifying Release 4.1 transcript models, and six cDNAs that are compromised. In summary, cDNAs for 77 (50%) of the 153 target genes meet the current criteria for inclusion in the Drosophila Gene Collection. Table 4 shows a summary of screen positives that match Release 4.1 predicted genes.
[Table 4: reproduced as three page images in the original publication; contents not available as text]
LEGEND: 5' extension (B); 3' extension (C); 5' short with upstream in-frame stop codon (D); exon variant (F); nucleotide discrepancy (G); 5' short (H); 3' short (I); co-ligated (J); 5' and 3' short (K); antisense (L); dicistronic (M); genomic contaminant (N); retained intron (O); merge (P); iPCR artifact (Q)
[086] The above examples, methods, procedures, reagents and materials contained herein are meant to exemplify and illustrate the invention and should in no way be seen as limiting the scope of the invention. Changes and modifications in the specifically described embodiments and examples can be carried out without departing from the scope of the invention. All patents, publications and references referred to herein are incorporated by reference.

Claims

What is claimed is:
1. A method for recovery of clones from a plasmid library comprising the steps of:
a. providing a pair of oppositely directed inverse PCR primers that abut each other at their 5' ends, wherein said inverse PCR primers are sequence-specific to a gene or transcript of interest and phosphorylated at their 5' ends;
b. mixing said phosphorylated inverse PCR primers and at least one template comprising said gene or transcript of interest;
c. amplifying said gene or transcript of interest by inverse PCR to generate a linear PCR product;
d. digesting any unamplified template;
e. ligating said linear PCR product to circularize the PCR product; and
f. isolating said circularized PCR product.
2. The method of claim 1, further comprising the step of polishing the ends of the PCR products with a DNA polymerase before the ligating step to ensure blunt ends were created during PCR amplification.
3. The method of claim 1, wherein said template is comprised of a plasmid library containing said gene or transcript of interest.
4. The method of claim 1, wherein the amplification step comprises polymerization with a high-fidelity thermostable polymerase.
5. The method of claim 1, wherein the digesting step of any unamplified template comprises use of a methylation-sensitive DNA restriction enzyme.
6. The method of claim 1, wherein the ligating step comprises enzymatic use of a ligase enzyme.
7. The method of claim 1, wherein said circularized PCR product is the intact gene or transcript of interest, including full-length insert and plasmid vector.
8. The method of claim 1, wherein the isolating step comprises transformation of the circularized PCR product into a host and the subsequent isolation of the transformants for selection of a full-length, intact gene or transcript of interest.
9. A kit for recovery of clones from a plasmid library comprising: a set of instructions directing a user to carry out the method of claim 1; and reaction volumes of components useful for carrying out the method of claim 1.
10. The kit of claim 9, comprising sufficient reaction volumes of components useful for carrying out the method at least twice.
11. The kit of claim 9 comprising inverse PCR reaction components, wherein said inverse PCR reaction components comprise aliquots of buffer, nucleotides, polymerase, control primers and control templates.
12. The kit of claim 9 further comprising digestion and ligation components, wherein said digestion and ligation components comprise a DNA restriction enzyme and a ligase.
13. The kit of claim 9 further comprising components to carry out transformation and isolation of PCR products, wherein said components comprise plasmids and competent cells.
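
The primer geometry recited in step (a) of claim 1 can be illustrated with a short Python sketch. It is an illustration only and forms no part of the claims: the function names, the choice of junction position and the default primer length of 20 nt are all assumptions made for the example. Given a known stretch of the target transcript, the forward primer is read rightward from the junction and the reverse primer is the reverse complement of the bases immediately to its left, so that the 5' ends of the two primers abut and the primers point away from each other.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def abutting_inverse_primers(target, junction, length=20):
    """Return (forward, reverse) primers, both written 5' to 3'.

    The forward primer starts at `junction` and extends rightward; the
    reverse primer starts just left of `junction` and extends leftward.
    Their 5' ends therefore abut at `junction`. `length` is a hypothetical
    default, not a value taken from the specification.
    """
    target = target.upper()
    if junction - length < 0 or junction + length > len(target):
        raise ValueError("junction too close to the end of the known sequence")
    forward = target[junction:junction + length]
    reverse = reverse_complement(target[junction - length:junction])
    return forward, reverse


known = "AACCGGTTACGATCGA"
fwd, rev = abutting_inverse_primers(known, junction=8, length=4)
# Rejoining the primer 5' ends restores the original sequence at the junction.
print(fwd, rev, reverse_complement(rev) + fwd == known[4:12])
```

Because the two primers leave no gap and share no overlap at the junction, ligating the ends of the linear inverse PCR product is expected to restore the original sequence, which is why a 2-bp deletion at that point is classified above as a screening artifact.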
PCT/US2005/016765 2004-05-12 2005-05-11 Rapid and efficient cdna library screening by self-ligation of inverse pcr products WO2005121419A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57058204P 2004-05-12 2004-05-12
US60/570,582 2004-05-12

Publications (2)

Publication Number Publication Date
WO2005121419A2 (en) 2005-12-22
WO2005121419A3 WO2005121419A3 (en) 2006-06-08

Family

ID=35503746

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/016765 WO2005121419A2 (en) 2004-05-12 2005-05-11 Rapid and efficient cdna library screening by self-ligation of inverse pcr products

Country Status (1)

Country Link
WO (1) WO2005121419A2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4965188A (en) * 1986-08-22 1990-10-23 Cetus Corporation Process for amplifying, detecting, and/or cloning nucleic acid sequences using a thermostable enzyme
US5789166A (en) * 1995-12-08 1998-08-04 Stratagene Circular site-directed mutagenesis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4965188A (en) * 1986-08-22 1990-10-23 Cetus Corporation Process for amplifying, detecting, and/or cloning nucleic acid sequences using a thermostable enzyme
US5789166A (en) * 1995-12-08 1998-08-04 Stratagene Circular site-directed mutagenesis
US5932419A (en) * 1995-12-08 1999-08-03 Stratagene, La Jolla, California Circular site-directed mutagenesis
US6391548B1 (en) * 1995-12-08 2002-05-21 Stratagene Circular site-directed mutagenesis
US20030064516A1 (en) * 1995-12-08 2003-04-03 Stratagene And Children's Medical Center Corporation Circular site-directed mutagenesis
US6713285B2 (en) * 1995-12-08 2004-03-30 Stratagene Circular site-directed mutagenesis
US20040253729A1 (en) * 1995-12-08 2004-12-16 Stratagene And Children's Medical Center Corporation Circular site-directed mutagenesis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
'Stratagene ExSite PCR-Based Site-Directed Mutagenesis Kit' CATALOG #20052, REVISION #073006 2003, XP002998636 *
ZHENG ET AL.: 'cDNA cloning by amplification of circularized first strand cDNAS reveals non-IRE-regulated ion-responsibve mRNAS' BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS vol. 275, 2000, pages 223 - 227, XP002170432 *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase