WO2002014553A2 - A molecular vector identification system

Info

Publication number: WO2002014553A2
Application number: PCT/US2001/025106
Authority: WO
Inventors: Daniel P. Gold; Robert J. Shopes
Original assignee: Favrille, Inc.
Priority date: 2000-08-11
Filing date: 2001-08-10
Publication date: 2002-02-21
Also published as: AU2001283272A1; WO2002014553A3

Abstract

The invention relates to a method for constructing a nucleic acid containing an identification insert, comprising providing a nucleic acid and an identification insert, wherein the identification insert is incorporated into the nucleic acid. The invention also relates to nucleic acids containing identification inserts.

Description

A MOLECULAR VECTOR IDENTIFICATION SYSTEM

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application Serial No. 60/224,618, filed August 11, 2000, which is hereby incorporated herein by reference in its entirety, including any figures, tables, or drawings.

Co-pending U.S. Provisional Patent Applications entitled "Multi-Dimensional Chromatography Purification of Recombinant Immunoglobulin Proteins," Serial No. 60/224,942; "Expression Vectors for Production of Recombinant Immunoglobulin," Serial No. 60/224,722; "Method for Producing an Idiotypic Vaccine," Serial No. 60/224,723; "Method and Composition for Altering a T Cell Mediated Pathology," Serial No. 60/266,133; and "Method and Composition for Altering a B Cell Mediated Pathology," Serial No. 60/279,079 are hereby incorporated by reference in their entirety, including any figures, tables, or drawings.

FIELD OF THE INVENTION

The invention disclosed herein relates to methods and compositions that permit an investigator to internally incorporate an identification label into a nucleic acid, such as a vector.

BACKGROUND OF THE INVENTION

Researchers in molecular biology frequently create a series of vectors where the inserted gene or genetic material differs by only a few nucleotides from the other genetic material present in each vector of the series. As an example, a researcher might create a series of expression vectors, each beginning from the same starting vector, in which a different allele of a gene is inserted. Allelic differences may be as small as one or two nucleotides in a gene that could be several kilobases in length. As another example, a researcher might create a series of gene mutants where only a nucleotide or two in the entire gene have been changed. Commonly, vectors with different inserts can be distinguished on the basis of the size of the insert or the pattern of fragments produced by restriction enzymes. The utility of both these techniques is severely limited when applied to genes that differ by only a few bases. Although one could technically sequence each expression vector to determine the identity of the insert, this course of action is costly, both in time and money. In addition, for differences located internally in the genes of interest, custom-designed primers would need to be synthesized and characterized. What is required is a fast and efficient method for the identification of one vector, or nucleic acid in any form, in a series of similar vectors or nucleic acids. This would be especially critical in a clinical setting where a researcher or clinician needs to be certain of the identity of a plasmid he or she plans to use therapeutically.

SUMMARY OF THE INVENTION

The invention relates to a method for constructing a nucleic acid containing an identifying sequence of nucleic acids known as the molecular bar code ("MBC") comprising a series of nucleic acids which forms an identification insert, wherein the molecular bar code is attached to the nucleic acid sequence that is to be tracked through its association with the molecular bar code. The invention also relates to nucleic acids containing molecular bar codes, wherein the molecular bar code consists of a series of nucleotides that was synthesized randomly but is sequenced after incorporation into the vector.

In one aspect, the invention relates to a method for tracking genetic material that has been isolated from a patient and cloned into vectors which will be subsequently used for replication or expression. The method comprises the step of inserting a molecular bar code into a vector that the researcher wishes to subsequently identify, where the vector comprises genetic material obtained from the patient. This uniquely associates a particular molecular bar code with the genetic material isolated from a given patient. This molecular bar code can then be subsequently detected in a sample whenever it is necessary to determine the presence and identity of the vector and its associated gene of interest. Detecting the presence of the unique molecular bar code in a sample tracks the genetic material from a given patient to this sample.

In one embodiment of this aspect, the molecular bar code is detected and identified by sequencing. In other embodiments of this aspect, the genetic material from the patient is all or a portion of a gene encoding an isolated T cell receptor variable region or immunoglobulin variable region associated with a pathology. In further embodiments of this aspect, the vector is a plasmid or an expression vector.

In a further aspect, the invention relates to a method for tracking a vector comprising a gene of interest that has been isolated from a patient and cloned into this vector for purposes of replication or expression of the gene of interest. The method comprises the step of inserting a molecular bar code into a vector that the researcher wishes to subsequently identify, where the vector comprises a gene of interest obtained from the patient. This uniquely associates a particular molecular bar code with a vector comprising the gene of interest isolated from a given patient. This molecular bar code can then be subsequently detected in a sample whenever it is necessary to determine the presence and identity of the vector. Detecting the presence of the unique molecular bar code in a sample tracks the gene of interest from a given patient to this sample.

In one embodiment, the molecular bar code is detected and identified by sequencing. In other embodiments of this aspect, the gene of interest from the patient is all or a portion of a gene encoding an isolated T cell receptor variable region or immunoglobulin variable region associated with a pathology. In further embodiments of this aspect, the vector is a plasmid or an expression vector. In another aspect, the MBC is inserted into a specific site in the vector or expression vector that has been designed to facilitate the insertion and maintenance of the MBC.

In other preferred embodiments, the invention relates to a method for constructing a nucleic acid containing a molecular bar code where a molecular bar code is inserted into the nucleic acid as part of the nucleic acid's backbone. In one embodiment of this invention, the nucleic acid is selected from the group consisting of a plasmid, a vector, and an expression vector. In another embodiment of this invention, the MBC can be synthesized as part of the nucleic acid, or the MBC and the nucleic acid can be synthesized or made independently and then ligated into one nucleic acid.

In preferred embodiments of this invention, the MBC can be synthesized as a random series of nucleotides, or the sequence of the MBC can be defined before sequencing. In other preferred embodiments, the MBC is sequenced as a random series of nucleotides between defined ends that are designed to form targets for restriction endonucleases after annealing. However, it is not necessary to use restriction enzymes to form compatible ends for insertion of the MBC into the nucleic acid, and other methods for accomplishing this goal are known to those skilled in the art.

In still another aspect, the invention consists of a means for specifically identifying a vector comprising genetic information from a given patient by inserting a means for identifying such a vector into the nucleic acid backbone of the vector and then detecting the means for identifying the vector.

In an additional aspect, a vector comprising a MBC is prepared by a process comprising the following steps: (i) preparing a vector comprising the genetic material or gene of interest to accept the molecular bar code by digesting the vector with appropriate restriction endonucleases; (ii) preparing a molecular bar code by synthesizing an oligonucleotide chain comprising the steps of (a) synthesizing one strand of a restriction endonuclease target site, (b) randomly synthesizing from 10 to 100 nucleotides, (c) synthesizing one strand of a restriction endonuclease site at the other end of the oligonucleotide, (d) preparing a complementary strand and annealing it to said synthesized oligonucleotide, and (e) preparing the double-stranded oligonucleotide for ligation by digesting it with the appropriate restriction endonucleases; and (iii) ligating said vector with said molecular bar code. In a final aspect, the invention consists of a kit for preparing a vector comprising a molecular bar code, the kit comprising a series of containers each of which comprises a DNA molecule comprising a unique molecular bar code.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 outlines an example of the basic steps of the method to create, incorporate, and utilize a molecular bar code in a vector.

DETAILED DESCRIPTION OF THE INVENTION

The invention disclosed here relates to methods and compositions for identifying a nucleic acid segment. The invention contemplates the inclusion of a random series of nucleotides into a nucleic acid sequence of interest for use as an identifier for the nucleic acid sequence. This identifying sequence has been given the name "molecular bar code." The nucleotide sequence used as an identification insert acts as a unique identifier that allows a researcher to positively identify a vector with a minimum of effort. This series of random nucleotides does not intentionally contain useful genetic information; for example, it is not expected to encode the synthesis of a protein, to signal the transcription or translation machinery of a cell, or to specify the cleavage site of an enzyme or restriction endonuclease. One advantage of this method is that, because the random, identifying sequence is incorporated into the sequence of the nucleic acid of interest, the identifier will be reproduced with every subsequent copy made from the nucleic acid of interest. This feature allows the MBC to act as a tag that specifically identifies the source of the genetic material even after many replications of that material have occurred.

One application for the invention allows one to confirm the identity of a vector without sequencing large portions of the nucleic acid sequence. This capability is especially useful in monitoring vectors that are subjected to multiple generations of growth. For example, when a researcher or clinician wishes to subclone a gene of interest into a vector, the researcher or clinician will likely generate various batches of this vector to use in subsequent experiments. A particular batch of the vector may be grown to yield sufficient material to perform nucleic acid sequencing. In the case of an expression vector, numerous additional batches of vector may be grown to express a nucleic acid sequence that was inserted into the expression vector. It would be helpful to the researcher or clinician to have an identifier present within the sequence of the vector to confirm that the vector at the end of the growth experiment or expression was the same vector that was introduced at the beginning of the experiment. The methods and compositions of the disclosed invention accomplish this goal. This is especially useful when the researcher or clinician is generating a large number of clones that are very similar in sequence.

The nucleotide sequence of a molecular bar code can be randomly generated from the four standard deoxynucleotides: guanine, thymine, adenine, and cytosine or by using equivalent deoxynucleotides. In an alternative embodiment, ribonucleotides are used to synthesize the identification inserts. In general, a population of identification inserts is generated using standard nucleic acid synthesis chemistry, which is well known to those of ordinary skill in the art. Typically, oligonucleotides are synthesized using solid phase chemistry in the 3'-to-5' direction, starting with a column containing the 3' nucleotide temporarily immobilized on glass beads. One skilled in the art will understand how to create a random series of nucleotides using a commercially available oligonucleotide synthesizer. One of skill in the art will also understand that a random series of nucleotides could be produced by randomly selecting the nucleotide that was to be placed at each position in the oligonucleotide and then specifically synthesizing that oligonucleotide that has the randomly selected sequence.
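
The second approach described above, in which the nucleotide at each position is selected at random before a defined oligonucleotide is ordered or synthesized, can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `random_barcode` and the use of a pseudo-random number generator are assumptions and are not part of the disclosure. The 10 to 100 nucleotide bounds are taken from the lengths stated elsewhere in this description.

```python
import random

NUCLEOTIDES = "ACGT"  # the four standard deoxynucleotides, written as bases

def random_barcode(length, rng=None):
    """Pick one nucleotide at random for each position (hypothetical helper)."""
    if not 10 <= length <= 100:
        raise ValueError("this description contemplates 10 to 100 nucleotides")
    if rng is None:
        rng = random.Random()
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

# A seeded generator makes the selection reproducible for record keeping.
barcode = random_barcode(20, random.Random(0))
print(barcode)
```

The resulting string would then be supplied to a synthesizer as a defined sequence, so the bar code is known before it is incorporated into a vector.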

In another embodiment, a population of identification inserts is generated using a manual synthesis system. In another embodiment, a nucleic acid synthesizer is used to produce the identification inserts. An example of a suitable nucleic acid synthesizer is the ABI 3948 Nucleic Acid Synthesis and Purification System (Applied Biosystems, Inc., Foster City, CA). One embodiment of the invention involves the synthesis of an entire vector using chemical methods. In this embodiment, the sequence of a vector or nucleic acid sequence is known and is provided to a nucleotide synthesizer for production. During the synthesis of the nucleic acid of interest, a segment of random nucleotides is included in the synthesis protocol. Following synthesis of this segment, the synthesizer returns to generating the previously provided sequence. Once the first strand of the nucleic acid is complete, it is circularized and made double-stranded. This process will produce a library of nucleic acids of interest, with each containing a random stretch of nucleotides. The random stretch of nucleotides will act as a molecular bar code insert that will permit an investigator to differentiate one nucleic acid from another.

In an alternative embodiment, rather than synthesizing the entire nucleic acid of interest that includes a molecular bar code, the inserts can be produced themselves and later introduced into a nucleic acid of interest. As discussed above, the molecular bar code can be generated using any standard nucleic acid synthesis technology. When the molecular bar code is to be later inserted into a nucleic acid of interest, typically an investigator will determine the desired length of the insert for generation and program the synthesizer accordingly. The synthesizer produces a library of random single-stranded oligonucleotide sequences that will ultimately serve as identification inserts, that is, as molecular bar codes.

In addition to the random sequence of nucleotides, the identification inserts may also contain additional sequence at the 5' and/or the 3' ends of the insert to facilitate subcloning into a vector or into another nucleic acid sequence of interest. For example, a restriction endonuclease recognition site can be incorporated at either the 5', the 3', or both ends of an otherwise randomly generated identification insert. The inclusion of such a site would facilitate introduction of the identification insert into a vector or other nucleic acid of interest. Alternatively, a purely random sequence can be generated without any additional sequences at the 5' and/or 3' ends. The molecular bar codes useful in the present invention do not include classic polylinker sequences that contain a number of restriction endonuclease sites, but do not contain the random series of nucleotides required for the molecular bar codes of the present invention.
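
As a hedged illustration of an identification insert with defined ends, the sketch below flanks a random core with two restriction endonuclease recognition sites and rejects any core that would create an extra copy of either site. The choice of EcoRI (GAATTC) and BamHI (GGATCC) and the helper name `build_insert` are assumptions made for illustration; this description does not name particular enzymes.

```python
import random

ECORI = "GAATTC"  # assumed 5' site; not specified in this description
BAMHI = "GGATCC"  # assumed 3' site; not specified in this description

def build_insert(core_length, five_prime=ECORI, three_prime=BAMHI,
                 rng=None, max_tries=1000):
    """Return 5' site + random core + 3' site with one copy of each site."""
    if rng is None:
        rng = random.Random()
    for _ in range(max_tries):
        core = "".join(rng.choice("ACGT") for _ in range(core_length))
        oligo = five_prime + core + three_prime
        # Each site must occur exactly once, so digestion cuts only at the ends.
        if oligo.count(five_prime) == 1 and oligo.count(three_prime) == 1:
            return oligo
    raise RuntimeError("no suitable random core found")

oligo = build_insert(20, rng=random.Random(1))
print(oligo)
```

Both assumed sites are palindromic, so checking one strand also covers the complementary strand. A non-palindromic site would need an additional check against its reverse complement.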

Once the single-stranded oligonucleotides are complete, they are freed from the synthetic matrix, purified, and placed in an appropriate buffer for further manipulation. At this point the oligonucleotides comprise a library of individual, random sequences that are largely unique, depending on the length of the oligonucleotides and the number of molecules produced. Typically the synthesized oligonucleotides are purified by size to remove any non-incorporated nucleotides as well as oligonucleotides that were prematurely terminated during the synthesis process. The purified oligonucleotide library is then typically freeze-dried and then resuspended in an appropriate buffer.
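
The dependence of uniqueness on length and on the number of molecules can be estimated with a short calculation. The sketch below assumes, for illustration only, that every position is drawn independently and uniformly from the four nucleotides, which real synthesis chemistry only approximates. Under that assumption there are 4^n possible sequences of length n, and the standard birthday-problem approximation gives the chance that at least two of k selected bar codes coincide. The function names are hypothetical.

```python
import math

def barcode_space(length):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** length

def collision_probability(length, n_barcodes):
    """Approximate chance that at least two of n random bar codes are equal.

    Uses the birthday-problem approximation 1 - exp(-k(k-1) / (2 * 4^n)).
    """
    space = barcode_space(length)
    return -math.expm1(-n_barcodes * (n_barcodes - 1) / (2 * space))

for length in (10, 15, 20):
    p = collision_probability(length, 1000)
    print(length, barcode_space(length), f"{p:.3g}")
```

Under these assumptions, lengthening the random stretch by a single nucleotide multiplies the number of possible bar codes by four, which is why even short inserts can distinguish a large series of vectors.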

Once prepared, the single-stranded oligonucleotides are made double-stranded using standard techniques well known in the science of molecular biology. A general review of such molecular biology techniques can be found in Ausubel, et al., Current Protocols in Molecular Biology, John Wiley & Sons, Inc., 2000, which is hereby incorporated by reference in its entirety. For example, when the single-stranded oligonucleotides possess restriction endonuclease recognition sites at their respective 3' ends, a primer complementary to this sequence is generated and added to the resuspended single-stranded oligonucleotide library under conditions that promote the specific binding of the primers to the various members of the oligonucleotide library. Once bound, the oligonucleotide-primer pairs are subjected to a single round of the polymerase chain reaction, or any other polymerization reaction mixture that will extend the primers using the oligonucleotide library members as templates. When the reaction is complete, the library of formerly single-stranded oligonucleotides is transformed into a library of double-stranded molecular bar codes. The products of the polymerization reaction are then purified and prepared for insertion into a population of nucleic acid molecules, such as vectors, using standard molecular biology techniques. Alternatively, when the single-stranded oligonucleotides possess restriction endonuclease recognition sites at their respective 5' and 3' ends, primers complementary to both ends of this sequence are generated and added to the resuspended single-stranded oligonucleotide library under conditions that promote the specific binding of the primers to the various members of the oligonucleotide library. Once bound, the oligonucleotide-primer pairs are subjected to multiple rounds of the polymerase chain reaction, or any other polymerization reaction mixture that will extend the primers using the oligonucleotide library members as templates. When the reaction is complete, the library of formerly single-stranded oligonucleotides is transformed into a library of double-stranded molecular bar codes.
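
The relationship between the template oligonucleotide, the primer that binds its 3' end, and the extended second strand can be modeled as string operations. The sketch below is an idealized illustration, not a laboratory protocol; the example sequence, the six-nucleotide primer length, and the helper names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def primer_for_3prime_end(template, primer_length):
    """Primer (5' to 3') that anneals to the 3' end of the template."""
    return reverse_complement(template[-primer_length:])

def extend(template, primer):
    """Idealized single round of extension: the full complementary strand."""
    product = reverse_complement(template)
    if not product.startswith(primer):
        raise ValueError("primer does not anneal to the 3' end of the template")
    return product

# Hypothetical oligonucleotide: assumed site + random core + assumed site.
template = "GAATTC" + "ACGTTGCAAGCTTACGATCG" + "GGATCC"
primer = primer_for_3prime_end(template, 6)
second_strand = extend(template, primer)
print(primer, second_strand)
```

Because the primer is defined only by the fixed 3' end, one primer serves every member of the library regardless of its random core.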

When the single-stranded oligonucleotides are composed completely of random nucleotides without a restriction endonuclease recognition site or other defined region, other methods are necessary to make these molecules double-stranded. One solution is to generate an additional population of short random oligomers of approximately 4 or more nucleotides in length to serve as the primers for the polymerization process. These random primers are purified and added to the library of single-stranded oligonucleotides to permit the generation of double-stranded molecular bar codes, as described above.

Following generation of the double-stranded molecular bar codes, a population of nucleic acids is prepared to receive the identification inserts. In one embodiment, a population of vectors is linearized to produce ends that are compatible with the ends of the molecular bar code. Typically the vector will be linearized using a restriction endonuclease. If the molecular bar code does not contain restriction endonuclease recognition sites, the vector is linearized to yield blunt ends. When the molecular bar code contains one or more restriction endonuclease sites, the vector will be linearized using one or more restriction endonucleases that produce ends that are compatible with the cut ends of the molecular bar code. Once the vector molecules and molecular bar codes have been prepared, the inserts are subcloned into the vectors using standard molecular biology techniques.

The term "tracking" refers to repetitively identifying a vector from an isolated or purified state through a host cell used for replicating or expressing the vector to a nucleic acid molecule isolated or purified from the host cell following replication or expression. Tracking may also refer to confirming the identity of a vector that has been isolated or purified and retained in a storage receptacle, such as a freezer vial for storage in liquid nitrogen.

The term "genetic material" refers to an inheritable unit of DNA including a gene on a human chromosome or in a bacterial plasmid or any segment of nucleic acid such as an element of nucleic acid, regulatory sequence, intron, exon and the like. The "genetic material" may be described as a linear chain of deoxyribonucleotides that may be referred to by the name of the gene or by the sequence of nucleotides forming the chain. "Sequence" can be used to indicate both the ordered listing of the nucleotides that form the chain, and the deoxyribonucleotide chain itself which has that sequence of nucleotides. ("Sequence" is used in the same way when referring to RNA chains, linear chains made of ribonucleotides, and is also used in a similar fashion when referring to polypeptides, in which the backbone is a linear chain made from amino acids.) The term "genetic material" may include regulatory and control sequences, sequences that can be transcribed into an RNA molecule, and may contain sequences with unknown function. Some of the RNA products (products of transcription from genetic material) are messenger RNAs (mRNAs) that initially include ribonucleotide sequences which are translated into a polypeptide and ribonucleotide sequences which are not translated. The sequences that are not translated include control sequences and may include some sequences with unknown function. The coding sequences of many mammalian genes are discontinuous in the chromosome, having sequences present in the mature RNA, exons, along with non-coding sequences, introns. The exons and introns are both transcribed initially into the precursor RNA molecule from the chromosomal or plasmid DNA, with the introns being subsequently removed with the concomitant splicing of exons resulting in a single, linear, mature mRNA molecule.

The term "gene" as used in this application refers to a linear region of DNA that encodes a protein. Each gene is composed of a linear chain of deoxyribonucleotides which, when transcribed and processed, will produce an RNA molecule comprising an open reading frame encoding a protein. The gene itself may be referenced by the sequence of nucleotides comprising the chain. The term "gene" may also encompass associated regulatory and control sequences, sequences which can be transcribed into an RNA molecule, and may contain sequences with unknown function. Some of the RNA products from the transcription of DNA are messenger RNAs (mRNAs) that initially include ribonucleotide sequences (or sequence) which are translated into a polypeptide and ribonucleotide sequences which are not translated. The sequences that are not translated may include control sequences, may include some sequences with currently unknown functions, and may include sequences which are spliced out of the initial transcript as it is processed to form a mature mRNA molecule. It should be recognized that small differences in nucleotide sequence for the same gene could exist between different persons, or between normal cells and cancerous cells, without altering the identity of the gene.

The term "without genetic meaning" means that the MBC has no effect on the vector or any of the genes contained in the vector. The MBC does not have any effect on the replication of the vector or the expression of any of the genes contained on the vector and in preferred embodiments, the MBC will not be transcribed.

The terms "protein," "polypeptide," and "peptide" are used herein interchangeably.

The term "derived from a patient" refers to genetic material that has been isolated or purified from a clinical sample such as blood or tissue obtained from a patient. The term "patient" refers to a living subject who has presented at a clinical setting with a particular symptom or symptoms suggesting the need for treatment with a therapeutic agent. The treatment may either be generally accepted in the medical community or it may be experimental. In preferred embodiments, the patient is a mammal, including animals such as dogs, cats, pigs, cows, sheep, goats, horses, rats, and mice. In further preferred embodiments, the patient is a human. A patient's diagnosis can alter during the course of disease progression, either spontaneously or during the course of a therapeutic regimen or treatment.

The term "expression vector" refers to a DNA construct that allows a researcher to place a gene encoding a gene product of interest, usually a protein, into a specific location in a vector in which the selected gene product can be expressed. One skilled in the art understands the term. The location where the selected gene is inserted commonly includes a promoter upstream of the site and a terminator region downstream. Commonly, the insertion site comprises recognition sites for restriction endonucleases to facilitate insertion of the gene of interest. The term "expression vector" also refers to such a DNA construct into which the gene of interest intended to produce the product (either RNA or protein molecule) of interest has already been inserted.

The term "vector" in this application refers to a DNA molecule designed for a function, usually expression or cloning, into which another DNA molecule of interest can be inserted by incorporation into the DNA of the vector. One skilled in the art is familiar with the term. Examples of classes of vectors are plasmids, cosmids, viruses, and bacteriophages. The term "plasmid vector" refers to a vector that is a plasmid. Typically, vectors are designed to accept a wide variety of inserted DNA molecules and then used to transmit the DNA of interest into a host cell (e.g., bacterium, yeast, insect tissue culture cell, higher eukaryotic cell). A vector may be chosen based on the size of the DNA molecule to be inserted, as well as based on the intended use. For transcription into RNA or transcription followed by translation to produce an encoded polypeptide, an expression vector is frequently chosen. For the preservation or identification of a specific DNA sequence (e.g., one DNA sequence in a cDNA library) or for producing a large number of copies of the specific DNA sequence, a cloning vector is frequently chosen. If the vector is a virus or bacteriophage, the term vector may include the membrane and/or protein coat surrounding the DNA. Following transfection of a cell, all or part of the vector DNA, including the inserted DNA, may be incorporated into the host cell chromosome, or the vector may be maintained in the host cell extrachromosomally.

The terms "insert" or "inserting" when used in reference to the manipulation of DNA molecules refer to covalently attaching a nucleic acid to another nucleic acid. Typically, this is accomplished by making a cut in the DNA backbone of a vector, usually by means of a restriction enzyme, and then adding the genetic material, which may comprise a gene or a molecular bar code, usually where the genetic material to be added has been prepared with ends which are compatible with the initial break in the vector, and finally ligating the selected genetic material into the vector. One skilled in the art is very familiar with such techniques, and such techniques are described in laboratory manuals such as Sambrook et al., ("Molecular Cloning: A Laboratory Manual", third edition, Cold Spring Harbor Laboratory, 2001) or Ausubel et al. ("Current Protocols in Molecular Biology", John Wiley & Sons, 1998) both of which are incorporated herein by reference in their entirety, including any drawings, figures or tables.

The term "molecular bar code" or "identification insert" refers to a DNA sequence comprising 10 to 100 nucleotides of a nucleotide sequence that is distinguishable from other oligonucleotides of the same length, and not otherwise known to be present in a vector to be labeled with a molecular bar code ("MBC"). In preferred embodiments, this sequence arises from a synthetic oligonucleotide that contains at least a series of randomly selected nucleotides. The molecular bar code may have a portion at the ends where the nucleic acid sequence specifies restriction endonuclease sites. In further preferred embodiments, the MBC is synthesized with defined ends to intentionally produce selected restriction enzyme sites. A different molecular bar code is associated uniquely with each nucleic acid of interest or each clone of genetic material following the incorporation of the gene or clone of genetic material in a vector. In preferred embodiments, the length of the molecular bar code is at least about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or 100 nucleotides in length. In preferred embodiments, the molecular bar code is neither expected to encode a peptide nor expected to be transcribed. In other embodiments, the molecular bar code may comprise an open reading frame, but the expectation is that a peptide would not be produced, in whole or in part, by transcription and translation from the molecular bar code. This is due to the lack of signals surrounding the molecular bar code insertion site capable of driving its expression, such as promoters, enhancers, polyadenylation sequences and the like. Thus, in many embodiments, the MBC is without function or phenotype (except for the presence of the additional DNA sequence) in the final plasmid; it is not transcribed, nor does it affect transcription or RNA processing, nor does it affect the normal functions of the plasmid, for example by interrupting sequences required for maintenance or replication of the plasmid.
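
The requirement that a molecular bar code be 10 to 100 nucleotides long and not otherwise present in the vector to be labeled can be checked computationally before a candidate is used. The following is a minimal sketch under stated assumptions: the vector is treated as a circular plasmid whose full sequence is known, both strands are searched, and the toy vector sequence and helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def occurs_in_circular(vector, query):
    """True if query occurs on either strand of a circular vector."""
    # Append a prefix so that matches spanning the origin are also found.
    wrapped = vector + vector[:len(query) - 1]
    return query in wrapped or reverse_complement(query) in wrapped

def barcode_is_usable(barcode, vector):
    """Length within 10 to 100 and sequence absent from the vector."""
    return 10 <= len(barcode) <= 100 and not occurs_in_circular(vector, barcode)

vector = "ATGCGTACGTTAGCCTAGGATCCGATTACAGGCT"  # toy sequence, not a real plasmid
print(barcode_is_usable("TTTTTTTTTTTT", vector))          # absent from the vector
print(barcode_is_usable(vector[5:17], vector))            # already present
print(barcode_is_usable(vector[-6:] + vector[:6], vector))  # spans the origin
```

A candidate that fails the check would simply be discarded and another drawn from the library.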

Other embodiments are envisioned, in which each molecular bar code is distinguishable from other molecular bar codes used in that series of vectors. In such other preferred embodiments, the MBC is located at a site within the vector where it is transcribed. One example of this would be where it is located in the untranslated sequence of an mRNA that is downstream from the stop codon. In this case, the MBC's presence may also be detected in the mRNA produced by transcription from the vector.

In other embodiments, the MBC may be designed to incorporate a binding site for a "molecular beacon" so as to provide a unique pairing of a gene of interest with a molecular beacon. The term "molecular beacon" refers to oligonucleotides such as those sold by Operon Technologies (Alameda, CA) and Synthetic Genetics (San Diego, CA). (See also, Tyagi and Kramer, Nat Biotechnol, 1996, 14:303-308; and Tyagi et al., Nat Biotechnol, 2000, 18:1191-96.)

Additionally, the instant invention is not restricted to MBCs that are recognized solely by means of their unique sequence; these other MBCs are called "non-sequence MBCs" (NS-MBCs or Size-MBCs). One example of a non-sequence MBC is a non-coding, non-translated sequence which is designed with a unique restriction enzyme recognition site incorporated at varying distances from one of the ends of the non-sequence-driven MBC so as to produce uniquely sized restriction fragments when paired with a restriction site already present in the vector. In embodiments, the non-sequence-driven MBC incorporates an internal restriction endonuclease site with a six-base, seven-base, or eight-base recognition sequence not otherwise present in the vector or present at only one other site. In these embodiments, digestion with the restriction endonuclease using the recognition site within the non-sequence-driven MBC produces a uniquely sized fragment to be associated with the plasmid to be identified. These are also known as NS-MBCs and are distinguished from MBCs by the presence of an internal restriction site not otherwise present in the vector (or present in only a few places).
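
The way a varying site position yields a distinct fragment size can be illustrated with a toy model. In the sketch below, the eight-base NotI recognition sequence (GCGGCCGC) stands in for the internal site and EcoRI (GAATTC) for the site already present in the vector; both choices, the homopolymer filler, and the helper names are assumptions for illustration. Fragment size is approximated as the distance between the starts of the two recognition sites, ignoring the exact cut positions within each site.

```python
NOTI = "GCGGCCGC"  # assumed internal eight-base site
ECORI = "GAATTC"   # assumed site already present in the vector

def ns_mbc(total_length, site, offset, filler="A"):
    """Fixed-length insert with `site` placed `offset` bases from its 5' end."""
    if offset < 0 or offset + len(site) > total_length:
        raise ValueError("site does not fit at this offset")
    tail = total_length - offset - len(site)
    return filler * offset + site + filler * tail

def fragment_length(construct, vector_site, mbc_site):
    """Approximate fragment size between the two recognition sites."""
    a = construct.find(vector_site)
    b = construct.find(mbc_site)
    if a < 0 or b < 0:
        raise ValueError("both sites must be present")
    return abs(b - a)

vector_arm = ECORI + "T" * 50  # toy stretch of vector upstream of the insert
sizes = [
    fragment_length(vector_arm + ns_mbc(40, NOTI, offset), ECORI, NOTI)
    for offset in (5, 15, 25)
]
print(sizes)
```

Each offset gives a different fragment length, so members of such a series could in principle be told apart on a gel after a double digest, without sequencing.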

In certain other preferred embodiments, the MBC may be distinguished by the size of the insert without using a defined sequence. Thus, non-sequence-driven MBCs of varying length may be used where vectors are identified by the size of the MBC when removed from the plasmid by a restriction digest.

Furthermore, an MBC may function by a combination of these methods. That is, a series of MBCs may be prepared by largely random sequence, along with a restriction enzyme site not otherwise present in the vector located at a varying distance in the interior of the MBC. Therefore the MBC can be identified by direct sequencing, hybridization or other molecular technique sufficient to identify the MBC.

The term "series of vectors" in this application refers to a series of DNA constructs made by incorporating a multiplicity of genes of interest into vectors so that one gene of interest is comprised in each vector. Prior to the insertion of the gene(s) of interest, the vectors are identical, except that each vector contains a unique molecular bar code at the MBC insertion site. Therefore, after the addition of the gene of interest and molecular bar code to a vector, each vector is a unique pairing of a gene of interest and an MBC incorporated into the same vector backbone. When this is done for a multiplicity of genes of interest and MBCs, this forms a series of vectors. In preferred embodiments, each gene of interest is obtained from a different patient. Therefore, a series of vectors is a multiplicity of vectors, with each member of this series comprising the same vector backbone and a unique pairing of a gene of interest derived from a patient and an MBC.
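The one-to-one pairing that defines a series of vectors can be sketched as a small registry. This is an illustration only: the class name `VectorSeries`, its methods, and the patient labels are hypothetical, and the two bar code sequences are borrowed from Table 2 purely as sample data. The lookup checks only the strand given to it, which is a simplification.

```python
from typing import Dict, Optional


class VectorSeries:
    """Registry enforcing a one-to-one pairing of a gene source and an MBC."""

    def __init__(self) -> None:
        self._mbc_by_source: Dict[str, str] = {}
        self._source_by_mbc: Dict[str, str] = {}

    def pair(self, source_id: str, mbc: str) -> None:
        """Record a pairing, rejecting any reuse of a source or of a bar code."""
        mbc = mbc.upper()
        if source_id in self._mbc_by_source:
            raise ValueError(f"{source_id} already has an MBC")
        if mbc in self._source_by_mbc:
            raise ValueError(f"MBC {mbc} is already in use")
        self._mbc_by_source[source_id] = mbc
        self._source_by_mbc[mbc] = source_id

    def identify(self, read: str) -> Optional[str]:
        """Return the source whose MBC occurs in a read, if exactly one does."""
        read = read.upper()
        hits = [src for mbc, src in self._source_by_mbc.items() if mbc in read]
        return hits[0] if len(hits) == 1 else None


series = VectorSeries()
series.pair("patient-A", "GCGAGTAGAAGCCAC")  # hypothetical labels
series.pair("patient-B", "GTTATGATCCACCTG")

# A made-up sequencing read spanning the bar code region of one vector.
read = "CTGCAGATA" + "GTTATGATCCACCTG" + "TCGTGAATAGCGCGC"
identified = series.identify(read)
```

Returning no match when zero or several bar codes are found is a deliberate choice in this sketch, so that an ambiguous read is never attributed to a patient.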

The terms "unique pairing" and "uniquely associating" refer to the intentional association of a specific and unique MBC with the genetic material cloned from a given patient. In other embodiments, the genetic material associated with the MBC is not obtained from a patient, but is rather obtained from a different source. The term "detecting" refers to any method of verifying the presence of a given molecular bar code with a given nucleic acid or plasmid. The techniques used to accomplish this may include, but are not limited to, PCR, sequencing, PCR sequencing, molecular beacon technology, hybridization, and hybridization followed by PCR. Examples of reagents which might be used for detection include, but are not limited to, radiolabeled probes, enzymatic labeled probes (horseradish peroxidase, alkaline phosphatase), and affinity labeled probes (biotin, avidin, or streptavidin).

The term "sample" refers to an aliquot of material, very frequently an aqueous solution or an aqueous suspension derived from biological material. Samples to be assayed for the presence of the molecular bar code of the present invention include, but are not limited to, cells, nucleic acids extracted from cells, or biological fluids such as blood, serum, plasma, or urine. The samples used in the above-described methods will vary based on the assay format, the detection method and the nature of the tissues, cells or extracts to be assayed. Methods for preparing nucleic acid extracts of cells are well known in the art and can be readily adapted in order to obtain a sample that is compatible with the method utilized.

The term "method of detection" refers to the means chosen to identify the MBC. In preferred embodiments, the method of detection is PCR sequencing or any other method of sequencing. However, other methods of detection are possible, including binding a molecular beacon or restriction enzyme digestion followed by electrophoresis and ethidium bromide staining. Similar techniques are well known in the art (e.g., see Sambrook et al., supra).

The term "T cell receptor variable region" refers to all or part of that portion of a T cell receptor molecule that does not belong to the constant region of the T cell receptor. The term "T cell receptor variable region" may also refer to the DNA sequence encoding the T cell receptor variable region. The term "TCR" or "T cell receptor" refers to a polypeptide found on the surface of T cells that comprises two polypeptide chains, an alpha chain and a beta chain. The term "TCR" or "T cell receptor" may also refer to nucleic acids encoding such polypeptide chains. Due to the normal development of the immune system, TCRs display considerable sequence diversity due to the operation of DNA rearrangements such as described in Bell et al. (Bell et al., 1995, T Cell Receptors, Oxford University Press, Oxford). The exact sequence of a given TCR cannot be predicted and must be determined by sequencing either the encoding nucleic acid or the protein of the TCR in question.

The term "pathology" refers to a state in an organism (e.g., a human) which is recognized as abnormal by members of the medical community. In preferred embodiments, this pathology is characterized by an abnormality in the function either of T cells or of B cells.

The term "immunoglobulin variable region" refers to all or part of that portion of an immunoglobulin molecule which does not belong to the constant region of the immunoglobulin. The term "immunoglobulin variable region" may also refer to the DNA sequence encoding the immunoglobulin variable region. Immunoglobulin types include IgG γ1, IgG γ2, IgG γ3, IgG γ4, IgA, IgA1, IgA2, IgM, IgD, IgE heavy chains, and κ or λ light chains or segments thereof. Any of these types of immunoglobulin variable region segments are included in the instant invention. In preferred embodiments, the immunoglobulin variable region is associated with a patient's pathology.

The term "molecular bar code insertion site" or "MBCIS" refers to a site in a vector where an inserted molecular bar code will not be expected to produce a protein, nor expected to augment or depress any of the functions of the vector into which it is incorporated, nor expected to augment or depress the expression of any of the genes incorporated into the vector. The location chosen for the molecular bar code insertion site should be without phenotype except for the presence of the MBC. The only expected function for the molecular bar code insertion site is to provide a location for the MBC. In other embodiments, the MBC may be transcribed, but the site is still selected so as to be silent with respect to the functioning or expression of any of the genes comprised in the vector, except for the presence of the additional nucleic acid bases.

The term "incorporated into the nucleic acid" for the purposes of this invention refers to ligating a second nucleic acid into the continuous nucleic acid backbone of a first nucleic acid.

The term "random series of nucleotides" refers to a sequence that is synthesized where the next nucleotide to be added to the chain is selected randomly from the set consisting of deoxyadenosine, deoxyguanosine, deoxycytidine, and deoxythymidine. Once the nucleic acid sequence has been synthesized, its sequence is determined by means well known in the art, and thus is no longer random.

The term "bracketed" refers to the practice of synthesizing a predetermined restriction enzyme site at each end of a synthesized oligonucleotide. Since oligonucleotides are synthesized as single strands, initially only one strand of the restriction enzyme site is synthesized. The restriction enzyme site may include additional base pairs on both sides to enable a restriction enzyme to readily recognize and cleave the site.

The term "restriction endonuclease target site" refers to not only the bases necessary to form the restriction enzyme site, such as GGATCC for Bam HI, but also two to ten flanking bases to counter the lack of activity some restriction enzymes display when cleaving sites near the end of a double stranded nucleotide. (See, for example, the New England Biolabs (Beverly, MA) catalog for 1998/1999, pg. 538.)

EXAMPLES

Example 1

FIGURE 1 outlines the general steps of the method to create and incorporate a molecular bar code into a vector. In general, a population of vector molecules is selected as a target. Two restriction endonuclease cutting sites are selected in the vector, R₁ and R₂. An identification insert is synthesized with the sequence 5' C C C R₁ (N)ₓ R₂ T T T 3' (SEQ ID NO:1), where R₁ and R₂ represent the sequences of two restriction endonuclease recognition sites, N is a random nucleotide, and X is the number of random nucleotides incorporated into the identification insert. An extension primer with the sequence 5' A A A R₂' (SEQ ID NO:2) is also synthesized, where R₂' is complementary to R₂. The two populations of oligonucleotides are mixed and incubated with the Klenow fragment or other polymerase to extend the extension primer, thus producing a double-stranded oligonucleotide. Alternatively, the second strand can be generated by PCR using oligonucleotides defined by the ends. This double-stranded oligonucleotide will serve as the identification insert.

The double-stranded oligonucleotide is next digested with restriction endonucleases that cut at R₁ and R₂. The cut identification insert is then ligated into the vector, which is then transformed into suitable host cells for propagation. The result of this procedure is a library of vectors, each containing its own identification insert.
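The construction in Example 1 can be mimicked in silico as a rough sketch. Example 1 does not name specific enzymes, so the two recognition sequences used here (CTGCAG for Pst I and GCGCGC for BssH II, the enzymes named in Example 2) are assumptions, as are the function names and the choice of 15 random positions. Primer extension is modeled simply as taking the reverse complement of the template; digestion and ligation are not modeled.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def make_template(r1: str, r2: str, x: int, rng: random.Random) -> str:
    """Build 5' CCC R1 (N)x R2 TTT 3' with x randomly chosen nucleotides."""
    core = "".join(rng.choice("ACGT") for _ in range(x))
    return "CCC" + r1 + core + r2 + "TTT"


def extension_primer(r2: str) -> str:
    """Build 5' AAA R2', where R2' is the complement of R2 read 5' to 3'."""
    return "AAA" + reverse_complement(r2)


def fill_in(template: str, primer: str) -> str:
    """Model primer extension: return the complete second strand, 5' to 3'."""
    second_strand = reverse_complement(template)
    if not second_strand.startswith(primer):
        raise ValueError("primer does not anneal to the 3' end of the template")
    return second_strand


R1, R2 = "CTGCAG", "GCGCGC"  # assumed sites, not specified in Example 1
rng = random.Random(1)       # seeded only so the illustration is repeatable

template = make_template(R1, R2, 15, rng)
primer = extension_primer(R2)
second_strand = fill_in(template, primer)
```

Because each template molecule carries its own random core, repeating `make_template` with an unseeded generator stands in for the population of oligonucleotides from which the library of vectors is drawn.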

Example 2

Sequencing

A 54 base pair oligonucleotide, termed BCv3.0, was synthesized. This oligonucleotide contained a 5' Pst I and a 3' BssH II endonuclease restriction site flanking a 15 base pair core region that has an equal 25% probability of containing an adenosine, cytidine, guanosine, or thymidine deoxynucleotide at each position. To assess the feasibility of deriving a library of unique oligonucleotides from BCv3.0, a PCR was performed and the resulting PCR fragments were ligated into a commercially available plasmid vector. Random bacterial colonies harboring the ligated vector were picked and the DNA sequence of the unique molecular bar code inserts contained within the plasmid vector was determined. The specific details of this test and the results obtained are described infra.

PCR Amplification of BCv3.0: One hundred nanograms of BCv3.0 were mixed with 500 ng of the primers Code 5' and Code 3', respectively, in 50 μl PCR buffer containing dNTPs and 2.5 units Expand Taq polymerase (Roche). The mixture was heated to 94°C for 2 minutes and then was subjected to 2 rounds of amplification in which annealing at 60°C and extension at 72°C were performed for 2 minutes. Twenty-five subsequent cycles of amplification were performed at similar temperatures, but each step was performed for 30 seconds. Denaturation of double-stranded molecules was performed at 94°C for 30 seconds at each cycle.
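For a sense of scale, the number of distinct 15-position cores and the chance that a handful of picked clones all carry different cores can be estimated as follows. This sketch assumes every core is equally likely and that clones are drawn independently; it ignores any synthesis or amplification bias, and the function names are illustrative only.

```python
from math import prod


def library_size(core_length: int, alphabet: int = 4) -> int:
    """Number of distinct cores when each position is one of `alphabet` bases."""
    return alphabet ** core_length


def prob_all_distinct(n_clones: int, n_codes: int) -> float:
    """Probability that n_clones independent uniform draws are all different."""
    return prod(1 - i / n_codes for i in range(n_clones))


n_codes = library_size(15)                 # 4**15 = 1,073,741,824
p_distinct = prob_all_distinct(10, n_codes)
p_any_duplicate = 1 - p_distinct           # close to 45 / n_codes for 10 clones
```

Under these idealized assumptions, a duplicate among ten sequenced clones would be very unlikely, so a repeated bar code in a small sample would more plausibly point to bias or contamination than to chance.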

Table 1: Primer sequences used for generating molecular bar codes

BCv3.0 (SEQ ID NO:12):

5' CTCCATGCTGCAGATA (15x N) TCGTGAATAGCGCGCAAGAAAAT 3'

Restriction sites: Pst I (CTGCAG, 5' flank) and BssH II (GCGCGC, 3' flank)

Code 5' (SEQ ID NO:13): 5' CTCCATGCTGCAGATA 3'

Code 3' (SEQ ID NO:14): 5' ATTTTCTTGCGCGCTAT 3'
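The relationship between the Table 1 sequences can be checked with a few lines of code. The sequences are copied from Table 1; the recognition sequences CTGCAG (Pst I) and GCGCGC (BssH II) are the standard ones for those enzymes rather than something stated in the text, and the variable names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Fixed flanks of BCv3.0 on either side of the 15 random positions (Table 1).
FLANK_5 = "CTCCATGCTGCAGATA"
FLANK_3 = "TCGTGAATAGCGCGCAAGAAAAT"
CORE_LENGTH = 15

CODE_5 = "CTCCATGCTGCAGATA"   # SEQ ID NO:13
CODE_3 = "ATTTTCTTGCGCGCTAT"  # SEQ ID NO:14

PST_I = "CTGCAG"    # standard Pst I recognition sequence
BSSH_II = "GCGCGC"  # standard BssH II recognition sequence

total_length = len(FLANK_5) + CORE_LENGTH + len(FLANK_3)

# Code 5' should be identical to the 5' end of the template strand.
code5_matches = FLANK_5.startswith(CODE_5)

# Code 3' should be the reverse complement of the 3' end of the template.
code3_anneals = FLANK_3.endswith(reverse_complement(CODE_3))

# Each flank should carry its restriction site.
sites_present = PST_I in FLANK_5 and BSSH_II in FLANK_3
```

All three checks hold for the sequences as printed, and the flank lengths plus the 15-position core add up to the stated 54 bases.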

Sequence Determination of Double Stranded BCv3.0 PCR Products Ligated into the pCR4 Plasmid Vector: PCR products from the reaction were cloned directly into plasmid pCR4-TOPO as per manufacturer's recommendations, and introduced into TOP10 competent E. coli cells (Invitrogen). Twenty-four miniprep DNA plasmids were prepared from carbenicillin resistant bacterial colonies using a QIAGEN BioRobot 8000. Nineteen of the 24 colonies contained a molecular bar code insert as determined via digestion with Pst I. Of these, 10 were subjected to DNA sequence analysis using the Cy5/Cy5.5 Dye Primer Cycle Sequencing Kit (Visible Genetics). Following the completion of the sequencing reactions, samples were electrophoresed on the OpenGene Automated DNA Sequencing System and the data was processed with the GeneObjects software package (Visible Genetics). Additional analyses including sequence alignments were performed using the SEQUENCHER Version 4.1.2 DNA analysis software (Gene Codes Corp.). Of the 10 plasmids sequenced, 9 had easily discernible, unique molecular bar code sequences as shown below.

Table 2: Molecular Bar Codes of Example 2

SEQ ID NO:3: GCGAGTAGAAGCCAC
SEQ ID NO:4: GTTATGATCCACCTG
SEQ ID NO:5: CGCTTCCTAGAGTC
SEQ ID NO:6: AGACAAGTTACGTAA
SEQ ID NO:7: TCTGCCAACAGCCT
SEQ ID NO:8: TAACTGTTCACAAT
SEQ ID NO:9: CCTATACCACGGGAA
SEQ ID NO:10: CATCGACGACACA
SEQ ID NO:11: TAATAGCGTGCCGTA
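A simple consistency check on the Table 2 sequences is sketched below. The sequences are copied as printed in Table 2; the variable names are illustrative. Note that, as printed, not every entry is 15 bases long; the text does not say whether this reflects the clones themselves or the transcription of the table, so the sketch only reports the lengths.

```python
from collections import Counter

# Bar codes keyed by SEQ ID NO, as printed in Table 2.
BAR_CODES = {
    3: "GCGAGTAGAAGCCAC",
    4: "GTTATGATCCACCTG",
    5: "CGCTTCCTAGAGTC",
    6: "AGACAAGTTACGTAA",
    7: "TCTGCCAACAGCCT",
    8: "TAACTGTTCACAAT",
    9: "CCTATACCACGGGAA",
    10: "CATCGACGACACA",
    11: "TAATAGCGTGCCGTA",
}

# Every bar code must differ from every other for the library to be usable.
all_unique = len(set(BAR_CODES.values())) == len(BAR_CODES)

# Only the four deoxynucleotide letters should appear.
valid_alphabet = all(set(seq) <= set("ACGT") for seq in BAR_CODES.values())

# Tally how many printed entries have each length.
length_counts = Counter(len(seq) for seq in BAR_CODES.values())
```

The nine printed sequences are mutually distinct, which is the property the example sets out to demonstrate.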

Having now fully described this invention, it will be appreciated by those skilled in the art that the same can be performed within a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation.

While this invention has been described in connection with specific embodiments thereof, it will be understood that it is capable of further modifications. This application is intended to cover any variations, uses, or adaptations of the inventions following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains and as may be applied to the essential features hereinbefore set forth as follows in the scope of the appended claims.

Claims

CLAIM 1. A method of tracking genetic material comprising: attaching a molecular bar code to said genetic material thereby uniquely associating said molecular bar code with said genetic material; detecting the presence of said molecular bar code in a sample and thereby tracking genetic material.

CLAIM 2. The method of claim 1 wherein said genetic material is a gene.

CLAIM 3. The method of claim 1 wherein said genetic material is derived from a patient.

CLAIM 4. The method of claim 1 wherein the method of detection is sequencing.

CLAIM 5. The method of claim 1 wherein the genetic material comprises a T cell receptor variable region associated with a pathology.

CLAIM 6. The method of claim 1 wherein the genetic material comprises an immunoglobulin variable region associated with a pathology.

CLAIM 7. A method of tracking a vector comprising a gene of interest, said method comprising:

inserting a molecular bar code into said vector that contains said gene of interest, thereby uniquely associating said molecular bar code with said gene of interest in said vector; detecting the presence of said molecular bar code in a sample and thereby tracking said gene of interest and said vector to said sample.

CLAIM 8. The method of claim 7 wherein said gene of interest is derived from a patient.

CLAIM 9. The method of claim 7 wherein the method of detection is sequencing.

CLAIM 10. The method of claim 7 wherein said gene of interest comprises a T cell receptor variable region associated with a pathology.

CLAIM 11. The method of claim 7 wherein said gene of interest comprises an immunoglobulin variable region associated with a pathology.

CLAIM 12. A method of tracking genetic material derived from a patient in a vector comprising: inserting a molecular bar code into a molecular bar code insertion site in said vector that contains genetic material from said patient; thereby uniquely associating said molecular bar code with said genetic material from said patient; detecting the presence of said molecular bar code in a sample and thereby tracking genetic material from said patient to said sample.

CLAIM 13. The method of claim 12 wherein said genetic material is a gene.

CLAIM 14. The method of claim 12 wherein the method of detection is sequencing.

CLAIM 15. The method of claim 12 wherein the genetic material comprises a T cell receptor variable region associated with a pathology.

CLAIM 16. The method of claim 12 wherein the genetic material comprises an immunoglobulin variable region associated with a pathology.

CLAIM 17. A method of distinguishing a unique pairing of a gene of interest with a vector from a multiplicity of vectors comprising genes, said method comprising: inserting a molecular bar code into a molecular bar code insertion site in said vector that comprises said gene of interest; thereby uniquely associating said molecular bar code with said gene of interest in said vector; detecting the presence of said molecular bar code in a sample and thereby tracking said gene of interest and said vector to said sample.

CLAIM 18. The method of claim 17 wherein said gene of interest is derived from a patient.

CLAIM 19. The method of claim 17 wherein the method of detection is sequencing.

CLAIM 20. The method of claim 17 wherein the genetic material comprises a T cell receptor variable region associated with a pathology.

CLAIM 21. The method of claim 17 wherein the genetic material comprises an immunoglobulin variable region associated with a pathology.

CLAIM 22. A method for constructing a nucleic acid containing a molecular bar code, comprising: providing a nucleic acid; and providing a molecular bar code, wherein said molecular bar code is attached to the nucleic acid.

CLAIM 23. The method of Claim 22, wherein the nucleic acid is a vector.

CLAIM 24. The method of Claim 22, wherein the molecular bar code is synthesized together with said nucleic acid.

CLAIM 25. The method of Claim 22, wherein the molecular bar code is synthesized independently from said nucleic acid.

CLAIM 26. A nucleic acid comprising a molecular bar code, wherein said molecular bar code consists essentially of a random series of nucleotides.

CLAIM 27. The nucleic acid of claim 26 wherein said series of nucleotides is from 10 to 100 nucleotides.

CLAIM 28. A method of tracking genetic material derived from a patient in a vector comprising: inserting a means for tracking said vector that comprises genetic material from said patient, and detecting the presence of said means in a sample and thereby tracking genetic material from said patient to said sample.

CLAIM 29. A means for specifically identifying a vector comprising genetic information from a patient comprising: inserting a means for identifying said vector that comprises genetic material from said patient, and detecting the presence of said means in a sample and thereby identifying said vector containing genetic material from said patient in said sample.

CLAIM 30: A vector comprising a gene of interest and a molecular bar code, the vector prepared by a process comprising the steps of:

(i) preparing a vector comprising the gene of interest to accept the molecular bar code by digesting the vector with appropriate restriction endonucleases;

(ii) preparing a molecular bar code by synthesizing an oligonucleotide chain comprising the steps of (a) synthesizing one strand of a restriction endonuclease target site, (b) randomly synthesizing from 10 to 100 nucleotides, and (c) synthesizing one strand of a restriction endonuclease target site at the other end of the oligonucleotide, (d) preparing a complementary strand and annealing it to said synthesized oligonucleotide, (e) preparing the double-stranded oligonucleotide by digesting it with the appropriate restriction endonucleases; and

(iii) ligating said vector with said molecular bar code.

CLAIM 31. A kit for preparing a vector comprising a molecular bar code comprising a series of containers each comprising DNA molecules comprising a unique molecular bar code bracketed by a predetermined restriction endonuclease cleavage site.

CLAIM 32: A nucleic acid comprising a molecular bar code incorporated into a molecular bar code insertion site, wherein said molecular bar code comprises some randomly synthesized series of nucleotides.

CLAIM 33: A nucleic acid comprising a molecular bar code, wherein said molecular bar code comprises at least some randomly synthesized series of nucleotides.

CLAIM 34: A vector comprising a gene of interest and a molecular bar code, the vector prepared by a process comprising the steps of:

(i) preparing a vector comprising the gene of interest to accept the molecular bar code by digesting the vector with appropriate restriction endonucleases;

(ii) preparing a molecular bar code by synthesizing an oligonucleotide chain comprising the steps of (a) synthesizing one strand of a restriction endonuclease target site, (b) randomly synthesizing from 10 to 100 nucleotides, and (c) synthesizing one strand of a restriction endonuclease target site at the other end of the oligonucleotide, (d) preparing a complementary strand by the use of primers complementary to the first strand and a polymerase, (e) preparing the double-stranded oligonucleotide by digesting it with the appropriate restriction endonucleases; and

(iii) ligating said vector with said molecular bar code.