WO2000031110A1 - Identification of disease predictive nucleic acids - Google Patents

Identification of disease predictive nucleic acids

Publication number
WO2000031110A1
Authority
WO
WIPO (PCT)
Prior art keywords
poly
site
gene
nucleic acid
disease state
Prior art date
Application number
PCT/US1999/027710
Other languages
French (fr)
Inventor
David J. Ecker
Original Assignee
Isis Pharmaceuticals, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/200,355 (US6451524B1)
Application filed by Isis Pharmaceuticals, Inc.
Priority to AU17426/00A (AU1742600A)
Publication of WO2000031110A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/11: DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • C12N15/113: Non-coding nucleic acids modulating the expression of genes, e.g. antisense oligonucleotides; Antisense DNA or RNA; Triplex-forming oligonucleotides; Catalytic nucleic acids, e.g. ribozymes; Nucleic acids used in co-suppression or gene silencing
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6809: Methods for determination or identification of nucleic acids involving differential detection
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61K: PREPARATIONS FOR MEDICAL, DENTAL OR TOILETRY PURPOSES
    • A61K38/00: Medicinal preparations containing peptides
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N2310/00: Structure or type of the nucleic acid
    • C12N2310/30: Chemical structure
    • C12N2310/35: Nature of the modification
    • C12N2310/351: Conjugate

Definitions

  • the present invention is directed to methods of identifying target nucleic acid sequences which are predictive of disease states or biological conditions in cells containing the nucleic acid sequence.
  • RNA molecules participate in or control many of the events required to express proteins in cells. Rather than function as simple intermediaries, RNA molecules actively regulate their own transcription from DNA, splice and edit mRNA molecules and tRNA molecules, synthesize peptide bonds in the ribosome, catalyze the migration of nascent proteins to the cell membrane, and provide fine control over the rate of translation of messages. RNA molecules can adopt a variety of unique structural motifs, which provide the framework required to perform these functions. The many functions of RNA molecules have also solidified their importance as therapeutic drug and diagnostic targets. Indeed, many investigators are pursuing mRNA transcripts and proteins produced therefrom that are expressed at different levels in cancer vs.
  • the present invention provides the means to identify distinguishing features of types of cancer coupled with a common molecular mechanism to diagnose and selectively destroy the cancer cells or other cells associated with a disease state or biological condition. It is a principal object of the invention to identify a target nucleic acid sequence which is predictive of a disease state or biological condition in cells containing the nucleic acid sequence.
  • the present invention is directed to methods of identifying target nucleic acid sequences which are predictive of preselected disease states or biological conditions in cells containing the nucleic acid sequence.
  • Members of a set of mRNA molecules from a common gene, but containing different sequences and structures, are compared.
  • the gene is predictive of the disease state or biological condition in cells containing the gene.
  • At least one molecular interaction site from among those present in the members of the set is identified.
  • the molecular interaction site is present in cells likely to have the disease state or biological condition.
  • At least one nucleic acid sequence from the molecular interaction site is ascertained.
  • Figure 1 illustrates an example of a 3'-EST cluster.
  • Figure 2 illustrates how alternative initiation of the mdm2 gene in cancer and normal cells results in unique RNA structures.
  • Figure 3 illustrates HER2 alternative transcript forms.
  • the present invention is directed to methods of identifying target nucleic acid sequences which are predictive of preselected disease states or biological conditions, especially cancer, in cells containing the nucleic acid sequence.
  • Members of a set of mRNA molecules from a common gene, but containing different sequences and structures, are compared.
  • the members of the set of mRNA molecules from a common gene, but containing different sequences and structures, are referred to as "alternative transcript forms."
  • Comparison of the alternative transcript forms provides for the identification of at least one alternative transcript form which is associated with the disease state or biological condition.
  • the alternative transcript form, or the protein which is encoded by the same, is not required to be directly involved in any pathogenesis pathway.
  • the alternative transcript form may be merely a marker for the disease state or biological condition without participating or being required for establishment or maintenance of the disease state or biological condition.
  • the alternative transcript form or protein encoded thereby may be a by-product of the pathogenesis involved in the disease state or biological condition.
  • Additional alternative transcript forms from a variety of genes can be analyzed to determine whether they comprise the same molecular interaction site. In this manner, additional mRNA molecules can be identified which may also be predictive of a disease state.
  • Alternative transcript forms originate from alternative initiation of transcription, alternative splicing, alternative 3'-end processing or a combination of these mechanisms.
  • Alternative 3'-end processing may be the greatest source of alternative transcript forms. Studying 160,000 EST sequences, D. Gautheret and collaborators have shown that 20-40% of the transcripts have two or more different 3'-ends (see Fig. 1). Gautheret, Genome Res., 1998, 8, 524-530, which is incorporated herein by reference in its entirety.
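The kind of tally described above (what fraction of transcripts show two or more distinct 3'-ends) can be sketched in a few lines. This is only an illustrative sketch, not the method used by Gautheret and collaborators: the input format (pairs of a cluster identifier and a 3'-end coordinate), the function name, and the `tolerance` used to merge nearby ends into one site are all assumptions made for the example.

```python
from collections import defaultdict


def fraction_with_multiple_3prime_ends(est_records, tolerance=10):
    """Estimate how many EST clusters show two or more distinct 3'-ends.

    est_records: iterable of (cluster_id, end_position) pairs (hypothetical format).
    tolerance: sorted ends that lie within this many nucleotides of the
               previous end are counted as the same site (an assumption).
    Returns (fraction_of_clusters_with_2_or_more_sites, sites_per_cluster).
    """
    ends_by_cluster = defaultdict(list)
    for cluster_id, end_pos in est_records:
        ends_by_cluster[cluster_id].append(end_pos)

    def count_sites(positions):
        sites = 0
        last = None
        for pos in sorted(positions):
            # A new site starts when the gap to the previous end exceeds tolerance.
            if last is None or pos - last > tolerance:
                sites += 1
            last = pos
        return sites

    site_counts = {c: count_sites(p) for c, p in ends_by_cluster.items()}
    if not site_counts:
        return 0.0, site_counts
    multi = sum(1 for n in site_counts.values() if n >= 2)
    return multi / len(site_counts), site_counts
```

With toy records such as `[("g1", 100), ("g1", 105), ("g1", 400), ("g2", 250)]`, cluster `g1` is counted as having two sites and `g2` as having one.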
  • mRNAs are alternatively 3'-end processed in a tissue-specific or developmentally-specific pattern (Edwalds-Gilbert, Nucl. Acids Res., 1997, 25, 2547-2561, which is incorporated herein by reference in its entirety) and in some cases this has been correlated with cancer.
  • the mss4 transcript was recently shown to have alternative 3'-end processing in pancreatic cancer. Muller-Pillasch, Genomics, 1997, 46, 389-396, which is incorporated herein by reference in its entirety.
  • Alternative 3'-end formation does not change the protein composition, but can dramatically influence message stability and regulate translation by including or excluding regulatory sequences in the mRNA transcript.
  • Alternative transcript forms, despite being transcribed from identical DNA and translated into identical proteins, possess unique sequences and three-dimensional shapes that exist only at the level of RNA.
  • a very important consequence of alternative transcript forms for cancer recognition is the unique shapes that they adopt. In contrast to the regular, helical nature of DNA, RNA strands form intricate stems, loops, and bulges, which are arranged into three-dimensional shapes that rival proteins in their complexity.
  • Alternative transcript forms can produce different shapes in several ways. First, with alternative transcription initiation or 3'-end formation, there are unique sequences in mRNAs that do not appear at all in the normal mRNA (see Fig. 2). These sequences, in turn, will fold into unique structures within themselves and with the adjacent RNA. Second, each alternative splicing event can produce a unique junction, in which the adjacent RNA on each side of the junction will re-arrange into a new three-dimensional shape.
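One simple way to locate the sequence that is unique to an alternative transcript form, such as a novel splice junction, is to look for k-mers that occur in the variant but not in the reference form. The sketch below is an illustration under stated assumptions, not the procedure of the invention: the function names, the choice of `k`, and the toy exon sequences in the usage note are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_segments(variant, reference, k=8):
    """Find stretches of `variant` covered by k-mers that are absent from `reference`.

    Returns a list of (start, end, subsequence) tuples with `end` exclusive.
    A skipped exon, for example, shows up as a short segment spanning the
    new exon-exon junction.
    """
    ref_kmers = kmers(reference, k)
    novel = [variant[i:i + k] not in ref_kmers
             for i in range(len(variant) - k + 1)]
    segments = []
    start = None
    for i, is_novel in enumerate(novel):
        if is_novel and start is None:
            start = i
        elif not is_novel and start is not None:
            # The last novel k-mer began at i - 1 and ends at i - 1 + k.
            segments.append((start, i - 1 + k))
            start = None
    if start is not None:
        segments.append((start, len(variant)))
    return [(s, e, variant[s:e]) for s, e in segments]
```

For a toy gene with exons `ATGGCCATTG`, `TTTTCCCCGG` and `GACTGACTGA`, comparing the exon-skipped form (exons 1 and 3) against the full-length form with `k=4` reports a single segment straddling the new junction. Real transcripts would need a larger `k` to avoid chance matches.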
  • Alternative transcript forms are distinguishable from cancer-specific expression of transcripts. Cancer-specific transcripts provide a perfectly useful set of molecular targets for the invention described herein. However, the greater opportunity to find useful cancer signatures is in alternative transcript forms, since there may be 2-20 different forms of every transcript and 10,000-20,000 genes expressed in any given cell. Thus, the opportunity to find cancer-specific alternative transcript forms may be much greater than for cancer-specific transcripts. Whether the cancer signature originates from cancer-specific transcripts or cancer-specific transcript forms, the technology of the present invention does not require that the cancer-specific differences in mRNA be responsible for the cancer phenotype. It is not even important that it is known what they do. The important point is that they are present in cancer cells and can therefore be used to mark them for destruction. Accordingly, the present invention is directed, in part, to identifying at least one molecular interaction site within at least one alternative transcript form which is associated with the disease state or biological condition.
  • genes listed in Table 2 contain composite exons in which 5' splice sites can sometimes be silent, causing them to behave as 3'-terminal exons, or sometimes be active, thereby causing them to behave as internal exons, depending on the tissues in which the gene is expressed; these we call composite, in/terminal exons.
  • Genes like the immunoglobulin heavy chains have an exon serving either as the first 3'-terminal exon in one mRNA (use of pA1) or as an internal exon in a second mRNA which ends with a normal 3'-terminal exon found further downstream (use of pA2).
  • the distance between the poly(A) sites in these two classes of genes can be quite large (>3 kb in Ig genes) and differential sites of transcription termination, between the poly(A) sites, could change the distribution of 3'-end use in mRNA.
  • Levels of basal polyadenylation factors, splicing factors and termination factors could all contribute to cell type-specific mechanisms leading to 3'-end formation.
  • Transplantation conserved [including the poly(A) sites] antigen across mammalian species; two poly(A) sites Mouse; three poly(A) sites, two mRNAs α-Galactosidase A Muscle, brain Mouse, human; two poly(A) sites; second Acetylcholinesterase site predominates in muscle, first site predominates in brain
  • ADP ribosylation 3 Testes ARF 1 has two poly(A) sites conserved in factor (ARF) human and rat. ARF 4 makes a short, testes-specific mRNA generated by alternative polyadenylation
  • Aldolase B Mouse one non-canonical and three canonical poly(A) sites, use of all four sites detectable in liver and kidney
  • Amphiglycan (syndecan 4, ryudocan) Chondrocytes At least two poly(A) signals; longer message is ubiquitous, shorter is tissue-specific; switch in poly(A) site use during chondrocyte differentiation Amyloid protein 2, 4 Sequence between two poly(A) sites increases translation of the longer mRNA
  • Androgen receptor Human two poly(A) sites, the first is AUUAAA and the second is CAUAAA
  • Angiotensin Testes Rabbit; testes- versus pulmonary-specific converting enzyme pulmonary forms (ACE) Ankyrin-1 tissue Ankyrin-1 Brain Mouse; both poly(A) sites used in erythroid tissues, distal site used in cerebellum
  • Axonin-1 Retina brain Chicken
  • three poly(A) sites ⁇ -Tubulin HSV infection Changes in the ratio of the two forms occur during HSV infection β2-Microglobulin Murine, two poly(A) sites β3-Adrenergic Human, rat, two poly(A) sites receptor
  • Band 7.2b gene Many cell types Human, integral membrane phosphoprotein, distal site predominates in all tissues, proximal site use is significant in lung, liver and kidney and minimal in spleen
  • Brain-derived neurotrophic factor (BDNF) Heart, lung, brain Rat; isoform production controlled by alternative splicing, multiple promoters used, two poly(A) sites, ratio of proximal to distal site use varies among heart, lung, cerebral cortex
  • Cationic amino acid 1 Cell density Rat relative concentration of two mRNAs is transporter gene regulated by cell density (cat-1) c-Mos 3 Porcine, protooncogene whose expression is restricted to gonadal tissues in the pig, alternative polyadenylation may play a role in translation
  • CD59 membrane Many cell types Human, complement regulator protein, four inhibitor of reactive possible poly(A) sites, use of two poly(A) lysis) sites varies in different cell lines
  • Chymotrypsin-like Human chromosome 16q22.1 alternative protease polyadenylation creates transcription unit which overlaps with oppositely oriented gene
  • Cytochrome P450 many cell types Human; two poly(A) sites, [second poly(A) aromatase signal AUUAAA]; mouse, porcine, equine; two poly(A) sites, 2.5 kb mRNA predominant in ovaries Cytochrome P450- Mouse; two poly(A) sites linked ferredoxin
  • DHFR promoter-proximal site reductase
  • Dipeptidyl peptidase Mouse Two poly(A) sites in exon 26, IV (CD26) proximal poly(A) site predominates in all tissues examined
  • a 1.5 kb transcript is abundant in mouse thymus and in S194 cells. Minor mRNAs of 2.2 and 2.5 kb correspond to use of alternate poly(A) signals as well eIF-5 translation Testes Mammalian; proximal poly(A) site used initiation factor 5) predominantly in testes, distal site favored in other tissues examined
  • Fibroglycan Human at least two functional poly(A) (syndecan 2) signals
  • FMR1 Fragile X gene Two poly(A) sites
  • G protein ⁇ subunit Many cell types Drosophila; use of three different poly(A) (D-G ⁇ l) sites is developmentally regulated and cell type specific: 2.6 kb transcript found in head, 1.3 kb transcript found in body, 1.1 kb transcript more abundant in head than in body
  • Grg Murine related to the groucho transcript of the Drosophila Enhancer of split complex
  • Herpes simplex virus Increased polyadenylation at weak viral sites type 1 (HSV-1) via effects on host cell CstF 64-kDa UL24
  • High mobility group Murine three poly(A) sites 1 protein (HMG1)
  • Histone H1° Butyrate Mouse differentiation-specific histone H1; treatment two mRNAs, first poly(A) signal is AUUAAA; minor 0.9 kb mRNA becomes more stable during butyrate-induced dedifferentiation, mRNAs equally stable after treatment with actinomycin D
  • Interleukin-8 Human two mRNAs equally abundant in receptor ⁇ neutrophils Iron regulatory Intracellular iron Human, rat; RNA binding protein whose protein 2 (IRP2) levels affinity for its binding sites is modulated by intracellular iron levels; increase in proximal poly(A) site use with reciprocal decrease in distal poly(A) site use in iron-depleted cells
  • Ketohexokinase Human two poly(A) sites; second is (fructokinase) GAUAAA
  • Lipoprotein lipase Many cell types Human; longer transcript predominates in skeletal and cardiac muscle; adipose tissue produces both forms of mRNA; longer transcript translated more efficiently than short one
  • Microtubule- Many cell types Mouse 3'-UTR well conserved between associated protein 4 mouse and human; first two sites used in all (MAP4) tissues tested; third site used in muscle; fourth site used in testes, but first site predominates
  • Mitochondrial Rat two poly(A) sites: AUUAAA and HMG-CoA synthase AUUAUC
  • N-Formyl peptide Dibutyryl cAMP Human two-exon gene, at least two poly(A) receptor (FMLF-R) treatment sites; predominant use of proximal poly(A) site after treatment of HL60 human lymphoma cells with the differentiation agent dibutyryl cAMP NAD(P)H:quinone Mitomycin C Human colon cancer HCT 116 cells; two oxidoreductase treatment mRNAs; change in ratio after mitomycin C treatment
  • Non-muscle myosin Human two poly(A) sites heavy chain mal-a Mouse; novel keratinocyte lipid-binding protein; tumor specific over expression; two poly(A) sites, use of first one predominates P-selectin Human, chromosome 12q24; major mRNA glycoprotein ligand species 2.5 kb, minor species 4 kb
  • PR264/SC35 Many cell types Human splicing factor; ratio of different forms varies among six different cell lines tested rab2 Many cell types Human Ras-related GTP binding protein; three potential poly(A) signals
  • GTPase Ran shows testes-specific polyadenylation
  • Rat Renal glutaminase Many cell types Rat; ratio of poly(A) site use varies in different cell lines RHOA Human Ras-related GTP binding protein; protooncogene found in breast cancer cell lines; three poly(A) sites
  • Senescence marker Rat two poly(A) sites protein-30 (SMP-30) set, putative Many cell types Human, mouse; ratio of two mRNAs varies oncogene associated in different cell types and five cell lines with myeloid tested; shorter mRNA predominates in liver leukemogenesis and kidney
  • Soluble angiotensin 3 Porcine two poly(A) sites, first is binding protein GAUAAA; longer transcript may be regulated by SINE element in 3'-UTR Splicing factor 9G8 Many cell types Human; two poly(A) sites; pre-mRNA also subjected to alternative splicing
  • Syndecan-1 Mouse Two poly(A) sites Tissue inhibitor of Human; two stable transcripts metalloproteinases-2 (TIMP-2)
  • Tissue inhibitor of TPA treatment Murine; three transcripts of 2.3, 2.8 and 4.6 metalloproteinases-3 kb. 4.6 kb most abundant. All three (TIMP-3) transcripts induced in pre-neoplastic JB6 cells treated with TPA Transforming Human; five possible poly(A) sites but only growth factor alpha two mRNAs detected; use of distal poly(A) (TGFα) site (AAUGAAA) predominates in most tissues
  • Triose phosphate Testes Rat 1.4 kb mRNA found in most tissues and isomerase in somatic cells of testes; its level increases after retinol treatment; the 1.5 kb species is detected only in haploid spermatids
  • transcription unit undergoes polycistronic pre-trans-splicing and alternative mRNA polyadenylation, which may be coupled in this system
  • VEGF Vascular endothelial Hypoxia Rat; two poly(A) sites; regulation of poly(A) growth factor site use by hypoxia (VEGF)
  • ZAKI-4 Many cell types Human thyroid hormone-responsive gene; two mRNAs, first poly(A) signal is AUUAAA; short mRNA predominates in heart and brain, trace amounts found in liver; long mRNA predominates in skeletal muscle; no messages detected in placenta, lung, kidney, pancreas
  • C3b/C4b receptor Use of proximal poly(A) site yields secreted form of receptor; (complement receptor type predominant membrane-bound receptor is generated by use of distal 1) poly(A) site
  • Cek5 Chicken receptor protein-tyrosine kinase of the Eph subfamily; use of the proximal poly(A) site yields secreted form of kinase, whose expression is low relative to the full-length Cek5 receptor
  • Epidermal growth factor Proximal poly(A) site leads to production of secreted form of receptor, (EGF) receptor; human, which can inhibit the activities of the membrane-bound receptor chicken exuperantia (exu) Drosophila gene required for both oogenesis and spermatogenesis that undergoes sex-specific alternative pre-mRNA processing; tra-2 gene required for male specific RNA processing
  • Fibrinogen ⁇ -chain Rat pre-mRNA undergoes liver-specific choice of proximal poly(A) site; other cell types always use distal poly(A) site
  • Fibroblast growth factor Secreted form of receptor generated by use of the proximal poly(A) site; (FGF) receptor membrane-bound forms are produced by use of distal poly(A) site; secreted form also binds FGF
  • Glucocorticoid receptor ⁇ form of receptor produced by use of the proximal poly(A) site; more abundant ⁇ form uses the distal poly(A) site
  • HER2/neu receptor Protein tyrosine kinase receptor in which membrane-bound form is produced from mRNA using the distal poly(A) site; use of proximal poly(A) site leads to shorter, intracellular form of the receptor; use of the proximal and distal poly(A) sites varies greatly in different tumor cell lines
  • Ig ⁇ heavy chain Pattern of regulation similar to Ig ⁇ heavy chain pre-mRNAs Ig ⁇ heavy chain Pattern of regulation similar to Ig ⁇ heavy chain pre-mRNAs Ig ⁇ heavy chain Pattern of regulation similar to Ig ⁇ and to other Ig heavy chain pre- mRNAs; can also include transcription termination as a mechanism of proximal poly(A) site selection
  • Leukemia inhibitory factor Member of hemopoietin receptor family murine gene produces a receptor ⁇ -chain secreted [proximal poly(A) site] and membrane-bound form [distal poly(A) site], with increase in the secreted form during pregnancy
  • Nuclear factor I-B3 Distal poly(A) site favored in all tissues examined, proximal poly(A) site used in heart and skeletal muscle; protein encoded by the shorter mRNA acts as a transcriptional repressor
  • Plasma membrane Ca2+- Use of proximal poly(A) site specific to skeletal muscle and brain ATPase isoform 3 Poly(A) polymerase Component of polyadenylation complex; six isoforms generated via alternative splicing and polyadenylation; some isoforms found in all tissues examined, others show tissue-specific expression; use of one of three proximal poly(A) sites yields forms that contain the polymerase domain but not the serine/threonine-rich domain and nuclear localization signal
  • Sarco/endoplasmic reticulum Five protein isoforms are generated from three different SERCA genes Ca2+-ATPase (SERCA) plus alternative processing events; regulation of expression is both developmental and tissue specific and is suggested to be at the level of splicing rather than polyadenylation; two SERCA2 protein isoforms are translated from four different mRNAs generated by tissue-dependent alternative processing, one of which is brain specific; SERCA2a protein is muscle specific, SERCA2b is found in non-muscle tissues and smooth muscle
  • Secretory PLA2 receptor Receptor has similar structural organization to macrophage mannose receptor; acts as a mediator of inflammatory processes; secreted form of phospholipase A2 receptor found in human kidney; membrane bound receptor is widely expressed, including in kidney
  • Thyroid hormone receptor Proximal poly(A) site yields α1, which binds thyroid hormone; distal (c-erbA-1) poly(A) site produces α2, which cannot bind thyroid hormone; ratio of two mRNAs varies in different tissue; α2 transcript overlaps with gene transcribed in opposite direction, Rev-ErbAα ⁇ -Tropomyosin At least four poly(A) sites; proximal poly(A) site used in striated muscle and distal poly(A) site used in smooth muscle and fibroblasts; three of the poly(A) sites used in brain
  • Adenovirus major late Five poly(A) sites; the proximal poly(A) site, L1, used predominantly in transcription unit early infection; L3 dominates late in infection ⁇ -Tropomyosin Proximal poly(A) site used exclusively in skeletal muscle; other cell types use the distal poly(A) site; regulation may be at the level of the splice site choice
  • Calcitonin calcitonin gene- Proximal poly(A) site used in most cell types, generating the mRNA for related peptide (CGRP) calcitonin; distal poly(A) site used exclusively in neuronal cells, leading to production of CGRP doublesex (dsx) Drosophila gene required for somatic sexual differentiation that undergoes sex-specific alternative pre-mRNA processing; tra-2 protein required for regulated RNA processing and acts through its binding site in the dsx pre-mRNA Epidermal growth factor Proximal poly(A) site leads to production of secreted form of receptor, (EGF) receptor; rat which can inhibit the activities of the membrane-bound receptor; differs from human and chicken isoforms
  • NCAM cell types molecule
  • Plasma α(1,3)- Two poly(A) sites are used equally in liver; proximal poly(A) site fucosyltransferase (FUT6) favored in colon; distal poly(A) site used predominantly in kidney
  • Poly(A) polymerase Component of polyadenylation complex six isoforms generated via alternative splicing and polyadenylation; some isoforms found in all tissues examined, others show tissue-specific expression; use of one of three proximal poly(A) sites yields forms that contain the polymerase domain but not the serine/threonine-rich domain and the nuclear localization signal (three exons also composite)
  • the present invention is directed to identifying a target nucleic acid sequence which is predictive of a preselected disease state or biological condition.
  • the disease states or biological conditions include, but are not limited to, nucleic acids known to be important during inflammation, cardiovascular disease, pain, cancer, arthritis, trauma, obesity, Huntington's disease, neurological disorders, hyperproliferative conditions, neoplastic states or conditions, lupus erythematosus, and many other diseases or disorders.
  • ESTs: Expressed Sequence Tags
  • OMIM: Online Mendelian Inheritance in Man
  • CGAP: Cancer Genome Anatomy Project
  • GenBank, EMBL, PIR, SWISS-PROT, and the like.
  • OMIM, which is a database of genetic mutations associated with disease, was developed, in part, for the National Center for Biotechnology Information (NCBI).
  • NCBI: National Center for Biotechnology Information
  • CGAP is an interdisciplinary program to establish the information and technological tools required to decipher the molecular anatomy of a cancer cell.
  • CGAP can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/ncicgap/.
  • Some of these databases may contain complete or partial nucleotide sequences.
  • alternative transcript forms can also be selected from private genetic databases. Alternatively, alternative transcript forms can be selected from available publications or can be determined especially for use
  • the nucleotide sequence of the alternative transcript form preferably is determined.
  • the nucleotide sequence of the nucleic acid target is determined by scanning at least one genetic database or is identified in available publications.
  • Preferred databases known and available to those skilled in the art include, for example, the Expressed Gene Anatomy Database (EGAD) and Unigene-Homo Sapiens database (Unigene), GenBank, and the like.
  • EGAD contains a non-redundant set of human transcript (HT) sequences and can be accessed through the Internet at, for example, http://www.tigr.org/tdb/egad/egad.html.
  • Unigene is a system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters.
  • Each Unigene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
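To make the idea of partitioning sequences into gene-oriented clusters concrete, the sketch below groups sequences that share at least one exact k-mer, using a union-find structure. It is a deliberately simplified stand-in and should not be read as the procedure Unigene actually uses; the function name, the value of `k`, and the shared-k-mer criterion are assumptions made for illustration.

```python
def cluster_by_shared_kmers(seqs, k=12):
    """Group sequences into clusters when they share at least one exact k-mer.

    seqs: list of nucleotide strings.
    Returns a sorted list of clusters, each a list of indices into `seqs`.
    Sharing is transitive: if A overlaps B and B overlaps C, all three
    land in the same cluster.
    """
    parent = list(range(len(seqs)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first sequence containing it
    for idx, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                root_a, root_b = find(idx), find(first_seen[kmer])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(seqs)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())
```

A production clustering system would also have to handle sequencing errors, reverse-complement reads, and repeats, none of which this sketch attempts.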
  • Unigene contains hundreds of thousands of novel expressed sequence tag (EST) sequences.
  • Unigene can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/UniGene/.
  • These databases can be used in connection with searching programs such as, for example, Entrez, which is known and available to those skilled in the art, and the like.
  • Entrez can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/Entrez/.
  • the most complete nucleic acid sequence representation available from various databases is used.
  • The GenBank database, which is known and available to those skilled in the art, can also be used to obtain the most complete nucleotide sequence.
  • GenBank is the NIH genetic sequence database and is an annotated collection of all publicly available DNA sequences. GenBank is described in, for example, Nuc. Acids Res., 1998, 26, 1-7, which is incorporated herein by reference in its entirety, and can be accessed by those skilled in the art through the Internet at, for example, http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html. Alternatively, partial nucleotide sequences of nucleic acid targets can be used when a complete nucleotide sequence is not available.
  • nucleotide sequence of the nucleic acid target is determined by assembling a plurality of overlapping ESTs.
  • the EST database (dbEST), which is known and available to those skilled in the art, comprises approximately one million different human mRNA sequences comprising from about 500 to 1000 nucleotides, and various numbers of ESTs from a number of different organisms. dbEST can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/dbEST/index.html.
  • ESTs have applications in the discovery of new genes, mapping of genomes, and identification of coding regions in genomic sequences. Another important feature of EST sequence information that is becoming rapidly available is tissue-specific gene expression data. This can be extremely useful in targeting selective gene(s) for therapeutic intervention. Since EST sequences are relatively short, they must be assembled in order to provide a complete sequence. Because every available clone is sequenced, it results in a number of overlapping regions being reported in the database. The end result is the elicitation of alternative transcript forms from, for example, normal cells and cancer cells.
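The assembly of short, overlapping ESTs into a longer sequence can be illustrated with a greedy suffix-prefix merge. This is a minimal sketch under simplifying assumptions (error-free reads, a single strand, exact overlaps of at least `min_len` bases) and is not the algorithm of any particular assembler; the function names and parameters are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Returns the list of contigs left when no pair overlaps by `min_len`
    or more bases.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

When two different alternative transcript forms are present in the input, a merge-based approach of this kind will typically leave more than one contig or produce a chimeric one, which is one reason real assemblers track read provenance and tolerate mismatches.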
  • the resultant virtual transcript may represent an already characterized nucleic acid or may be a novel nucleic acid with no known biological function.
  • The Institute for Genomic Research (TIGR) Human Genome Index (HGI) database, which is known and available to those skilled in the art, contains a list of human transcripts. TIGR can be accessed through the Internet at, for example, http://www.tigr.org/. The transcripts were generated in this manner using TIGR Assembler, an engine for building virtual transcripts which is known and available to those skilled in the art.
  • TIGR Assembler is a tool for assembling large sets of overlapping sequence data such as ESTs, BACs, or small genomes, and can be used to assemble eukaryotic or prokaryotic sequences.
  • TIGR Assembler is described in, for example, Sutton, et al., Genome Science & Tech., 1995, 1, 9-19, which is incorporated herein by reference in its entirety, and can be accessed through the Internet at, for example, ftp://ftp.tigr.org/pub/software/TIGR assembler.
  • GLAXO-MRC, which is known and available to those skilled in the art, is another protocol for constructing virtual transcripts.
  • the members of a set of mRNA molecules are compared.
  • the set of mRNA molecules is a set of alternative transcript forms of mRNA.
  • the members of the set of alternative transcript forms of RNA include at least one member which is associated, or whose encoded protein is associated, with a disease state or biological condition.
  • a set of mRNA molecules for the mdm2 oncogene is compared.
  • At least one of the members of the set of mRNA alternative transcript forms is associated with cancer, as described above.
  • comparison of the members of the set of mRNA molecules results in the identification of at least one alternative transcript form of RNA which is associated, or whose encoded protein is associated, with a disease state or biological condition.
  • the members of the set of mRNA molecules are from a common gene. In another embodiment of the invention, the members of the set of mRNA molecules are from a plurality of genes. In another embodiment of the invention, the members of the set of mRNA molecules are from different taxonomic species. Nucleotide sequences of a plurality of nucleic acids from different taxonomic species can be identified by performing a sequence similarity search, an ortholog search, or both, such searches being known to persons of ordinary skill in the art.
  • Sequence similarity searches can be performed manually or by using several available computer programs known to those skilled in the art.
  • Blast and Smith-Waterman algorithms, which are available and known to those skilled in the art, and the like can be used.
  • Blast is NCBI's sequence similarity search tool designed to support analysis of nucleotide and protein sequence databases. Blast can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/BLAST/.
  • the GCG Package provides a local version of Blast that can be used either with public domain databases or with any locally available searchable database.
  • GCG Package v9.0 is a commercially available software package that contains over 100 interrelated software programs that enable analysis of sequences by editing, mapping, comparing and aligning them.
  • GCG can be accessed through the Internet at, for example, http://www.gcg.com/.
  • Fetch is a tool available in GCG that can get annotated GenBank records based on accession numbers and is similar to Entrez.
  • Another sequence similarity search can be performed with GeneWorld and GeneThesaurus from Pangea.
  • GeneWorld 2.5 is an automated, flexible, high-throughput application for analysis of polynucleotide and protein sequences.
  • GeneWorld allows for automatic analysis and annotations of sequences. Like GCG, GeneWorld incorporates several tools for homology searching, gene finding, multiple sequence alignment, secondary structure prediction, and motif identification.
  • GeneThesaurus 1.0™ is a sequence and annotation data subscription service providing information from multiple sources, providing a relational data model for public and local data.
  • BlastParse is a PERL script running on a UNIX platform that automates the strategy described above. BlastParse takes a list of target accession numbers of interest and parses all the GenBank fields into "tab-delimited" text that can then be saved in a "relational database" format for easier search and analysis, which provides flexibility. The end result is a series of completely parsed GenBank records that can be easily sorted, filtered, and queried against, as well as an annotations-relational database.
  • the plurality of nucleic acids from different taxonomic species which have homology to the target nucleic acid, as described above in the sequence similarity search, are further delineated so as to find orthologs of the target nucleic acid therein.
  • Ortholog is a term defined in gene classification to refer to two genes in widely divergent organisms that have sequence similarity and perform similar functions within the context of the organism.
  • paralogs are genes within a species that occur due to gene duplication, but have evolved new functions, and are also referred to as isotypes.
  • paralog searches can also be performed. By performing an ortholog search, an exhaustive list of homologous sequences from as diverse organisms as possible is obtained.
  • an ortholog search can be performed by programs available to those skilled in the art including, for example, Compare.
  • an ortholog search is performed with access to complete and parsed GenBank annotations for each of the sequences.
  • the records obtained from GenBank are "flat-files", and are not ideally suited for automated analysis.
  • the ortholog search is performed using a Q-Compare program. Preferred steps of the Q-Compare protocol are described in the flowchart set forth in U.S. Serial No. 09/076,440, incorporated herein by reference.
  • interspecies sequence comparison is performed using Compare, which is available and known to those skilled in the art.
  • Compare is a GCG tool that allows pairwise comparisons of sequences using a window/stringency criterion. Compare produces an output file containing points where matches of specified quality are found. These can be plotted with another GCG tool, DotPlot.
  • the molecular interaction site is present in the alternative transcript form of the mRNA which is likely associated, or whose encoded protein is likely associated, with a disease state or biological condition.
  • the molecular interaction site is identified by procedures well known to the skilled artisan.
  • the molecular interaction site can be identified based on the nucleic acid sequence of the particular alternative transcript form of the mRNA or can be based on secondary structures presented within the alternative transcript form of the mRNA.
  • Molecular interaction sites are small, usually less than 30 nucleotides, independently folded, functional subdomains contained within a larger RNA molecule.
  • Determining whether a particular alternative transcript form contains a molecular interaction site based on secondary structure can be performed by a number of procedures known to those skilled in the art. Determination of secondary structure is preferably performed by self complementarity comparison, alignment and covariance analysis, secondary structure prediction, or a combination thereof.
  • secondary structure analysis is performed by alignment and covariance analysis.
  • alignment is performed by ClustalW, which is available and known to those skilled in the art.
  • ClustalW is a tool for multiple sequence alignment that, although not a part of GCG, can be added as an extension of the existing GCG tool set and used with local sequences.
  • ClustalW can be accessed through the Internet at, for example, http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html.
  • ClustalW is also described in Thompson, et al., Nuc. Acids Res.
  • covariance software is used for covariance analysis.
  • Covariation, a set of programs for the comparative analysis of RNA structure from sequence alignments, is used. Covariation uses phylogenetic analysis of primary sequence information for consensus secondary structure prediction.
  • Covariation can be obtained through the Internet at, for example, http://www.mbio.ncsu.edu/RNaseP/info/programs/programs.html.
  • a complete description of a version of the program has been published (Brown, J. W. 1991 Phylogenetic analysis of RNA structure on the Macintosh computer. CABIOS 7:391-393).
  • the current version is v4.1, which can perform various types of covariation analysis from RNA sequence alignments, including standard covariation analysis, the identification of compensatory base-changes, and mutual information analysis.
  • the program is well-documented and comes with extensive example files.
  • secondary structure analysis is performed by secondary structure prediction.
  • There are a number of algorithms that predict RNA secondary structures based on thermodynamic parameters and energy calculations. Preferably, secondary structure prediction is performed using either M-fold or RNA Structure 2.52.
  • M-fold can be accessed through the Internet at, for example, http://www.ibc.wustl.edu/- zuker/ma form2.cgi or can be downloaded for local use on UNIX platforms. M-fold is also available as a part of the GCG package.
  • RNA Structure 2.52 is a Windows adaptation of the M-fold algorithm and can be accessed through the Internet at, for example, http://128.151.176.70/RNAstructure.html.
  • secondary structure analysis is performed by self complementarity comparison.
  • self complementarity comparison is performed using Compare, described above.
  • Compare can be modified to expand the pairing matrix to account for G-U or U-G basepairs in addition to the conventional Watson-Crick G-C/C-G or A-U/U-A pairs.
  • Such a modified Compare program begins by predicting all possible base-pairings within a given sequence. As described above, a small but conserved region, preferably a UTR, is identified based on primary sequence comparison of a series of orthologs. In modified Compare, each of these sequences is compared to its own reverse complement.
  • Allowable base-pairings include Watson-Crick A-U, G-C pairing and non-canonical G-U pairing.
  • a result of the secondary structure analysis described above, whether performed by alignment and covariance, self complementarity analysis, secondary structure predictions, such as using M-fold or otherwise, is the identification of secondary structure in other alternative transcript forms.
  • Exemplary secondary structures that may be identified include, but are not limited to, bulges, loops, stems, hairpins, knots, triple interacts, cloverleafs, or helices, or a combination thereof.
  • new secondary structures may be identified.
  • at least one structural motif molecular interaction site is identified. These structural motifs correspond to the identified secondary structures described above. For example, analysis of secondary structure by self complementation may provide one type of secondary structure, whereas analysis by M-fold may provide another secondary structure. All the possible secondary structures identified by secondary structure analysis described above are, thus, represented by a family of structural motifs.
  • transcript forms of mRNAs can be identified by searching on the basis of structure, rather than by primary nucleotide sequence, as described above. Additional alternative transcript forms which have secondary structure similar or identical to the secondary structure found as described above can be identified by constructing a family of descriptor elements for the structural motifs described above, and identifying other nucleic acids having secondary structures corresponding to the descriptor elements. The combination of any or all of the nucleic acids having secondary structure can be compiled into a database. The entire process can be repeated with a different target nucleic acid to generate a plurality of different secondary structure groups which can be compiled into the database. Thus, databases of molecular interaction sites can be compiled by performing the invention described herein.
  • a family of structure descriptor elements is constructed, as described in U.S. Serial No. 09/076,440, which is incorporated herein by reference in its entirety.
  • the structural motifs described above are converted into a family of descriptor elements.
  • One skilled in the art is familiar with construction of descriptors. Structure descriptors are described in, for example, Laferriere, et al., Comput. Appl. Biosci., 1994, 10, 211-212, incorporated herein by reference in its entirety.
  • a different structure descriptor element is constructed for each of the structural motifs identified from the secondary structure analysis. Briefly, the secondary structure is converted to a generic text string.
  • Descriptor elements may be defined to have various stringency.
  • the descriptor elements can be defined to allow for a wobble.
  • descriptor elements can be defined to have any level of stringency desired by the user.
  • nucleic acids having secondary structure which correspond to the structure descriptor elements are identified.
  • nucleic acids having secondary structure which correspond to the structure descriptor elements are identified by searching at least one database, performing clustering and analysis, identifying orthologs, or a combination thereof.
  • the identified alternative transcript forms have secondary structure which falls within the scope of the secondary structure defined by the descriptor elements.
  • the identified alternative transcript forms have secondary structure identical or nearly identical, depending on the stringency of the descriptor elements, to the alternative transcript forms previously identified.
  • nucleic acids having secondary structure which correspond to the structure descriptor elements are identified by searching at least one database.
  • Any genetic database can be searched.
  • the database is a UTR database, which is a compilation of the untranslated regions in messenger RNAs.
  • a UTR database is accessible through the Internet at, for example, ftp://area.ba.cnr.it/pub/embnet/database/utr/.
  • the database is searched using a computer program, such as, for example, Rnamot, a UNIX-based motif searching tool available from Daniel Gautheret. Each "new" sequence that has the same motif is then queried against public domain databases to identify additional sequences.
  • Results are analyzed for recurrence of pattern in UTRs of these additional ortholog sequences, as described below, and a database of RNA secondary structures is built.
  • Rnamot takes a descriptor string and searches any Fasta format database for possible matches. Descriptors can be very specific, to match exact nucleotide(s), or can have built-in degeneracy. Lengths of the stem and loop can also be specified. Single stranded loop regions can have a variable length. G-U pairings are allowed and can be specified as a wobble parameter. Allowable mismatches can also be included in the descriptor definition. Functional significance is assigned to the motifs if their biological role is known based on previous analysis.
  • the nucleic acids identified by searching databases such as, for example, searching a UTR database using Rnamot, are clustered and analyzed so as to determine their location within the genome.
  • the results provided by Rnamot simply identify sequences containing the secondary structure but do not give any indication as to the location of the sequence in the genome.
  • Clustering and analysis is preferably performed with ClustalW, as described above.
  • orthologs are identified as described above. However, in contrast to the orthologs identified above, which were solely identified on the basis of their primary nucleotide sequences, these new orthologous sequences are identified on the basis of structure using the nucleic acids identified using Rnamot. Identification of orthologs is preferably performed by BlastParse or Q-Compare, as described above. In embodiments of the invention in which a database containing prokaryotic molecular interaction sites is compiled, it is preferable to refrain from finding human orthologs or, alternatively, to discard human orthologs when found.
  • the nucleic acid sequence from said molecular interaction site is ascertained by routine methodology.
  • the nucleic acid sequences can be used to design targeting biomolecules, such as, for example, oligonucleotides, peptide nucleic acid molecules, ribozymes, and small molecules, which interact with the molecular interaction site.
  • the methods of the invention further include contacting the nucleic acid sequence with biomolecules, such as, for example, an oligonucleotide or small molecule.
  • the biomolecules preferably comprise toxin molecules.
  • Example 1 Molecular target in RNA formed from alternative initiation and splicing of the mdm2 oncogene
  • the mdm2 oncogene has been associated with a variety of human cancers.
  • the protein encoded by mdm2 physically binds to the anti-oncogene p53 protein and interferes with its function as a tumor suppressor.
  • the net result of suppression of a tumor suppressor is tumorigenesis. It was recently discovered that many tumor cells have greatly increased levels of mdm2 protein without a proportionate increase in mdm2 mRNA levels, suggesting that regulation of protein levels occurs downstream of transcription. It was discovered that cancer cells contain a form of the mdm2 mRNA that is different in the 5'-untranslated region.
  • the cancer-specific mdm2 RNA was found to contain three classes of unique structures shown in the box on the lower left side of the illustration.
  • the first structure shown labeled "unique exon structure" in Fig. 2, derives from unique sequences in Exon 1 that are not included in the mdm2 transcript found in normal cells.
  • This structure contains two unique internal loops separated by a stack of 5 base pairs and adjacent to a cytosine rich stem loop. Analysis of all mRNA transcripts in the current release of GenBank reveals that this structure is unique to the cancer-specific mdm2 transcript.
  • the second unique structure, found 3' to the first structure, is shown in red and blue. It is comprised of mRNA originating from Exons 1 and 3, which are uniquely found adjacent to each other in the cancer-specific form. This structure, which is also unique, can only exist where these exons are spliced together because it contains parts of each.
  • the right hand structure in the box is derived exclusively from mRNA that is from Exon 3. This structure could potentially exist in both the cancer and normal forms of the message. However, in the normal form, this RNA is part of a different structure which is disfavored in the Exon 1/Exon 3 junction form.
  • Example 2 The HER2/neu receptor in carcinoma cells
  • the HER2 proto-oncogene encodes a protein that binds to the membrane of the cell and transduces signals through a tyrosine kinase activity. This protein has clearly demonstrated association with breast cancer.
  • a product that targets this protein with a monoclonal antibody (Herceptin) has recently been approved for use by the FDA for the treatment of breast cancer.
  • HER2 receptor mRNA exists in at least two forms (Mol.
  • the two transcript forms are generated from alternative use of a splice site located 2050 nucleotides downstream from the start of the mRNA. In some cases the splice site is used to generate a transcript greater than 4,000 nucleotides. At other times, the splice site is not used. When it is not used, a polyadenylation site downstream of the splice junction triggers termination and polyadenylation of the mRNA. An in-frame stop codon is then used to terminate the protein during translation.
  • the truncated form of the protein contains the extracellular domain of the normal protein without the membrane anchor domain, which results in a secreted rather than a cell associated protein.
  • Transfection studies have shown that the truncated form of HER2 produces a protein that is released from the cell and results in resistance to the growth inhibiting effects of the monoclonal antibody used in cancer treatment.
  • cells producing the truncated form of the mRNA are undesired because they may play a role in resistance to an otherwise useful drug.
  • the truncated form of the transcript contains unique structures not found in the normal form (see, Fig. 3). The structure on the left is a portion of the normal form and the truncated form is on the right.
  • Helical structures 1 and 2 are common to both transcript forms.
  • Helices 4 from both forms are comprised of RNA that is, in part, common to both forms and, in part, unique to each form.
  • helix 4 in the truncated form is a unique target for proximity trigger technology, as are helices 5, 6 and 7, which are unique to the truncated form.
  • Helix 3 is another example of a structure that is comprised of RNA sequence that is common to both forms of the RNA, but still different in shape as a result of the sequences around it. It is also a useful target for proximity trigger technology.
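The self complementarity comparison described above, in which a sequence is compared with its own reverse complement under a pairing matrix expanded to include G-U pairs, can be sketched roughly as follows. This is a minimal illustration and not the GCG Compare program: the names `can_pair` and `find_stems`, the minimum stem length, and the minimum loop size are all assumptions introduced here.

```python
# Allowed base pairs: Watson-Crick A-U and G-C, plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b, allow_wobble=True):
    """Return True if bases a and b may pair under the expanded matrix."""
    if not allow_wobble and {a, b} == {"G", "U"}:
        return False
    return (a, b) in PAIRS


def find_stems(seq, min_stem=4, min_loop=3, allow_wobble=True):
    """Return maximal stems as (i, j, length) tuples.

    A stem pairs seq[i + k] with seq[j - k] for k = 0 .. length - 1,
    leaving at least min_loop unpaired bases between the two strands.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    stems = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if not can_pair(seq[i], seq[j], allow_wobble):
                continue
            # Skip pairs that merely continue a stem that starts further out.
            if i > 0 and j < n - 1 and can_pair(seq[i - 1], seq[j + 1], allow_wobble):
                continue
            length = 0
            while (i + length < j - length - min_loop
                   and can_pair(seq[i + length], seq[j - length], allow_wobble)):
                length += 1
            if length >= min_stem:
                stems.append((i, j, length))
    return stems


# Hypothetical hairpin whose fourth stem pair is a U-G wobble.
stems = find_stems("GGGUAAAAGCCC")
```

For the hypothetical hairpin above, `stems` is `[(0, 11, 4)]`; with `allow_wobble=False` the stem shortens to three pairs and falls below the assumed `min_stem` threshold, so nothing is reported.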

Abstract

The present invention is directed to methods of identifying target nucleic acid sequences which are predictive of preselected disease states or biological conditions in cells containing the nucleic acid sequence. Members of a set of mRNA molecules from a common gene, but containing different sequences and structures, are compared. The gene is predictive of the disease state or biological condition in cells containing the gene. At least one molecular interaction site from among those present in the members of the set is identified. The molecular interaction site is present in cells likely to have the disease state or biological condition. At least one nucleic acid sequence from the molecular interaction site is ascertained. Figure 2 illustrates how alternative initiation of the mdm2 gene in cancer and in normal cells results in unique RNA structures.

Description

IDENTIFICATION OF DISEASE PREDICTIVE NUCLEIC ACIDS
FIELD OF THE INVENTION
The present invention is directed to methods of identifying target nucleic acid sequences which are predictive of disease states or biological conditions in cells containing the nucleic acid sequence.
BACKGROUND OF THE INVENTION
Recent advances in genomics, molecular biology, and structural biology have highlighted how RNA molecules participate in or control many of the events required to express proteins in cells. Rather than function as simple intermediaries, RNA molecules actively regulate their own transcription from DNA, splice and edit mRNA molecules and tRNA molecules, synthesize peptide bonds in the ribosome, catalyze the migration of nascent proteins to the cell membrane, and provide fine control over the rate of translation of messages. RNA molecules can adopt a variety of unique structural motifs, which provide the framework required to perform these functions. The many functions of RNA molecules have also solidified their importance as therapeutic drug and diagnostic targets. Indeed, many investigators are pursuing mRNA transcripts and proteins produced therefrom that are expressed at different levels in cancer vs. normal cells in order to develop therapeutic and/or diagnostic compounds which modulate the cancer-causing mRNA transcript or protein. Indeed, 500 transcripts have been reported to be expressed at significantly different levels (15-fold on average) in normal vs. gastrointestinal tumor cells. Zhang, et al., Science, 1997, 276, 1268-72. Many genes have as many as 10-20 alternative transcript forms that, in some cases, have been associated with a cancer phenotype. For example, in cancerous cells, transcription of the mdm2 gene is initiated at a distinct site not used in normal cells. Landers, et al., Cancer Res., 1997, 57, 3562-3568, incorporated herein by reference in its entirety. In the Bcl-x mRNA, alternatively spliced forms of the transcript result in dramatically different cell behavior and sensitivity to chemotherapeutic drugs. Kuhl, et al., Br. J. Cancer, 1997, 75, 268-274, which is incorporated herein by reference in its entirety.
A universal technology platform to attack multiple forms of cancer has widely been believed to be impossible due to the heterogeneous nature of cancer. Thus, traditional cancer therapeutics has focused on individual cancer pathways and modulation of individual proteins and/or mRNA transcripts associated with the suspected causative pathway of the disease state. An unconventional, broadly applicable approach to cancer diagnosis and treatment, however, is greatly desired. Accordingly, the present invention provides the means to identify distinguishing features of types of cancer coupled with a common molecular mechanism to diagnose and selectively destroy the cancer cells or other cells associated with a disease state or biological condition. It is a principal object of the invention to identify a target nucleic acid sequence which is predictive of a disease state or biological condition in cells containing the nucleic acid sequence.
SUMMARY OF THE INVENTION
The present invention is directed to methods of identifying target nucleic acid sequences which are predictive of preselected disease states or biological conditions in cells containing the nucleic acid sequence. Members of a set of mRNA molecules from a common gene, but containing different sequences and structures, are compared. The gene is predictive of the disease state or biological condition in cells containing the gene. At least one molecular interaction site from among those present in the members of the set is identified. The molecular interaction site is present in cells likely to have the disease state or biological condition. At least one nucleic acid sequence from the molecular interaction site is ascertained.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates an example of a 3'-EST cluster.
Figure 2 illustrates how alternative initiation of the mdm2 gene in cancer and normal cells results in unique RNA structures.
Figure 3 illustrates HER2 alternative transcript forms.
The present invention is directed to methods of identifying target nucleic acid sequences which are predictive of preselected disease states or biological conditions, especially cancer, in cells containing the nucleic acid sequence. Members of a set of mRNA molecules from a common gene, but containing different sequences and structures, are compared. The members of such a set of mRNA molecules from a common gene, but containing different sequences and structures, are referred to as "alternative transcript forms." Comparison of the alternative transcript forms provides for the identification of at least one alternative transcript form which is associated with the disease state or biological condition. The alternative transcript form, or the protein which is encoded by the same, is not required to be directly involved in any pathogenesis pathway. For example, the alternative transcript form may be merely a marker for the disease state or biological condition without participating in or being required for establishment or maintenance of the disease state or biological condition. For example, the alternative transcript form or protein encoded thereby may be a by-product of the pathogenesis involved in the disease state or biological condition. Once an alternative transcript form which is associated with a disease state or biological condition is identified, the molecular interaction site, or cancer signature, for example, can be identified by the methods described herein. The molecular interaction site is unique to the alternative transcript form which is associated with the disease state or biological condition. The alternative transcript forms which are not associated with a disease state or biological condition do not contain the identified molecular interaction site. Once the molecular interaction site is identified, additional alternative transcript forms from a variety of genes can be analyzed to determine whether they comprise the same molecular interaction site.
In this manner, additional mRNA molecules can be identified which may also be predictive of a disease state. Alternative transcript forms originate from alternative initiation of transcription, alternative splicing, alternative 3'-end processing or a combination of these mechanisms. Alternative 3'-end processing may be the greatest source of alternative transcript forms. Studying 160,000 EST sequences, D. Gautheret and collaborators have shown that 20-40% of the transcripts have two or more different 3'-ends (See Fig. 1). Gautheret, Genome Res., 1998, 8, 524-530, which is incorporated herein by reference in its entirety. Other investigators have shown that certain classes of mRNAs are alternatively 3'-end processed in a tissue-specific or developmentally-specific pattern (Edwalds-Gilbert, Nucl. Acids. Res., 1997, 25, 2547-2561, which is incorporated herein by reference in its entirety) and in some cases this has been correlated with cancer. For example, the mss4 transcript was recently shown to have alternative 3'-end processing in pancreatic cancer. Muller-Pillasch, Genomics, 1997, 46, 389-396, which is incorporated herein by reference in its entirety. Alternative 3'-end formation does not change the protein composition, but can dramatically influence message stability and regulate translation by including or excluding regulatory sequences in the mRNA transcript.
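To picture how a figure such as "transcripts with two or more different 3'-ends" could be tallied from EST data, the sketch below groups the 3'-end coordinates of ESTs belonging to each transcript and counts the transcripts with at least two distinct end sites. It is not the method used in the cited study: the coordinates, the 10-nucleotide tolerance, and the function names are hypothetical and chosen only for illustration.

```python
def distinct_ends(end_positions, tolerance=10):
    """Group 3'-end coordinates; ends within `tolerance` nt of the previous
    end are treated as the same site. Returns a list of groups."""
    sites = []
    for pos in sorted(end_positions):
        if sites and pos - sites[-1][-1] <= tolerance:
            sites[-1].append(pos)
        else:
            sites.append([pos])
    return sites


def fraction_with_multiple_ends(clusters, tolerance=10):
    """Fraction of transcripts whose ESTs show two or more distinct 3'-ends."""
    multi = sum(
        1 for ends in clusters.values()
        if len(distinct_ends(ends, tolerance)) >= 2
    )
    return multi / len(clusters)


# Hypothetical EST 3'-end coordinates per transcript (illustrative only).
est_ends = {
    "transcript_A": [1500, 1503, 1498, 2210, 2214],
    "transcript_B": [980, 985, 982],
    "transcript_C": [3000, 3400],
    "transcript_D": [750],
    "transcript_E": [1200, 1205],
}

fraction = fraction_with_multiple_ends(est_ends)
```

With these made-up coordinates, two of the five transcripts (A and C) show two distinct 3'-end sites, so `fraction` is 0.4.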
Alternative transcript forms, despite being transcribed from identical DNA and translated into identical proteins, possess unique sequences and three-dimensional shapes that exist only at the level of RNA. A very important consequence of alternative transcript forms for cancer recognition is the unique shapes that they adopt. In contrast to the regular, helical nature of DNA, RNA strands form intricate stems, loops, and bulges, which are arranged into three-dimensional shapes that rival proteins in their complexity. Alternative transcript forms can produce different shapes in several ways. First, with alternative transcription initiation or 3'-end formation, there are unique sequences in mRNAs that do not appear at all in the normal mRNA (See, Fig. 2). These sequences, in turn, will fold into unique structures within themselves and with the adjacent RNA. Second, each alternative splicing event can produce a unique junction, in which the adjacent RNA on each side of the junction will re-arrange into a new three-dimensional shape.
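The second point, that a splicing event creates a junction sequence found in only one transcript form, can be illustrated at the level of primary sequence with the sketch below. It joins hypothetical exons into two forms and lists the short windows that occur only in the exon-skipped form. The exon sequences, the window length `k`, and the function names are invented for illustration and are not taken from mdm2 or any real gene; the sketch says nothing about the three-dimensional folding itself.

```python
def kmers(seq, k):
    """Return the set of all length-k windows in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_windows(form, other, k=6):
    """Length-k windows present in `form` but absent from `other`,
    listed in order of first appearance."""
    seen = kmers(other, k)
    out = []
    for i in range(len(form) - k + 1):
        window = form[i:i + k]
        if window not in seen and window not in out:
            out.append(window)
    return out


# Hypothetical exons (illustrative sequences only).
exon1, exon2, exon3 = "AUGGCUAGC", "UUCGAAGGA", "CCGUAUGAC"

normal_form = exon1 + exon2 + exon3   # exons 1-2-3
skipped_form = exon1 + exon3          # exon 2 skipped, new exon 1/exon 3 junction

junction_only = unique_windows(skipped_form, normal_form, k=6)
```

Here every window in `junction_only` spans the exon 1/exon 3 boundary, which is the sequence-level counterpart of a structure that "can only exist where these exons are spliced together."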
Alternative transcript forms are distinguishable from cancer-specific expression of transcripts. Cancer-specific transcripts provide a perfectly useful set of molecular targets for the invention described herein. However, the greater opportunity to find useful cancer signatures is in alternative transcript forms since there may be 2-20 different forms of every transcript and 10,000-20,000 genes expressed in any given cell. Thus, the opportunity to find cancer-specific alternative transcript forms may be much greater than for cancer-specific transcripts. Whether the origin of the cancer signature comes from cancer-specific transcripts or cancer-specific transcript forms, for the technology of the present invention it is not required that the cancer-specific differences in mRNA be responsible for cancer phenotype. It is not even important that it is known what they do. The important point is that they are present in cancer cells and can therefore be used to mark them for destruction. Accordingly, the present invention is directed, in part, to identifying at least one molecular interaction site within at least one alternative transcript form which is associated with the disease state or biological condition.
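The scale argument above can be made concrete with back-of-the-envelope arithmetic. The sketch below simply multiplies the quoted ranges, taking the gene count as 10,000 to 20,000 (an assumed reading of the figure given in the text); the variable names are illustrative.

```python
forms_per_transcript = (2, 20)        # range quoted in the text
genes_expressed = (10_000, 20_000)    # assumed reading of the quoted gene range

# Lower and upper bounds on distinct transcript forms in a given cell.
low = forms_per_transcript[0] * genes_expressed[0]
high = forms_per_transcript[1] * genes_expressed[1]

print(f"candidate transcript forms per cell: {low:,} to {high:,}")
print(f"versus whole transcripts:            {genes_expressed[0]:,} to {genes_expressed[1]:,}")
```

Under these assumptions the pool of candidate transcript forms runs from 20,000 to 400,000 per cell, which is up to twentyfold larger than the pool of whole transcripts.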
Alternative transcript forms and their association with particular disease states or conditions are known to one skilled in the art. For example, alternative transcript forms known to those skilled in the art are shown in Table 1 below and are described in more detail in Edwalds-Gilbert, Nucl. Acids. Res., 1997, 25, 2547-2561, which is incorporated herein by reference in its entirety. Any of the alternative transcript forms listed in Table 1, Table 2, and Table 3 can be used to identify molecular interaction sites as set forth below. Table 1 shows numerous examples of transcription units with multiple poly(A) sites, all within a single 3'-terminal exon. Included in Table 1 are those genes for which there is solid evidence for more than one RNA species. Two other major classes of gene organization leading to the generation of alternative poly(A) sites on mRNA are listed in Tables 2 and 3. The final protein products of both types of genes can differ at their C-termini depending on which processing pathway is followed. Exons are generally categorized as 5'-terminal, internal or 3'-terminal with polyadenylation signals in the UTR. A number of genes listed in Table 2 contain composite exons in which 5' splice sites can sometimes be silent, causing them to behave as 3'-terminal exons, or sometimes be active, thereby causing them to behave as internal exons, depending on the tissues in which the gene is expressed; these we call composite, in/terminal exons. Genes like the immunoglobulin heavy chains have an exon serving either as the first 3'-terminal exon in one mRNA (use of pA1) or as an internal exon in a second mRNA which ends with a normal 3'-terminal exon found further downstream (use of pA2). The primary transcripts from other genes like calcitonin/calcitonin gene-related peptide, listed in Table 3, are processed into two mRNAs by using either the first alternative 3'-terminal exon with its poly(A) site (pA1) or skipping that exon entirely and splicing the second 3'-terminal exon into the transcript, using pA2 instead. The distance between the poly(A) sites in these two classes of genes can be quite large (>3 kb in Ig genes) and differential sites of transcription termination, between the poly(A) sites, could change the distribution of 3'-end use in mRNA. Levels of basal polyadenylation factors, splicing factors and termination factors could all contribute cell type-specific mechanisms leading to 3'-end formation.
Table 1
Gene   Regulatory element   Where seen   Notable features/Comments
23 kDa Brain, retina From the PI 98 gene, which is highly
Transplantation conserved [including the poly(A) sites] antigen across mammalian species; two poly(A) sites Mouse; three poly(A) sites, two mRNAs α-Galactosidase A Muscle, brain Mouse, human; two poly(A) sites; second Acetylcholinesterase site predominates in muscle, first site predominates in brain
Activin βA subunit TPA treatment Human; eight possible poly(A) sites; treatment of HT1080 fibrosarcoma cells with TPA causes a shift over time to use of proximal poly(A) site
ADP ribosylation 3 (ARF4) Testes ARF 1 has two poly(A) sites conserved in factor (ARF) human and rat. ARF 4 makes a short, testes- specific mRNA generated by alternative polyadenylation
Aldolase B Mouse; one non-canonical and three canonical poly(A) sites, use of all four sites detectable in liver and kidney
Amphiglycan (syndecan 4, ryudocan) Chondrocytes At least two poly(A) signals; longer message is ubiquitous, shorter is tissue-specific; switch in poly(A) site use during chondrocyte differentiation Amyloid protein 2, 4 Sequence between two poly(A) sites increases translation of the longer mRNA
Androgen receptor Human; two poly(A) sites, the first is AUUAAA and the second is CAUAAA
Angiotensin Testes, Rabbit; testes- versus pulmonary-specific converting enzyme pulmonary forms (ACE) Ankyrin-1 tissue Ankyrin-1 Brain Mouse; both poly(A) sites used in erythroid tissues, distal site used in cerebellum
Apolipoprotein B Intestine Putative cryptic poly(A) site improved by editing
Arylsulfatase A Mutation of first poly(A) signal seen in arylsulfatase A pseudodeficiency
Axonin-1 Retina, brain Chicken; three poly(A) sites β-Tubulin HSV infection Changes in the ratio of the two forms occur during HSV infection β2-Microglobulin Murine; two poly(A) sites β3-Adrenergic receptor Human, rat; two poly(A) sites
Band 7.2b gene Many cell types Human; integral membrane phosphoprotein; distal site predominates in all tissues, proximal site use is significant in lung, liver and kidney and minimal in spleen
Brain-derived neurotrophic factor (BDNF) Heart, lung, brain Rat; isoform production controlled by alternative splicing; multiple promoters used; two poly(A) sites; ratio of proximal to distal site use varies among heart, lung, cerebral cortex
Cationic amino acid transporter gene (cat-1) 1 Cell density Rat; relative concentration of two mRNAs is regulated by cell density c-Mos 3 Porcine; protooncogene whose expression is restricted to gonadal tissues in the pig; alternative polyadenylation may play a role in translation
CD40 Differentiation Murine; differential poly(A) site use during B lymphocyte activation
CD59 (membrane inhibitor of reactive lysis) Many cell types Human; complement regulator protein; four possible poly(A) sites; use of two poly(A) sites varies in different cell lines
Chymotrypsin-like protease Human chromosome 16q22.1; alternative polyadenylation creates transcription unit which overlaps with oppositely oriented gene
Clathrin heavy chain gene 3 Developmental changes Mosquito; poly(A) site use differs between somatic cells and germ cells Collagenase 3 Human; three mRNAs seen in mammary carcinoma cells cAMP-responsive element modulator (CREM) Testes Follicle-stimulating hormone regulates CREM expression in testes by changing poly(A) site use, causing an increase in mRNA stability
Cyclin D1 Developmental changes Human, mouse, zebrafish; two poly(A) sites; change in poly(A) site use during zebrafish embryonic development; one major and two minor forms found in HeLa and all hematopoietic cells tested Cyclooxygenase-1 (COX-1) Human; two poly(A) sites Cyclooxygenase-2 (COX-2) Dexamethasone treatment Expression induced by cytokines; three poly(A) sites; dexamethasone treatment selectively destabilizes longer mRNA
Cytochrome P450 many cell types Human; two poly(A) sites, [second poly(A) aromatase signal AUUAAA]; mouse, porcine, equine; two poly(A) sites, 2.5 kb mRNA predominant in ovaries Cytochrome P450- Mouse; two poly(A) sites linked ferredoxin
Dihydrofolate Cell cycle Seven poly(A) sites; promoter-proximal site reductase (DHFR) used during growth stimulation
Dipeptidyl peptidase Mouse; two poly(A) sites in exon 26, IV (CD26) proximal poly(A) site predominates in all tissues examined
DNA polymerase β Testes, brain Rat; the 1.4 kb transcript predominates in testes and has a poly(A) signal AAUGAA; 4 kb transcript predominates in brain eIF-2α (translation 1, 4 Testes Two poly(A) sites; different ratio in initiation factor 2α) different tissues; the longer mRNA is more stable in activated T cells; the shorter mRNA has increased translatability; third poly(A) site used in testes eIF-4E (translation Many cell types Mouse; multiple poly(A) signals; 1.8 kb initiation factor 4E) transcript predominates in mouse kidney and liver and in a pre-B cell line, S194. A 1.5 kb transcript is abundant in mouse thymus and in S194 cells. Minor mRNAs of 2.2 and 2.5 kb correspond to use of alternate poly(A) signals as well eIF-5 translation Testes Mammalian; proximal poly(A) site used initiation factor 5) predominantly in testes, distal site favored in other tissues examined
Excision repair gene ERCC6 Testes Human; presumed helicase; two poly(A) signals, first is AUUAAA; shorter mRNA is primarily expressed in testes Fanconi anemia group C (FACC) Human; three poly(A) sites, longest transcript is most abundant and its poly(A) signal is AAUAAA; first two poly(A) signals are non-canonical; longest transcript contains a series of direct 35 bp repeats preceded by a 12 bp palindrome
Ferritin heavy chain Many cell types Human; two poly(A) signals; tissue-specific differences in ratio of use (brain, skeletal muscle versus placenta, liver, pancreas) Fibroblast growth Retinoic acid Mouse; both mRNAs inducible in F9 cell factor (int-2) treatment line by treatment with retinoic acid
Basic fibroblast Cell density Use of two poly(A) sites varies with cell growth factor density (bFGF)
Fibroglycan Human; at least two functional poly(A) (syndecan 2) signals
FMR1 Fragile X gene; two poly(A) sites
G protein γ subunit Many cell types Drosophila; use of three different poly( A) (D-G γl) sites is developmentally regulated and cell type specific: 2.6 kb transcript found in head, 1.3 kb transcript found in body, 1.1 kb transcript more abundant in head than in body
Gastric cathepsin E Human aspartic protease; two poly(A) sites
GATA-2 Transcription factor; two poly(A) sites
Grg Murine; related to the groucho transcript of the Drosophila Enhancer of split complex
Growth hormone Alternative poly(A) site in exon 5 generates receptor, avian short form, in the absence of alternative splicing, unlike mammalian counterpart
Heparan sulfate Liver, kidney Rat; major cell surface heparan sulfate proteoglycan proteoglycan; three poly(A) sites used in most tissues, most proximal site used only in liver and kidney
Herpes simplex virus Increased polyadenylation at weak viral sites type 1 (HSV-1) via effects on host cell CstF 64-kDa UL24
High mobility group Murine; three poly(A) sites 1 protein (HMG1)
Histone HI0 Butyrate Mouse; differentiation-specific histone HI; treatment two mRNAs, first poly(A) signal is AUUAAA; minor 0.9 kb mRNA becomes more stable during butyrate-induced dedifferentiation, mRNAs equally stable after treatment with actinomycin D
Huntington disease Brain Use of distal poly(A) site predominates in gene brain; most other tissues favor proximal poly(A) site
Integrin α5 Xenopus laevis; alternative polyadenylation occurs in the embryo
Interleukin-8 Human; two mRNAs equally abundant in receptor α neutrophils Iron regulatory Intracellular iron Human, rat; RNA binding protein whose protein 2 (IRP2) levels affinity for its binding sites is modulated by intracellular iron levels; increase in proximal poly(A) site use with reciprocal decrease in distal poly(A) site use in iron-depleted cells
Ketohexokinase Human; two poly(A) sites; second is (fructokinase) GAUAAA
Lamin B3 Testes Mouse; germ cell (testes) specific RNA processing of lamin B2 generates lamin B3
Lipoprotein lipase Many cell types Human; longer transcript predominates in skeletal and cardiac muscle; adipose tissue produces both forms of mRNA; longer transcript translated more efficiently than short one
Long chain acyl- Many cell types Mouse; two poly(A) sites CoA dehydrogenase (ACAD 1) Manganese Many cell types Rat; five poly(A) sites; first two sites used in superoxide all tissues tested; proximal pool site dismutase predominates in testes and liver, distal site used in heart, lung and kidney
Microtubule- Many cell types Mouse; 3'-UTR well conserved between associated protein 4 mouse and human; first two sites used in all (MAP4) tissues tested; third site used in muscle; fourth site used in testes, but first site predominates
Mitochondrial Rat; two poly(A) sites: AUUAAA and HMG-CoA synthase AUUAUC
N-Formyl peptide receptor (FMLF-R) Dibutyryl cAMP treatment Human; two-exon gene, at least two poly(A) sites; predominant use of proximal poly(A) site after treatment of HL60 human lymphoma cells with the differentiation agent dibutyryl cAMP NAD(P)H:quinone oxidoreductase Mitomycin C treatment Human colon cancer HCT 116 cells; two mRNAs; change in ratio after mitomycin C treatment
Non-muscle myosin Human; two poly(A) sites heavy chain mal-a Mouse; novel keratinocyte lipid-binding protein; tumor specific over expression; two poly(A) sites, use of first one predominates P-selectin Human, chromosome 12q24; major mRNA glycoprotein ligand species 2.5 kb, minor species 4 kb
Paramyosin Developmental changes Drosophila; use of two poly(A) sites is developmentally regulated Phosphofructokinase (PFK) Developmental changes Drosophila; use of three poly(A) sites is developmentally regulated
Platelet-derived Three poly(A) signals growth factor (PDGF)
PR264/SC35 Many cell types Human splicing factor; ratio of different forms varies among six different cell lines tested rab2 Many cell types Human Ras-related GTP binding protein; three potential poly(A) signals
RanGAP1 Testes Human; activator of Ras-related nuclear GTPase Ran; shows testes-specific polyadenylation
Renal glutaminase Many cell types Rat; ratio of poly(A) site use varies in different cell lines RHOA Human Ras-related GTP binding protein; protooncogene found in breast cancer cell lines; three poly(A) sites
Senescence marker Rat; two poly(A) sites protein-30 (SMP-30) set, putative Many cell types Human, mouse; ratio of two mRNAs varies oncogene associated in different cell types and five cell lines with myeloid tested; shorter mRNA predominates in liver leukemogenesis and kidney
Soluble angiotensin binding protein 3 Porcine; two poly(A) sites, first is GAUAAA; longer transcript may be regulated by SINE element in 3'-UTR Splicing factor 9G8 Many cell types Human; two poly(A) sites; pre-mRNA also subjected to alternative splicing
Steel Murine; encodes stem cell factor (SCF); distal poly(A) site used predominantly; 3'- UTR is 4.4 kb
Suppressor of forked Drosophila; three mRNAs su
Syndecan-1 Mouse; two poly(A) sites Tissue inhibitor of Human; two stable transcripts metalloproteinases-2 (TIMP-2)
Tissue inhibitor of TPA treatment Murine; three transcripts of 2.3, 2.8 and 4.6 metalloproteinases-3 kb. 4.6 kb most abundant. All three (TIMP-3) transcripts induced in pre-neoplastic JB6 cells treated with TPA Transforming Human; five possible poly(A) sites but only growth factor alpha two mRNAs detected; use of distal poly(A) (TGF α) site (AAUGAAA) predominates in most tissues
Triose phosphate Testes Rat; 1.4 kb mRNA found in most tissues and isomerase in somatic cells of testes; its level increases after retinol treatment; the 1.5 kb species is detected only in haploid spermatids
Tryptophanyl-tRNA Murine, human; two poly(A) sites, first is synthetase AAUCAA
Tubulin polycistronic pre-mRNA Trypanosomes; transcription unit undergoes trans-splicing and alternative polyadenylation, which may be coupled in this system
Vascular endothelial Hypoxia Rat; two poly(A) sites; regulation of poly(A) growth factor site use by hypoxia (VEGF)
WNT-5A Human; expression in early embryogenesis
ZAKI-4 Many cell types Human thyroid hormone-responsive gene; two mRNAs, first poly(A) signal is AUUAAA; short mRNA predominates in heart and brain, trace amounts found in liver; long mRNA predominates in skeletal muscle; no messages detected in placenta, lung, kidney, pancreas
Table 2
Gene Notes on regulation
(2'-5') Oligo A synthetase Transcription induced by interferon-β; distal poly(A) site favored after induction; proximal poly(A) site used predominantly during basal transcription β-Spectrin Proximal poly(A) site used exclusively in erythroid cells; default pattern of pre-mRNA processing uses distal poly(A) site
C3b/C4b receptor Use of proximal poly(A) site yields secreted form of receptor; (complement receptor type predominant membrane-bound receptor is generated by use of distal 1) poly(A) site
Cek5 Chicken receptor protein-tyrosine kinase of the Eph subfamily; use of the proximal poly(A) site yields secreted form of kinase, whose expression is low relative to the full-length Cek5 receptor
Epidermal growth factor Proximal poly(A) site leads to production of secreted form of receptor, (EGF) receptor; human, which can inhibit the activities of the membrane-bound receptor chicken exuperantia (exu) Drosophila gene required for both oogenesis and spermatogenesis that undergoes sex-specific alternative pre-mRNA processing; tra-2 gene required for male specific RNA processing
Fibrinogen γ-chain Rat pre-mRNA undergoes liver-specific choice of proximal poly(A) site; other cell types always use distal poly(A) site
Fibroblast growth factor Secreted form of receptor generated by use of the proximal poly(A) site; (FGF) receptor membrane-bound forms are produced by use of distal poly(A) site; secreted form also binds FGF
GARS/AIRS/GART Glycinamide ribonucleotide synthetase (GARS)/aminoimidazole ribonucleotide synthetase (AIRS)/glycinamide ribonucleotide formyltransferase (GART); enzyme required for purine synthesis; use of proximal site corresponds to production of the monofunctional enzyme; use of the distal site yields the trifunctional enzyme; all tissues examined favor distal poly(A) site
Glucocorticoid receptor β form of receptor produced by use of the proximal poly(A) site; more abundant α form uses the distal poly(A) site
HER2/neu receptor Protein tyrosine kinase receptor in which membrane-bound form is produced from mRNA using the distal poly(A) site; use of proximal poly(A) site leads to shorter, intracellular form of the receptor; use of the proximal and distal poly(A) sites varies greatly in different tumor cell lines
Hepatocyte nuclear factor Hepatocyte nuclear factor homeoprotein family important for liver- (HNFl/vHNFl) specific expression of a number of genes; poly(A) site choice and intron inclusion contribute to the generation of HNF1 isoforms, all of which contain different C-terminal domains, have distinct effects on transcription and can form homo- and heterodimers; mRNA levels for these isoforms vary in different tissue types and in some fetal versus adult tissues Ig α heavy chain Use of proximal poly(A) site produces mRNA encoding secreted form of antibody; use of the distal poly(A) site generates mRNA for membrane- bound antigen receptor; secretory-specific mRNA dominant in plasma cells whereas there are equal amounts of the two mRNAs in mature or memory B cells
Ig ε heavy chain Pattern of regulation similar to Ig α heavy chain pre-mRNAs Ig γ heavy chain Pattern of regulation similar to Ig α heavy chain pre-mRNAs Ig μ heavy chain Pattern of regulation similar to Ig α and to other Ig heavy chain pre- mRNAs; can also include transcription termination as a mechanism of proximal poly(A) site selection
Leukemia inhibitory factor receptor α-chain Member of hemopoietin receptor family; murine gene produces a secreted [proximal poly(A) site] and membrane-bound form [distal poly(A) site], with increase in the secreted form during pregnancy
Nuclear factor I-B3 Distal poly(A) site favored in all tissues examined, proximal poly(A) site used in heart and skeletal muscle; protein encoded by the shorter mRNA acts as a transcriptional repressor
Plasma membrane Ca2+- Use of proximal poly(A) site specific to skeletal muscle and brain ATPase isoform 3 Poly(A) polymerase Component of polyadenylation complex; six isoforms generated via alternative splicing and polyadenylation; some isoforms found in all tissues examined, others show tissue-specific expression; use of one of three proximal poly(A) sites yields forms that contain the polymerase domain but not the serine/threonine-rich domain and nuclear localization signal
Sarco/endoplasmic reticulum Ca2+-ATPase (SERCA) Five protein isoforms are generated from three different SERCA genes plus alternative processing events; regulation of expression is both developmental and tissue specific and is suggested to be at the level of splicing rather than polyadenylation; two SERCA2 protein isoforms are translated from four different mRNAs generated by tissue-dependent alternative processing, one of which is brain specific; SERCA2a protein is muscle specific, SERCA2b is found in non-muscle tissues and smooth muscle
Secretory PLA2 receptor Receptor has similar structural organization to macrophage mannose receptor; acts as a mediator of inflammatory processes; secreted form of phospholipaseA2 receptor found in human kidney; membrane bound receptor is widely expressed, including in kidney
Thyroid hormone receptor (c-erbA-1) Proximal poly(A) site yields α1, which binds thyroid hormone; distal poly(A) site produces α2, which cannot bind thyroid hormone; ratio of two mRNAs varies in different tissue; α2 transcript overlaps with gene transcribed in opposite direction, Rev-ErbAα α-Tropomyosin At least four poly(A) sites; proximal poly(A) site used in striated muscle and distal poly(A) site used in smooth muscle and fibroblasts; three of the poly(A) sites used in brain
Table 3
Gene Notes on regulation
Adenovirus major late transcription unit Five poly(A) sites; the proximal poly(A) site, L1, used predominantly in early infection; L3 dominates late in infection β-Tropomyosin Proximal poly(A) site used exclusively in skeletal muscle; other cell types use the distal poly(A) site; regulation may be at the level of the splice site choice
Calcitonin/calcitonin gene-related peptide (CGRP) Proximal poly(A) site used in most cell types, generating the mRNA for calcitonin; distal poly(A) site used exclusively in neuronal cells, leading to production of CGRP doublesex (dsx) Drosophila gene required for somatic sexual differentiation that undergoes sex-specific alternative pre-mRNA processing; tra-2 protein required for regulated RNA processing and acts through its binding site in the dsx pre-mRNA Epidermal growth factor (EGF) receptor; rat Proximal poly(A) site leads to production of secreted form of receptor, which can inhibit the activities of the membrane-bound receptor; differs from human and chicken isoforms
FLT4 receptor tyrosine Ratio of the mRNAs using the proximal or distal poly(A) site varies in kinase different cell lines
Neural cell adhesion Ratio of the mRNAs produced varies in different cell types molecule (NCAM)
Plasma α(1,3)-fucosyltransferase (FUT6) Two poly(A) sites are used equally in liver; proximal poly(A) site favored in colon; distal poly(A) site used predominantly in kidney
Poly(A) polymerase Component of polyadenylation complex; six isoforms generated via alternative splicing and polyadenylation; some isoforms found in all tissues examined, others show tissue-specific expression; use of one of three proximal poly(A) sites yields forms that contain the polymerase domain but not the serine/threonine-rich domain and the nuclear localization signal (three exons also composite)
Unique human gene of unknown function Spans over 230 kb in human chromosome 8p11-12; codes multiple proteins sharing RNA binding motifs
The present invention is directed to identifying a target nucleic acid sequence which is predictive of a preselected disease state or biological condition. The disease states or biological conditions include, but are not limited to, inflammation, cardiovascular disease, pain, cancer, arthritis, trauma, obesity, Huntington's disease, neurological disorders, hyperproliferative conditions, neoplastic states or conditions, lupus erythematosus, and many other diseases or disorders in which particular nucleic acids are known to be important.
From analysis of Expressed Sequence Tags (ESTs), it has been found that mRNA transcripts are much more heterogeneous than previously anticipated. Alternative transcript forms of mRNA molecules can be identified by using ESTs from a variety of databases. Preferred databases include, for example, Online Mendelian Inheritance in Man (OMIM), the Cancer Genome Anatomy Project (CGAP), GenBank, EMBL, PIR, SWISS-PROT, and the like. OMIM, which is a database of genetic mutations associated with disease, was developed, in part, for the National Center for Biotechnology Information (NCBI). OMIM can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/Omim/. CGAP is an interdisciplinary program to establish the information and technological tools required to decipher the molecular anatomy of a cancer cell. CGAP can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/ncicgap/. Some of these databases may contain complete or partial nucleotide sequences. In addition, alternative transcript forms can also be selected from private genetic databases. Alternatively, alternative transcript forms can be selected from available publications or can be determined especially for use in connection with the present invention.
After an alternative transcript form is selected or provided, the nucleotide sequence of the alternative transcript form preferably is determined. In one embodiment of the invention, the nucleotide sequence of the nucleic acid target is determined by scanning at least one genetic database or is identified in available publications. Preferred databases known and available to those skilled in the art include, for example, the Expressed Gene Anatomy Database (EGAD) and Unigene-Homo Sapiens database (Unigene), GenBank, and the like. EGAD contains a non-redundant set of human transcript (HT) sequences and can be accessed through the Internet at, for example, http://www.tigr.org/tdb/egad/egad.html. Unigene is a system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each Unigene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
In addition, Unigene contains hundreds of thousands of novel expressed sequence tag (EST) sequences. Unigene can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/UniGene/. These databases can be used in connection with searching programs such as, for example, Entrez, which is known and available to those skilled in the art, and the like. Entrez can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/Entrez/. Preferably, the most complete nucleic acid sequence representation available from various databases is used. The GenBank database, which is known and available to those skilled in the art, can also be used to obtain the most complete nucleotide sequence. GenBank is the NIH genetic sequence database and is an annotated collection of all publicly available DNA sequences. GenBank is described in, for example, Nuc. Acids Res., 1998, 26, 1-7, which is incorporated herein by reference in its entirety, and can be accessed by those skilled in the art through the Internet at, for example, http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html. Alternatively, partial nucleotide sequences of nucleic acid targets can be used when a complete nucleotide sequence is not available.
Alternative transcript forms can be generated from individual ESTs within each of the databases by computer software which generates contiguous sequences. In another embodiment of the present invention, the nucleotide sequence of the nucleic acid target is determined by assembling a plurality of overlapping ESTs. The EST database (dbEST), which is known and available to those skilled in the art, comprises approximately one million different human mRNA sequences comprising from about 500 to 1000 nucleotides, and various numbers of ESTs from a number of different organisms. dbEST can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/dbEST/index.html. These sequences are derived from a cloning strategy that uses cDNA expression clones for genome sequencing. ESTs have applications in the discovery of new genes, mapping of genomes, and identification of coding regions in genomic sequences. Another important feature of EST sequence information that is rapidly becoming available is tissue-specific gene expression data. This can be extremely useful in targeting selected gene(s) for therapeutic intervention. Since EST sequences are relatively short, they must be assembled in order to provide a complete sequence. Because every available clone is sequenced, a number of overlapping regions are reported in the database. The end result is the elicitation of alternative transcript forms from, for example, normal cells and cancer cells.
Assembly of overlapping ESTs extended along both the 5' and 3' directions results in a full-length "virtual transcript." The resultant virtual transcript may represent an already characterized nucleic acid or may be a novel nucleic acid with no known biological function. The Institute for Genomic Research (TIGR) Human Genome Index (HGI) database, which is known and available to those skilled in the art, contains a list of human transcripts. TIGR can be accessed through the Internet at, for example, http://www.tigr.org/. The transcripts were generated in this manner using TIGR-Assembler, an engine to build virtual transcripts and which is known and available to those skilled in the art. TIGR-Assembler is a tool for assembling large sets of overlapping sequence data such as ESTs, BACs, or small genomes, and can be used to assemble eukaryotic or prokaryotic sequences. TIGR-Assembler is described in, for example, Sutton et al., Genome Science & Tech., 1995, 1, 9-19, which is incorporated herein by reference in its entirety, and can be accessed through the Internet at, for example, ftp://ftp.tigr.org/pub/software/TIGR assembler. In addition, GLAXO-MRC, which is known and available to those skilled in the art, is another protocol for constructing virtual transcripts. In addition, the "Find Neighbors and Assemble EST Blast" protocol, which runs on a UNIX platform, has been developed by Applicants to construct virtual transcripts. PHRAP is used for sequence assembly within Find Neighbors and Assemble EST Blast. PHRAP can be accessed through the Internet at, for example, http://chimera.biotech.washington.edu/uwgc/tools/phrap.htm. Identification of ESTs and generation of contiguous ESTs to form full-length RNA molecules is described in detail in U.S. Application Serial No. 09/076,440, which is incorporated herein by reference in its entirety.
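The assembly of overlapping ESTs into a virtual transcript can be illustrated with a minimal sketch. The following Python example uses a simple greedy exact-overlap merge; it is an illustrative assumption only and does not reproduce TIGR-Assembler, PHRAP, or the Find Neighbors and Assemble EST Blast protocol. The function names, the minimum overlap length, and the toy EST sequences are hypothetical, and real EST data would additionally require handling of sequencing errors and reverse complements.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def assemble(ests, min_len=5):
    """Greedily merge the pair of fragments with the largest exact overlap until none remains."""
    contigs = list(dict.fromkeys(ests))  # drop exact duplicates, keep input order
    # Discard fragments that are wholly contained in another fragment.
    contigs = [c for c in contigs if not any(c != d and c in d for d in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: these are the virtual transcripts
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical toy ESTs that overlap end to end.
toy_ests = ["TTAGCCATAA", "ATGGCGTACG", "GTACGTTAGC"]
print(assemble(toy_ests))
```

When the fragments do not all overlap, the function returns more than one contig, which corresponds to an incomplete virtual transcript.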
The members of a set of mRNA molecules are compared. Preferably, the set of mRNA molecules is a set of alternative transcript forms of mRNA. Preferably, the members of the set of alternative transcript forms of RNA include at least one member which is associated, or whose encoded protein is associated, with a disease state or biological condition. For example, a set of mRNA molecules for the mdm2 oncogene is compared. At least one of the members of the set of mRNA alternative transcript forms is associated with cancer, as described above. Thus, comparison of the members of the set of mRNA molecules results in the identification of at least one alternative transcript form of RNA which is associated, or whose encoded protein is associated, with a disease state or biological condition. In a preferred embodiment of the invention, the members of the set of mRNA molecules are from a common gene. In another embodiment of the invention, the members of the set of mRNA molecules are from a plurality of genes. In another embodiment of the invention, the members of the set of mRNA molecules are from different taxonomic species. Nucleotide sequences of a plurality of nucleic acids from different taxonomic species can be identified by performing a sequence similarity search, an ortholog search, or both, such searches being known to persons of ordinary skill in the art.
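The comparison of members of a set of alternative transcript forms can be sketched as follows. This Python example uses the standard-library difflib.SequenceMatcher to report stretches present in one transcript form but absent from another. The function name unique_segments and the toy sequences are illustrative assumptions and are not sequences or software from this disclosure; full-length transcripts would ordinarily be compared with a dedicated alignment program.

```python
from difflib import SequenceMatcher


def unique_segments(form_a, form_b):
    """Return (start, sequence) for each stretch of form_a that has no counterpart in form_b."""
    matcher = SequenceMatcher(None, form_a, form_b, autojunk=False)
    segments = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # 'delete' and 'replace' mark regions of form_a not matched in form_b.
        if tag in ("delete", "replace"):
            segments.append((a_start, form_a[a_start:a_end]))
    return segments


# Hypothetical toy sequences: the variant form carries one extra internal stretch.
reference_form = "ATGGCGTACGTTAGCCATAA"
variant_form = "ATGGCGTACG" + "CCCCCCCC" + "TTAGCCATAA"
print(unique_segments(variant_form, reference_form))
```

A stretch reported this way is only a candidate region; whether it is associated with a disease state or biological condition must be established separately.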
Sequence similarity searches can be performed manually or by using several available computer programs known to those skilled in the art. Preferably, Blast and Smith-Waterman algorithms, which are available and known to those skilled in the art, and the like can be used. Blast is NCBI's sequence similarity search tool designed to support analysis of nucleotide and protein sequence databases. Blast can be accessed through the Internet at, for example, http://www.ncbi.nlm.nih.gov/BLAST/. The GCG Package provides a local version of Blast that can be used either with public domain databases or with any locally available searchable database. GCG Package v9.0 is a commercially available software package that contains over 100 interrelated software programs that enable analysis of sequences by editing, mapping, comparing and aligning them. Other programs included in the GCG Package include, for example, programs which facilitate RNA secondary structure predictions, nucleic acid fragment assembly, and evolutionary analysis. In addition, the most prominent genetic databases (GenBank, EMBL, PIR, and SWISS-PROT) are distributed along with the GCG Package and are fully accessible with the database searching and manipulation programs. GCG can be accessed through the Internet at, for example, http://www.gcg.com/. Fetch is a tool available in GCG that can get annotated GenBank records based on accession numbers and is similar to Entrez. Another sequence similarity search can be performed with GeneWorld and GeneThesaurus from Pangea. GeneWorld 2.5 is an automated, flexible, high-throughput application for analysis of polynucleotide and protein sequences. GeneWorld allows for automatic analysis and annotations of sequences. Like GCG, GeneWorld incorporates several tools for homology searching, gene finding, multiple sequence alignment, secondary structure prediction, and motif identification. GeneThesaurus 1.0™ is a sequence and annotation data subscription service providing information from multiple sources, providing a relational data model for public and local data.
Another alternative sequence similarity search can be performed, for example, by BlastParse. BlastParse is a PERL script running on a UNIX platform that automates the strategy described above. BlastParse takes a list of target accession numbers of interest and parses all the GenBank fields into "tab-delimited" text that can then be saved in a "relational database" format for easier search and analysis, which provides flexibility. The end result is a series of completely parsed GenBank records that can be easily sorted, filtered, and queried against, as well as an annotations-relational database.
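The parsing step described for BlastParse can be illustrated with a minimal Python sketch. This is not the BlastParse script itself, which is a PERL program whose source is not reproduced here; the function names `parse_genbank_record` and `to_tsv`, the sample record, and the accession `TEST0001` are all hypothetical. The sketch assumes a simplified GenBank flat file in which top-level keywords start in the first column and continuation lines are indented. It skips the feature table and sequence, folds indented sub-keywords into their parent field, and keeps only the last value of a repeated keyword.

```python
import csv
import io


def parse_genbank_record(text):
    """Collect top-level GenBank keywords (LOCUS, DEFINITION, ...) into a dict."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line.startswith(("FEATURES", "ORIGIN")):
            key = None                        # feature table and sequence are skipped here
            continue
        if line[:1].strip():                  # a keyword begins in the first column
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None and line.strip():
            fields[key] += " " + line.strip()  # indented continuation line
    return fields


def to_tsv(records, columns):
    """Write the chosen fields of each parsed record as tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        writer.writerow([rec.get(col, "") for col in columns])
    return out.getvalue()


# Hypothetical record used only to exercise the sketch.
SAMPLE = """LOCUS       TEST0001     12 bp    mRNA    linear   PRI 01-JAN-1999
DEFINITION  Hypothetical example transcript,
            5' untranslated region.
ACCESSION   TEST0001
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 gggaaacccu uu
//
"""

if __name__ == "__main__":
    record = parse_genbank_record(SAMPLE)
    print(to_tsv([record], ["ACCESSION", "DEFINITION"]))
```

The tab-delimited output can be loaded into any relational database or spreadsheet, which is the kind of sorting, filtering, and querying the paragraph above refers to.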
Preferably, the plurality of nucleic acids from different taxonomic species which have homology to the target nucleic acid, as described above in the sequence similarity search, are further delineated so as to find orthologs of the target nucleic acid therein. Orthologs, as the term is used in gene classification, are genes in widely divergent organisms that have sequence similarity and perform similar functions within the context of the organism. In contrast, paralogs are genes within a species that arise by gene duplication but have evolved new functions, and are also referred to as isotypes. Optionally, paralog searches can also be performed. By performing an ortholog search, an exhaustive list of homologous sequences from as diverse organisms as possible is obtained. Subsequently, these sequences are analyzed to select the best representative sequence that fits the criteria for being an ortholog. An ortholog search can be performed by programs available to those skilled in the art including, for example, Compare. Preferably, an ortholog search is performed with access to complete and parsed GenBank annotations for each of the sequences. Currently, the records obtained from GenBank are "flat-files", and are not ideally suited for automated analysis. Preferably, the ortholog search is performed using a Q-Compare program. Preferred steps of the Q-Compare protocol are described in the flowchart set forth in U.S. Serial No. 09/076,440, incorporated herein by reference.
Preferably, interspecies sequence comparison is performed using Compare, which is available and known to those skilled in the art. Compare is a GCG tool that allows pairwise comparisons of sequences using a window/stringency criterion. Compare produces an output file containing points where matches of specified quality are found. These can be plotted with another GCG tool, DotPlot.
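A window/stringency comparison of the kind described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the GCG Compare implementation; the function name `window_matches` and its default parameters are assumptions. Each window of one sequence is compared with each window of the other, and a point is recorded wherever the number of identical positions meets the stringency threshold.

```python
def window_matches(seq_a, seq_b, window=5, stringency=4):
    """Return (i, j) window start positions where seq_a[i:i+window] and
    seq_b[j:j+window] agree in at least `stringency` positions."""
    points = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if score >= stringency:
                points.append((i, j))
    return points


if __name__ == "__main__":
    # One mismatch in a 5-nucleotide window still passes a stringency of 4.
    print(window_matches("ACGUA", "ACGAA", window=5, stringency=4))
```

The returned points correspond to the dots that a plotting tool would draw; diagonal runs of points indicate regions of similarity between the two sequences.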
Once the members of the set of mRNA molecules are compared, at least one molecular interaction site from among those that are present in the members of the set is identified. The molecular interaction site is present in the alternative transcript form of the mRNA which is likely associated, or whose encoded protein is likely associated, with a disease state or biological condition. The molecular interaction site is identified by procedures well known to the skilled artisan. The molecular interaction site can be identified based on the nucleic acid sequence of the particular alternative transcript form of the mRNA or can be based on secondary structures presented within the alternative transcript form of the mRNA. Molecular interaction sites are small, usually less than 30 nucleotides, independently folded, functional subdomains contained within a larger RNA molecule. Determining whether a particular alternative transcript form contains a molecular interaction site based on secondary structure can be performed by a number of procedures known to those skilled in the art. Determination of secondary structure is preferably performed by self complementarity comparison, alignment and covariance analysis, secondary structure prediction, or a combination thereof.
In one embodiment of the invention, secondary structure analysis is performed by alignment and covariance analysis. Numerous protocols for alignment and covariance analysis are known to those skilled in the art. Preferably, alignment is performed by ClustalW, which is available and known to those skilled in the art. ClustalW is a tool for multiple sequence alignment that, although not a part of GCG, can be added as an extension of the existing GCG tool set and used with local sequences. ClustalW can be accessed through the Internet at, for example, http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html. ClustalW is also described in Thompson, et al., Nuc. Acids Res., 1994, 22, 4673-4680, which is incorporated herein by reference in its entirety. These processes can be scripted to automatically use conserved UTR regions identified in earlier steps. Seqed, a UNIX command line interface available and known to those skilled in the art, allows extraction of selected local regions from a larger sequence. Multiple sequences from many different species can be clustered and aligned for further analysis. Covariation is a process of using phylogenetic analysis of primary sequence information for consensus secondary structure prediction. Covariation is described in the following references, each of which is incorporated herein by reference in its entirety: Gutell, et al., "Comparative Sequence Analysis Of Experiments Performed During Evolution" In Ribosomal RNA Group I Introns, Green, Ed., Austin: Landes, 1996; Gautheret, et al., Nuc. Acids Res., 1997, 25, 1559-1564; Gautheret, et al., RNA, 1995, 1, 807-814; Lodmell, et al., Proc. Natl. Acad. Sci. USA, 1995, 92, 10555-10559; Gautheret, et al., J. Mol. Biol., 1995, 248, 27-43; Gutell, Nuc. Acids Res., 1994, 22, 3502-3517; Gutell, Nuc. Acids Res., 1993, 21, 3055-3074; Gutell, Nuc. Acids Res., 1993, 21, 3051-3054; Woese, Proc. Natl. Acad. Sci. USA, 1989, 86, 3119-3122; and Woese, et al., Nuc. Acids Res., 1980, 8, 2275-2293. Preferably, covariance software is used for covariance analysis. Preferably, Covariation, a set of programs for the comparative analysis of RNA structure from sequence alignments, is used. Covariation can be obtained through the Internet at, for example, http://www.mbio.ncsu.edu/RNaseP/info/programs/programs.html. A complete description of a version of the program has been published (Brown, J. W. 1991 Phylogenetic analysis of RNA structure on the Macintosh computer. CABIOS 7:391-393). The current version is v4.1, which can perform various types of covariation analysis from RNA sequence alignments, including standard covariation analysis, the identification of compensatory base-changes, and mutual information analysis. The program is well-documented and comes with extensive example files. It is compiled as a stand-alone program and does not require HyperCard (although a much smaller 'stack' version is included). This program will run in any Macintosh environment running MacOS v7.1 or higher. A faster processor (68040 or PowerPC) is suggested for mutual information analysis or the analysis of large sequence alignments.
In another embodiment of the invention, secondary structure analysis is performed by secondary structure prediction. There are a number of algorithms that predict RNA secondary structures based on thermodynamic parameters and energy calculations. Preferably, secondary structure prediction is performed using either M-fold or RNA Structure 2.52. M-fold can be accessed through the Internet at, for example, http://www.ibc.wustl.edu/~zuker/rna/form2.cgi or can be downloaded for local use on UNIX platforms. M-fold is also available as a part of the GCG package. RNA Structure 2.52 is a Windows adaptation of the M-fold algorithm and can be accessed through the Internet at, for example, http://128.151.176.70/RNAstructure.html.
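The mutual information analysis mentioned above for covariation can be illustrated with a short Python sketch. It is not taken from the Covariation program; the function name `mutual_information` and the four-sequence alignment are hypothetical. The sketch computes the standard mutual information, in bits, between two columns of an alignment. Two columns that vary together, as the two sides of a conserved base pair are expected to, give a high score, while an invariant column gives zero.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment,
    given as a list of equal-length aligned sequences."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


if __name__ == "__main__":
    # Hypothetical alignment: columns 0 and 3 change together, columns 1 and 2 are invariant.
    alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]
    print(mutual_information(alignment, 0, 3))
    print(mutual_information(alignment, 1, 2))
```

In practice such scores would be computed for all column pairs of a real ortholog alignment, and high-scoring pairs would be treated as candidate base pairs to be checked against the predicted structures.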
In another embodiment of the invention, secondary structure analysis is performed by self complementarity comparison. Preferably, self complementarity comparison is performed using Compare, described above. More preferably, Compare can be modified to expand the pairing matrix to account for G-U or U-G basepairs in addition to the conventional Watson-Crick G-C/C-G or A-U/U-A pairs. Such a modified Compare program (modified Compare) begins by predicting all possible base-pairings within a given sequence. As described above, a small but conserved region, preferably a UTR, is identified based on primary sequence comparison of a series of orthologs. In modified Compare, each of these sequences is compared to its own reverse complement. Allowable base-pairings include Watson-Crick A-U and G-C pairing and non-canonical G-U pairing. An overlay of such self complementarity plots of all available orthologs, and selection for the most repetitive pattern in each, results in a minimal number of possible folded configurations. These overlays can then be used in conjunction with additional constraints, including those imposed by energy considerations described above, to deduce the most likely secondary structure.
A result of the secondary structure analysis described above, whether performed by alignment and covariance, self complementarity analysis, secondary structure prediction, such as using M-fold, or otherwise, is the identification of secondary structure in other alternative transcript forms. Exemplary secondary structures that may be identified include, but are not limited to, bulges, loops, stems, hairpins, knots, triple interacts, cloverleafs, or helices, or a combination thereof. Alternatively, new secondary structures may be identified.
In another embodiment of the invention, once the secondary structure of the conserved region has been identified, as described above, at least one structural motif molecular interaction site is identified. These structural motifs correspond to the identified secondary structures described above. For example, analysis of secondary structure by self complementarity comparison may provide one type of secondary structure, whereas analysis by M-fold may provide another secondary structure. All the possible secondary structures identified by the secondary structure analysis described above are, thus, represented by a family of structural motifs.
Once the secondary structure(s) of the target nucleic acids, as well as the secondary structures of nucleic acids from different taxonomic species, have been identified, further alternative transcript forms of mRNAs can be identified by searching on the basis of structure, rather than by primary nucleotide sequence, as described above. Additional alternative transcript forms which have secondary structure similar or identical to the secondary structure found as described above can be identified by constructing a family of descriptor elements for the structural motifs described above, and identifying other nucleic acids having secondary structures corresponding to the descriptor elements. The combination of any or all of the nucleic acids having secondary structure can be compiled into a database. The entire process can be repeated with a different target nucleic acid to generate a plurality of different secondary structure groups which can be compiled into the database. Thus, databases of molecular interaction sites can be compiled by practicing the invention described herein.
After the hypothetical structure motifs are determined from the secondary structure analysis described above, a family of structure descriptor elements is constructed, as described in U.S. Serial No. 09/076,440, which is incorporated herein by reference in its entirety. Preferably, the structural motifs described above are converted into a family of descriptor elements. One skilled in the art is familiar with construction of descriptors. Structure descriptors are described in, for example, Laferriere, et al., Comput. Appl. Biosci., 1994, 10, 211-212, incorporated herein by reference in its entirety. A different structure descriptor element is constructed for each of the structural motifs identified from the secondary structure analysis. Briefly, the secondary structure is converted to a generic text string. For novel motifs, further biochemical analysis such as chemical mapping or mutagenesis may be needed to confirm structure predictions. Descriptor elements may be defined to have various stringency. In addition, the descriptor elements can be defined to allow for a wobble. Thus, descriptor elements can be defined to have any level of stringency desired by the user. After a family of structure descriptor elements is constructed, nucleic acids having secondary structure which corresponds to the structure descriptor elements are identified. Preferably, nucleic acids having secondary structure which corresponds to the structure descriptor elements are identified by searching at least one database, performing clustering and analysis, identifying orthologs, or a combination thereof. Thus, the identified alternative transcript forms have secondary structure which falls within the scope of the secondary structure defined by the descriptor elements, and which is identical or nearly identical, depending on the stringency of the descriptor elements, to that of the alternative transcript forms previously identified.
In one embodiment of the invention, nucleic acids having secondary structure which correspond to the structure descriptor elements are identified by searching at least one database. Any genetic database can be searched. Preferably, the database is a UTR database, which is a compilation of the untranslated regions in messenger RNAs. A UTR database is accessible through the Internet at, for example, ftp://area.ba.cnr.it/pub/embnet/database/utr/. Preferably the database is searched using a computer program, such as, for example, Rnamot, a UNIX-based motif searching tool available from Daniel Gautheret. Each "new" sequence that has the same motif is then queried against public domain databases to identify additional sequences. Results are analyzed for recurrence of pattern in UTRs of these additional ortholog sequences, as described below, and a database of RNA secondary structures is built. One skilled in the art is familiar with Rnamot. Briefly, Rnamot takes a descriptor string and searches any Fasta format database for possible matches. Descriptors can be very specific, to match exact nucleotide(s), or can have built-in degeneracy. Lengths of the stem and loop can also be specified. Single stranded loop regions can have a variable length. G-U pairings are allowed and can be specified as a wobble parameter. Allowable mismatches can also be included in the descriptor definition. Functional significance is assigned to the motifs if their biological role is known based on previous analysis.
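A descriptor-based search of the kind described above can be sketched in Python. The sketch does not use Rnamot or its descriptor syntax, which is not reproduced in this document; the `HairpinDescriptor` class, its field names, and the `search` function are hypothetical stand-ins that model only a single stem-loop with a fixed stem length, a variable loop length, an optional wobble parameter, and a number of allowable mismatches.

```python
from dataclasses import dataclass

_WATSON_CRICK = {"AU", "UA", "GC", "CG"}
_WOBBLE = {"GU", "UG"}


@dataclass
class HairpinDescriptor:
    """Hypothetical descriptor for one stem-loop motif."""
    stem_len: int             # number of base pairs in the stem
    loop_min: int             # shortest allowed single-stranded loop
    loop_max: int             # longest allowed single-stranded loop
    allow_wobble: bool = True
    max_mismatches: int = 0


def search(descriptor, sequences):
    """Return (name, start, loop_len) for every window matching the descriptor.
    `sequences` maps a sequence name to its nucleotide string."""
    allowed = _WATSON_CRICK | _WOBBLE if descriptor.allow_wobble else _WATSON_CRICK
    hits = []
    for name, seq in sequences.items():
        seq = seq.upper().replace("T", "U")
        for start in range(len(seq)):
            for loop_len in range(descriptor.loop_min, descriptor.loop_max + 1):
                end = start + 2 * descriptor.stem_len + loop_len  # exclusive
                if end > len(seq):
                    break
                # Count stem positions whose two bases cannot pair.
                mismatches = sum(
                    seq[start + k] + seq[end - 1 - k] not in allowed
                    for k in range(descriptor.stem_len)
                )
                if mismatches <= descriptor.max_mismatches:
                    hits.append((name, start, loop_len))
    return hits


if __name__ == "__main__":
    database = {"hp": "AAGGGAAAACCCAA", "none": "AAAAAAAAAAAA"}
    motif = HairpinDescriptor(stem_len=3, loop_min=4, loop_max=4)
    print(search(motif, database))
```

Loosening `max_mismatches`, widening the loop range, or enabling `allow_wobble` lowers the stringency of the descriptor, which is the trade-off the paragraph above describes.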
In another embodiment of the invention, the nucleic acids identified by searching databases such as, for example, searching a UTR database using Rnamot, are clustered and analyzed so as to determine their location within the genome. The results provided by Rnamot simply identify sequences containing the secondary structure but do not give any indication as to the location of the sequence in the genome. Clustering and analysis is preferably performed with ClustalW, as described above.
In another embodiment of the invention, after clustering and analysis is performed as described above, orthologs are identified as described above. However, in contrast to the orthologs identified above, which were identified solely on the basis of their primary nucleotide sequences, these new orthologous sequences are identified on the basis of structure using the nucleic acids identified using Rnamot. Identification of orthologs is preferably performed by BlastParse or Q-Compare, as described above. In embodiments of the invention in which a database containing prokaryotic molecular interaction sites is compiled, it is preferable to refrain from finding human orthologs or, alternatively, to discard human orthologs when found.
Once the molecular interaction site of an alternative transcript form which is associated with a disease state or biological condition is identified, the nucleic acid sequence from said molecular interaction site is ascertained by routine methodology. The nucleic acid sequences, in turn, can be used to design targeting biomolecules, such as, for example, oligonucleotides, peptide nucleic acid molecules, ribozymes, and small molecules, which interact with the molecular interaction site. The methods of the invention further include contacting the nucleic acid sequence with biomolecules, such as, for example, an oligonucleotide or small molecule. The biomolecules preferably comprise toxin molecules. While there are a number of ways to prepare biomolecules comprising toxins, preferred methodologies are described in a U.S. patent application filed on even date herewith and assigned to the assignee of this invention. This application bears U.S. Serial No. 09/200,107, filed November 25, 1998, which is incorporated by reference herein in its entirety. The present application also incorporates by reference in their entirety the following U.S. applications: Serial No. 09/200,355, filed November 25, 1998, and Serial No. 60/110,024, filed November 25, 1998.
The following examples are meant to be exemplary of embodiments of the invention and are not meant to be limiting.
EXAMPLES
Example 1: Molecular target in RNA formed from alternative initiation and splicing of the mdm2 oncogene
The mdm2 oncogene has been associated with a variety of human cancers. The protein encoded by mdm2 physically binds to the anti-oncogene p53 protein and interferes with its function as a tumor suppressor. The net result of suppression of a tumor suppressor is tumorigenesis. It was recently discovered that many tumor cells have greatly increased levels of mdm2 protein without a proportionate increase in mdm2 mRNA levels, suggesting that regulation of protein levels occurs downstream of transcription. It was discovered that cancer cells contain a form of the mdm2 mRNA that is different in the 5'-untranslated region. Both the normal and cancer-specific forms of the transcript encode an identical protein, since the heterogeneity is found upstream of the initiation of translation on the message. The cancer-specific mdm2 RNA was found to contain three classes of unique structures, shown in the box on the lower left side of the illustration. The first structure, labeled "unique exon structure" in Fig. 2, derives from unique sequences in Exon 1 that are not included in the mdm2 transcript found in normal cells. This structure contains two unique internal loops separated by a stack of 5 base pairs and adjacent to a cytosine-rich stem loop. Analysis of all mRNA transcripts in the current release of GenBank reveals that this structure is unique to the cancer-specific mdm2 transcript.
The second unique structure, found 3' to the first structure, is shown in red and blue. It is composed of mRNA originating from Exons 1 and 3, which are uniquely found adjacent to each other in the cancer-specific form. This structure, which is also unique, can only exist where these exons are spliced together because it contains parts of each.
The right hand structure in the box is derived exclusively from mRNA that is from Exon 3. This structure could potentially exist in both the cancer and normal forms of the message. However, in the normal form, this RNA is part of a different structure which is disfavored in the Exon 1/Exon 3 junction form.
Example 2: The HER2/neu receptor in carcinoma cells
The HER2 proto-oncogene encodes a protein that binds to the membrane of the cell and transduces signals through a tyrosine kinase activity. This protein has clearly demonstrated association with breast cancer. A product that targets this protein with a monoclonal antibody (Herceptin) has recently been approved for use by the FDA for the treatment of breast cancer.
It is known that the HER2 receptor mRNA exists in at least two forms (Scott et al., Mol. Cell. Biol., 1993, 13, 2247-2257, which is incorporated herein by reference in its entirety). The two transcript forms are generated from alternative use of a splice site located 2050 nucleotides downstream from the start of the mRNA. In some cases the splice site is used to generate a transcript greater than 4,000 nucleotides. At other times, the splice site is not used. When it is not used, a polyadenylation site downstream of the splice junction triggers termination and polyadenylation of the mRNA. An in-frame stop codon is then used to terminate the protein during translation. The truncated form of the protein contains the extracellular domain of the normal protein without the membrane anchor domain, which results in a secreted rather than a cell-associated protein. Transfection studies have shown that the truncated form of HER2 produces a protein that is released from the cell and results in resistance to the growth-inhibiting effects of the monoclonal antibody used in cancer treatment. Thus, cells producing the truncated form of the mRNA are undesired because they may play a role in resistance to an otherwise useful drug. The truncated form of the transcript contains unique structures not found in the normal form (see Fig. 3). The structure on the left is a portion of the normal form and the truncated form is on the right. The arrow indicates the location of the divergence between the two forms. Helical structures 1 and 2 are common to both transcript forms. Helix 4 in each form is composed of RNA that is in part common to both forms and in part unique to that form. Thus, helix 4 in the truncated form is a unique target for proximity trigger technology, as are helices 5, 6 and 7, which are unique to the truncated form. Helix 3 is another example of a structure that is composed of RNA sequence that is common to both forms of the RNA, but still different in shape as a result of the sequences around it. It is also a useful target for proximity trigger technology.

Claims

What is claimed is:
1. A method of identifying a target nucleic acid sequence, said nucleic acid sequence being predictive of a preselected disease state or biological condition in cells containing the nucleic acid sequence, comprising: comparing members of a set of mRNA molecules from a common gene, but containing different sequences and structures, said gene being predictive of said disease state or biological condition in cells containing the gene; identifying at least one molecular interaction site from among those present in said members of the set, said molecular interaction site being present in cells likely to have said disease state or biological condition; and ascertaining a nucleic acid sequence from said molecular interaction site.
2. The method of claim 1 wherein said molecular interaction site is common among a plurality of said members.
3. The method of claim 1 wherein said gene is vestigial.
4. The method of claim 1 wherein said gene codes for no protein essential for maintenance of the cells or of the disease state or condition.
5. The method of claim 1 further comprising contacting said nucleic acid sequence with an oligonucleotide or small molecule.
6. The method of claim 5 wherein said oligonucleotide gives rise to a cell killing event in cells containing said target nucleic acid sequence.
7. The method of claim 1 wherein said disease state is a hyperproliferative condition.
8. The method of claim 1 wherein said disease state is neoplastic.
9. The method of claim 1 wherein said disease state is a cancerous state.
10. The method of claim 1 wherein said disease state is lupus erythematosus.
11. The method of claim 1 wherein said disease state is psoriasis.
12. The method of claim 1 wherein said set of members of mRNA molecules includes molecules from non-human animals.
PCT/US1999/027710 1998-11-25 1999-11-22 Identification of disease predictive nucleic acids WO2000031110A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU17426/00A AU1742600A (en) 1998-11-25 1999-11-22 Identification of disease predictive nucleic acids

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US11002498P 1998-11-25 1998-11-25
US09/200,355 US6451524B1 (en) 1998-11-25 1998-11-25 Identification of disease predictive nucleic acids
US60/110,024 1998-11-25
US09/200,355 1998-11-25

Publications (1)

Publication Number Publication Date
WO2000031110A1 true WO2000031110A1 (en) 2000-06-02

Family

ID=26807630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/027710 WO2000031110A1 (en) 1998-11-25 1999-11-22 Identification of disease predictive nucleic acids

Country Status (2)

Country Link
AU (1) AU1742600A (en)
WO (1) WO2000031110A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5885834A (en) * 1996-09-30 1999-03-23 Epstein; Paul M. Antisense oligodeoxynucleotide against phosphodiesterase
US5977311A (en) * 1997-09-23 1999-11-02 Curagen Corporation 53BP2 complexes

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BALVAY ET AL.: "Pre-mRNA Secondary Structure and the Regulation of Splicing", BIOESSAYS, vol. 15, no. 3, March 1993 (1993-03-01), pages 165 - 169, XP002923216 *
LANDERS ET AL.: "Translational Enhancement of mdm2 Oncogene Expression in Human Tumor Cells Containing a Stabilized Wild-Type p53 Protein", CANCER RESEARCH, vol. 57, 15 August 1997 (1997-08-15), pages 3562 - 3568, XP002923213 *
MULLER-PILLASCH ET AL.: "Cloning of Novel Transcripts of the Human Guanine-Nucleotide-Exchange Factor Mss4: In Situ Chromosomal Mapping and Expression in Pancreatic Cancer", GENOMICS, vol. 46, 1997, pages 389 - 396, XP002923215 *
SCOTT ET AL.: "A truncated Intracellular HER2/neu Receptor Produced by Alternative RNA Processing Affects Growth of Human Carcinoma Cells", MOLECULAR AND CELLULAR BIOLOGY, vol. 13, no. 4, April 1993 (1993-04-01), pages 2247 - 2257, XP002923214 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9695211B2 (en) 2008-12-02 2017-07-04 Wave Life Sciences Japan, Inc. Method for the synthesis of phosphorus atom modified nucleic acids
US10329318B2 (en) 2008-12-02 2019-06-25 Wave Life Sciences Ltd. Method for the synthesis of phosphorus atom modified nucleic acids
US9394333B2 (en) 2008-12-02 2016-07-19 Wave Life Sciences Japan Method for the synthesis of phosphorus atom modified nucleic acids
US10307434B2 (en) 2009-07-06 2019-06-04 Wave Life Sciences Ltd. Nucleic acid prodrugs and methods of use thereof
US9744183B2 (en) 2009-07-06 2017-08-29 Wave Life Sciences Ltd. Nucleic acid prodrugs and methods of use thereof
US10428019B2 (en) 2010-09-24 2019-10-01 Wave Life Sciences Ltd. Chiral auxiliaries
US10280192B2 (en) 2011-07-19 2019-05-07 Wave Life Sciences Ltd. Methods for the synthesis of functionalized nucleic acids
US9605019B2 (en) 2011-07-19 2017-03-28 Wave Life Sciences Ltd. Methods for the synthesis of functionalized nucleic acids
US9598458B2 (en) 2012-07-13 2017-03-21 Wave Life Sciences Japan, Inc. Asymmetric auxiliary group
US9982257B2 (en) 2012-07-13 2018-05-29 Wave Life Sciences Ltd. Chiral control
US9617547B2 (en) 2012-07-13 2017-04-11 Shin Nippon Biomedical Laboratories, Ltd. Chiral nucleic acid adjuvant
US10590413B2 (en) 2012-07-13 2020-03-17 Wave Life Sciences Ltd. Chiral control
US10167309B2 (en) 2012-07-13 2019-01-01 Wave Life Sciences Ltd. Asymmetric auxiliary group
US10144933B2 (en) 2014-01-15 2018-12-04 Shin Nippon Biomedical Laboratories, Ltd. Chiral nucleic acid adjuvant having immunity induction activity, and immunity induction activator
US10322173B2 (en) 2014-01-15 2019-06-18 Shin Nippon Biomedical Laboratories, Ltd. Chiral nucleic acid adjuvant having anti-allergic activity, and anti-allergic agent
US10149905B2 (en) 2014-01-15 2018-12-11 Shin Nippon Biomedical Laboratories, Ltd. Chiral nucleic acid adjuvant having antitumor effect and antitumor agent
US10160969B2 (en) 2014-01-16 2018-12-25 Wave Life Sciences Ltd. Chiral design

Also Published As

Publication number Publication date
AU1742600A (en) 2000-06-13

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref country code: AU

Ref document number: 2000 17426

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase