WO2008097632A2 - Methods for determining splice variant types and amounts - Google Patents

Methods for determining splice variant types and amounts

Info

Publication number
WO2008097632A2
Authority
WO
WIPO (PCT)
Prior art keywords
exon
gene
splicing
indicator
alternatively
Prior art date
Application number
PCT/US2008/001682
Other languages
French (fr)
Other versions
WO2008097632A3 (en)
WO2008097632A9 (en)
Inventor
Jonathan Bingham
Subha Srinivasan
Original Assignee
Jivan Biologics, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jivan Biologics, Inc.
Publication of WO2008097632A2
Publication of WO2008097632A3
Publication of WO2008097632A9


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6816 Hybridisation assays characterised by the detection means

Definitions

  • the methods and kits of parts relate to the fields of gene expression and to microarray based methods for measuring gene expression, particularly the expression of splice-variants.
  • Microarrays capable of detecting splice variants may comprise indicator polynucleotides that indicate exons, exon-exon junctions, introns, modules, intron-exon junctions, exon-intron junctions or module-module junctions of a gene.
  • a microarray experiment determines expression levels for indicator polynucleotides in one or more samples.
  • An expression level comprises one or more numerical values for an indicator polynucleotide in a sample.
  • the relative expression levels of splice variants of a gene in a single sample, and changes in the relative expression levels of splice variants of a gene between and across samples, can yield meaningful insights into splicing regulation that may have biological function specific to disease state, tissue, intracellular localization, population, individual, drug treatment, etc.
  • H-DBAS: alternative splicing database of completely sequenced and manually annotated full-length cDNAs based on H-Invitational. Nucleic Acids Res., 2007. 35(Database issue): p. D104-9.
  • a hybridization method for measuring the levels of alternatively-spliced forms of a gene comprising:
  • the mutually exclusive indicator polynucleotides are non-overlapping. In some embodiments, the mutually exclusive indicator polynucleotides are overlapping.
  • At least one mutually exclusive indicator polynucleotide corresponds to a polynucleotide that is constitutively present in alternatively spliced forms of the gene. In some embodiments, at least one mutually exclusive indicator polynucleotide corresponds to a polynucleotide that is not constitutively present in alternatively spliced forms of the gene.
  • In some embodiments, an overall level of expression of alternatively-spliced forms of a gene is calculated by summing the amount of hybridization signal corresponding to the relative amounts of hybridization to each of the mutually exclusive indicator polynucleotides.
  • In some embodiments, the overall level of gene expression (G) is calculated using the equations:
  • G is the overall gene expression level; each I is an alternatively-spliced form of the gene; n is the number of indicator polynucleotides; and each B is the sum of expression levels of all alternatively-spliced forms of the gene; wherein when at least one indicator polynucleotide corresponds to an exon or intron, $B_{\text{exon}}$ is equal to the sum $p + p_c$, wherein $p$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to the exon or intron, and $p_c$ is the sum of the amounts of hybridization from each indicator polynucleotide, and when at least one indicator polynucleotide corresponds to an exon-exon or exon-intron junction, $B_{\text{junc}}$ is $\sqrt{(p5 + p5_c)\,(p3 + p3_c)}$, wherein p
  • At least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an intron. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to a module. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon-exon junction. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon-intron junction.
  • At least one of the two or more mutually exclusive indicator polynucleotides corresponds to an intron-exon junction. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to a module-module junction.
  • the indicator polynucleotides are in a microarray.
  • the alternatively-spliced forms of a gene are mRNAs, and the indicator polynucleotides are complementary to the mRNAs. In some embodiments, the alternatively-spliced forms of a gene are cDNAs, and the indicator polynucleotides are complementary to the cDNAs. In another aspect, software for performing the calculations described herein is provided.
  • the software is for determining the amounts of different gene splice variants using data obtained in a microarray having two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences of the gene, selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene, wherein the software applies a mathematical algorithm to calculate the relative expression levels of different gene splice variants.
  • the software performs at least one calculation selected from calculations (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13), (14), and (15), below, or any combination or subset thereof.
  • a kit of parts for measuring the levels of alternatively-spliced forms of a gene comprising:
  • the indicator polynucleotides are in a microarray.
  • the mathematical algorithms are provided in an executable computer application.
  • a nucleic acid probe set for use in, e.g., a nucleic acid microarray format.
  • the probe set comprises two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene.
  • Figure 1 shows a series of scatter plots depicting splice variants with a minimum signal of 200 and a "Splice Fold" (i.e., linearized Splice Ratio) score of > 2 (>99.9% confidence for all splice types).
  • Plots on the left i.e., plots A, C, E, and G
  • plots on the right i.e., B, D, F, and H
  • FIG. 1 is a graph showing the distribution of differentially expressed splice types. The splice types and relative amounts of each splice type are indicated. These amounts were determined using the SEHS, Splicing Index, ASPIRE, and Splice Ratio methods, described.
  • the methods and kits are for quantifying overall gene expression levels (or transcriptional activity), or the relative levels of particular splice variants, using a plurality of indicator polynucleotides that correspond to the polynucleotide sequences of exons, introns, exon-exon junctions, exon-intron junctions, intron-exon junctions, modules, or module-module junctions of a gene.
  • An exon, exon-exon junction, module, exon-intron junction, intron-exon junction or module-module junction is a "constitutive" polynucleotide if all expected splice variants comprise that exon, exon-exon junction, module, exon-intron junction, intron-exon junction or module-module junction. Otherwise, it is "non-constitutive".
  • a first indicator polynucleotide is considered "mutually exclusive" with a second indicator polynucleotide if the first indicator polynucleotide indicates a first exon, module or junction of a first splice variant and the second indicates a second exon, module or junction of a second splice variant, wherein the first splice variant does not comprise the second exon, module or junction and the second splice variant does not comprise the first exon, module or junction.
  • a pair of mutually exclusive indicator polynucleotides comprises a first indicator polynucleotide and a second indicator polynucleotide.
  • the pair is "overlapping" if the first indicator polynucleotide comprises a polynucleotide sequence comprised by the second indicator polynucleotide.
  • the pair is "non-overlapping" if the first does not comprise a polynucleotide sequence comprised by the second. This will be clear to one of ordinary skill in the art from the following examples.
  • a polynucleotide sequence is preferably at least 1, at least 5, at least 10, or even at least 100 nucleotides.
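The definitions of "constitutive" and "mutually exclusive" above can be sketched in code. This is a minimal illustration, not the patent's implementation: it assumes each splice variant is modelled as a set of feature labels (exons and junctions), and it reads "mutually exclusive" as "each feature occurs in some variant, and no variant contains both". The labels (`E1`, `J1-3`, ...) and function names are hypothetical.

```python
from typing import FrozenSet, Sequence

Variant = FrozenSet[str]  # a splice variant as a set of feature labels (assumption)


def is_constitutive(feature: str, variants: Sequence[Variant]) -> bool:
    """True if every expected splice variant comprises the feature."""
    return all(feature in v for v in variants)


def is_mutually_exclusive(a: str, b: str, variants: Sequence[Variant]) -> bool:
    """True if a and b each occur in some variant but never in the same one."""
    has_a = any(a in v for v in variants)
    has_b = any(b in v for v in variants)
    both = any(a in v and b in v for v in variants)
    return has_a and has_b and not both


# Hypothetical three-exon gene with an exon skip:
# variant 1 keeps exon 2, variant 2 joins exon 1 directly to exon 3.
variants = [
    frozenset({"E1", "E2", "E3", "J1-2", "J2-3"}),
    frozenset({"E1", "E3", "J1-3"}),
]

print(is_constitutive("E1", variants))                 # exon 1 is in both variants
print(is_mutually_exclusive("J1-2", "J1-3", variants)) # junction pair
print(is_mutually_exclusive("E2", "J1-3", variants))   # exon vs. skipping junction
```

Whether a mutually exclusive pair is "overlapping" depends on shared probe sequence, which this set-based sketch does not model.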
  • Examples of mutually exclusive indicator polynucleotides: Example 1
  • a gene comprises at least a first exon, an intron, and a second exon, and the second exon has a short form and a long form that differ at the 5' end.
  • the gene has a first splice variant that comprises the first exon and the long form of the second exon, and a second splice variant that comprises the first exon and the short form of the second exon.
  • An indicator polynucleotide for the exon-exon junction between the first exon and the long form of the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the first exon and the short form of the second exon.
  • An indicator polynucleotide for the module at the 5' end of the second exon that is part of the long form but not the short form is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the first exon and the short form of the second exon.
  • An indicator polynucleotide for the intron-exon junction between the intron and the short form of the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the first exon and the short form of the second exon.
  • a gene comprises at least a first exon, an intron, and a second exon, and the first exon has a short form and a long form that differ at the 3' end.
  • the gene has a first splice variant that comprises the long form of the first exon and the second exon, and a second splice variant that comprises the short form of the first exon and the second exon.
  • An indicator polynucleotide for the exon-exon junction between the long form of the first exon and the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the short form of the first exon and the second exon.
  • An indicator polynucleotide for the module at the 3' end of the first exon that is part of the long form but not the short form is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the short form of the first exon and the second exon.
  • An indicator polynucleotide for the exon-intron junction between the short form of the first exon and the intron is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the short form of the first exon and the second exon.
  • a gene comprises at least a first exon, a second exon and a third exon.
  • a first splice variant comprises the first exon, the second exon and the third exon.
  • a second splice variant comprises the first exon and the third exon.
  • An indicator polynucleotide for the junction between the first exon and the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the junction between the first exon and the third exon.
  • An indicator polynucleotide for the junction between the second exon and the third exon is mutually exclusive with an indicator polynucleotide for the junction between the first exon and the third exon.
  • An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the second exon.
  • a gene comprises at least a first exon, a second exon, a third exon and a fourth exon.
  • a first splice variant comprises the first exon and the fourth exon.
  • a second splice variant comprises the first exon, the second exon, the third exon and the fourth exon.
  • An indicator polynucleotide for the junction between the first exon and the fourth exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the junction between the second exon and the third exon.
  • a gene comprises at least a first exon, a second exon, a third exon and a fourth exon.
  • a first splice variant comprises the first exon and the third exon.
  • a second splice variant comprises the second exon and the third exon.
  • An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the junction between the second exon and the fourth exon.
  • An indicator polynucleotide for the second exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the third exon if the gene comprises no splice variant that comprises both the second exon and the third exon.
  • a gene comprises at least a first exon, a second exon and a third exon.
  • a first splice variant comprises the first exon and the third exon and starts with the first exon.
  • a second splice variant comprises the second exon and the third exon and starts with the second exon.
  • An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (overlapping) with an indicator polynucleotide for the junction between the second exon and the third exon.
  • An indicator polynucleotide for the first exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the second exon if the gene comprises no splice variant that comprises both the first exon and the second exon.
  • a gene comprises at least a first exon, a second exon and a third exon.
  • a first splice variant comprises the first exon and the third exon and ends with the third exon.
  • a second splice variant comprises the first exon and the second exon and ends with the second exon.
  • An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (overlapping) with an indicator polynucleotide for the junction between the first exon and the second exon.
  • An indicator polynucleotide for the second exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the third exon if the gene comprises no splice variant that comprises both the first exon and the second exon.
  • the gene expression technique applies to two or more indicator polynucleotides for exons, introns, modules, exon-exon junction, exon-intron junctions, intron-exon junctions, or module-module junctions of a gene.
  • the gene expression technique applies to a plurality of genes, such as a set of genes for a gene/protein family, a pathway, or an organism's entire genome. It applies to a plurality of splice variants of genes. It applies to one or more samples. Exemplary samples are cell lines, tissue samples, patient samples, pools of samples, and computational pools.
  • the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate constitutive polynucleotides of a gene.
  • the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate non-constitutive polynucleotides of a gene.
  • the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate mutually exclusive (overlapping) polynucleotides of a gene.
  • the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate mutually exclusive (non-overlapping) polynucleotides of a gene. In other embodiments, the gene expression technique applies a mathematical algorithm to expression levels for indicator polynucleotides that indicate constitutive, non-constitutive, or mutually exclusive (overlapping or non-overlapping) polynucleotides of a gene.
  • the gene expression technique applies a mathematical algorithm to expression levels of indicator polynucleotides for a gene. In some embodiments, the gene expression technique determines a gene expression level in a sample by summing (i.e., adding) the expression levels of mutually exclusive indicator polynucleotides. In another embodiment, the gene expression technique determines a gene expression level in a sample by summing the background-subtracted expression levels of mutually exclusive indicator polynucleotides.
  • the gene expression technique can assign a gene expression level to the gene in Example 1 (above) by averaging (or log averaging, or taking the geometric mean of, or taking a median of, or otherwise applying a mathematical algorithm to) one or more of the following expression levels: (1) an expression level for an indicator polynucleotide for the first exon; (2) the sum of expression levels for the mutually exclusive indicator polynucleotides in (a); (3) the sum of expression levels for the mutually exclusive indicator polynucleotides in (b); (4) the sum of expression levels for the mutually exclusive indicator polynucleotides in (c); and (5) an expression level for an indicator polynucleotide for the second exon.
  • the gene expression technique may apply a mathematical algorithm to the expression levels in (1) and (2), or (2) only, or (2) and (3), etc.
  • the gene expression technique sums background-subtracted expression levels in (2) and (3).
  • the gene expression technique considers background readings for the indicator polynucleotides. In an embodiment, it adds a mathematical function of the background readings, such as a sum or log average or geometric mean, to the sum of background subtracted expression levels for the indicator polynucleotides.
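The summing and averaging steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: the signal and background values are made up for illustration, the function names are hypothetical, and clamping a background-subtracted level at zero is an assumption the text does not state.

```python
from statistics import geometric_mean, mean, median
from typing import Callable, Optional, Sequence


def summed_level(levels: Sequence[float],
                 backgrounds: Optional[Sequence[float]] = None) -> float:
    """Sum the (optionally background-subtracted) expression levels of a
    group of mutually exclusive indicator polynucleotides."""
    if backgrounds is None:
        backgrounds = [0.0] * len(levels)
    if len(backgrounds) != len(levels):
        raise ValueError("need one background value per expression level")
    # Assumption: negative values after subtraction are clamped to zero.
    return sum(max(p - k, 0.0) for p, k in zip(levels, backgrounds))


def gene_level(estimates: Sequence[float],
               combine: Callable[[Sequence[float]], float] = mean) -> float:
    """Combine several per-region estimates into one gene expression level."""
    return combine(estimates)


# Hypothetical signals for a gene like the one in Example 1.
estimates = [
    1000.0,                                        # probe for the constitutive first exon
    summed_level([600.0, 300.0]),                  # one mutually exclusive pair
    summed_level([700.0, 350.0], [50.0, 50.0]),    # another pair, background-subtracted
]

print(gene_level(estimates))                  # arithmetic mean
print(gene_level(estimates, median))          # median
print(gene_level(estimates, geometric_mean))  # geometric mean
```

Passing the combining function as a parameter mirrors the text's list of alternatives (average, geometric mean, median, or another algorithm) without committing to one of them.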
  • a gene comprises three polynucleotide splice modules, a first module, a second module and a third module.
  • the three modules are exons.
  • the first module is an exon
  • the second module is an intron
  • the third module is an exon.
  • the first module is an exon
  • the second module is an extension of an exon
  • the third module is an exon.
  • the gene has a first splice isoform comprising the first module, the second module and the third module.
  • the gene has a second splice isoform comprising the first module and the third module.
  • indicator polynucleotides: an indicator polynucleotide for the first module (M1); an indicator polynucleotide for the junction between the first module and the second module (J1-2); an indicator polynucleotide for the second module (M2); an indicator polynucleotide for the junction between the second module and the third module (J2-3); an indicator polynucleotide for the junction between the first module and the third module (J1-3); and an indicator polynucleotide for the third module (M3).
  • a nucleotide array comprises a set of the indicator polynucleotides for the gene.
  • the nucleotide array comprises the indicator polynucleotides for the three junctions and for the three modules. In another embodiment, the nucleotide array comprises the indicator polynucleotides for the three junctions. In another embodiment, the nucleotide array comprises the indicator polynucleotide for the junction between the first module and the second module, and the indicator polynucleotide for the junction between the first module and the third module. In another embodiment, the nucleotide array comprises the indicator polynucleotide for the junction between the first module and the third module, and the indicator polynucleotide for the junction between the second module and the third module.
  • the nucleotide array comprises a combination or subset of the six types of indicator polynucleotides. In an embodiment, the nucleotide array comprises multiple indicator polynucleotides for a junction between a first module and a second module. In an embodiment, the nucleotide array comprises multiple indicator polynucleotides for a module.
  • the gene expression method calculates a gene expression level G using the equations:
  • $B_{\text{junc}} = \sqrt{(p5 + p5_c)\,(p3 + p3_c)}$ (3)
  • G is the gene expression level
  • each I is a splice isoform
  • each B is the sum of expression levels of all splice isoforms at a base range targeted by one or more probes
  • n is the number of probes for the gene
  • $B_{\text{exon}}$ covers the case of an exon or intron probe, with $p$ equal to a probe signal and $p_c$ equal to the sum of signals of probes exclusive with (or complementary to) $p$
  • $B_{\text{junc}}$ addresses the case of an exon-exon or exon-intron junction probe, with $p5$ being the 5' portion of the junction and $p3$ being the 3' portion, each portion having its own complement.
  • p5 is the 5' portion of an exon-exon or exon-intron junction probe p.
  • p3 is the 3' portion of an exon-exon or exon-intron junction probe p.
  • $p_c$ is the "complement", meaning the set of probes that are complementary to $p$.
  • a complementary probe $p_c$ is mutually exclusive with a probe $p$, meaning that the same splice isoform cannot contain both the polynucleotide indicated by $p$ and the polynucleotide indicated by $p_c$.
  • $p5_c$ is the complement of the 5' portion of $p$
  • $p3_c$ is the complement of the 3' portion of $p$.
  • Consider a first splice isoform containing exon-exon junction J1-3 and a second splice isoform containing J2-3.
  • For the first splice isoform, $p5$ is the signal for the portion of J1-3 for exon 1;
  • $p5_c$ is the complementary signal from the second splice isoform, J2-3.
  • For the second splice isoform, $p5$ is the signal for the portion of J2-3 for exon 2;
  • $p5_c$ is the complementary signal from J1-3.
  • the two splice isoforms have mutually exclusive splicing patterns, so each is a complement of the other.
  • the two probes share the same 3' portion, so in both cases, p3 is the signal for the portion that detects exon 3.
  • J2-3 is the complement.
  • J1-3 is the complement.
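The two per-probe quantities defined above, the exon/intron case ($p + p_c$) and the junction case (equation (3)), can be written as small functions. This is a sketch only: the function names and the numeric signals are hypothetical, and the text does not say how the 5' and 3' portions of a junction probe signal are obtained, so they are simply passed in as arguments here. The step that combines the B values into G is not reproduced.

```python
import math
from typing import Sequence


def b_exon(p: float, complements: Sequence[float]) -> float:
    """Exon or intron probe: the probe signal plus the summed signals of
    the probes that are mutually exclusive with it (its complement)."""
    return p + sum(complements)


def b_junc(p5: float, p5_complements: Sequence[float],
           p3: float, p3_complements: Sequence[float]) -> float:
    """Junction probe, per equation (3): geometric mean of the
    complement-completed 5' and 3' portions."""
    five_prime = p5 + sum(p5_complements)
    three_prime = p3 + sum(p3_complements)
    return math.sqrt(five_prime * three_prime)


# Hypothetical signals: an exon probe of 300 whose complement reads 500,
# and a junction probe whose 5' side (400) has a complement of 400 while
# its 3' side (800) is shared by both isoforms and has no complement.
print(b_exon(300.0, [500.0]))
print(b_junc(400.0, [400.0], 800.0, []))
```

An empty complement list for the shared 3' portion is one reading of the J1-3 / J2-3 example, where both junction probes detect exon 3; other readings are possible.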
  • the gene expression method takes background levels into account.
  • each of the 6 indicator polynucleotides has a background level (k) equal to 100.
  • the background level can be subtracted from indicator polynucleotides when solving the equation. It can be subtracted from all indicator polynucleotides. Alternatively, it can be subtracted from the 'complementary' indicator polynucleotides in the equation. E.g.,
  • the gene expression method uses a different background level for each indicator polynucleotide.
  • the gene expression method calculates a gene expression level from the expression levels of the junction indicator polynucleotides:
  • the gene expression calculation method subtracts background levels when calculating a gene expression level from junction indicator polynucleotides.
  • the gene expression method calculates a gene expression level using exon-exon junction probes. In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes and exon-intron junction probes. In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes and exon probes. In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes, exon-intron junction probes and exon probes. In an embodiment, the gene expression method calculates a gene expression level using exon probes.
  • the gene expression method generalizes to cases where there are more than two splice variants. It applies to cases where there are 3 or more splice isoforms, or 5 or more splice isoforms, or 10 or more splice isoforms. It applies to cases where a splice isoform is predicted, or where multiple splice isoforms are predicted. It applies to cases where there are more than three modules, or more than 5 modules, or more than 10 modules, or more than 20 modules, or more than 30 modules.
  • expression level data from a database, to expression level data from a computer file, or to expression level data obtained from a storage medium, including any of the types mentioned in the provisional applications incorporated by reference, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc.
  • a storage medium may store a gene expression level derived using the methods above.
  • the storage medium may be any of the types of storage medium already mentioned, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc.
  • a storage medium may store a plurality of gene expression levels derived using the methods above.
  • the storage medium may comprise data for 10 or more gene expression levels, or 100 or more gene expression levels, or 1000 or more gene expression levels, or 10000 or more gene expression levels. It may comprise data for gene expression levels for one or more samples, or 2 or more samples, or 10 or more samples, or 20 or more samples, or 50 or more samples, or 100 or more samples, or 1000 or more samples, or 10000 or more samples.
  • Gene expression level data calculated using the methods above may be transmitted in various ways: over a network by TCP/IP, FTP, or SMTP, in an email attachment, by courier, by mail, or by copying onto a CD or DVD or memory device and transferring from one storage medium to another. It may be transferred in any of the ways mentioned in the patent applications incorporated by reference, or in any other way that may occur to one of skill in the art.
  • the data store and data transmission are aspects of the present invention that are useful in their own right.
  • a normalization method uses a gene expression level to normalize expression levels for indicator polynucleotides for a gene.
  • a normalization method applies a mathematical equation to normalize expression levels for indicator polynucleotides for a gene. In an embodiment, it applies an equation of the form:
  • $S = p \cdot M / G$ (4)
  • S is the gene-normalized signal of a probe in a sample
  • p is the probe signal
  • G is the gene expression level
  • M is a scalar, such as the geometric mean of G and the geometric mean of gene expression levels in the sample and at least one other sample.
  • M's value is a function of the gene expression level relative to the probe in one or more samples (the "local gene expression level"). In an embodiment, M's value is a function of the local gene expression level in the sample. In another embodiment, M's value is a function of the local gene expression level in two or more samples. In an embodiment, M's value is the average of the local gene expression level in two or more samples (e.g., if the local gene expression levels in two samples are 500 and 1000, M's value is the average, which is 750). In another embodiment, M equals the geometric mean of the local gene expression levels in two or more samples. In another embodiment, M equals the median of the local gene expression levels in two or more samples.
  • M equals the geometric median of the local gene expression levels in two or more samples. In another embodiment, M equals an arbitrary function of the local gene expression levels in two or more samples. For example, M might discard outliers and take a function, such as the geometric mean, of the remaining values. In an embodiment, G is calculated using the equation above.
  • G is calculated by combining proximal probes relative to the probe for which p is the signal, such as the flanking exon signals for an exon-exon junction probe for an exon skip, or the one flanking exon signal for an alternative first or last exon, or the flanking exon signals for a probe for a retained intron, or the flanking exon signals for a probe for an alternative acceptor or donor site.
  • a short splice form probe (an exon-exon junction probe for a skipped exon, a spliced intron, or an alternative donor or acceptor site that truncates an exon) has two flanking exons with probe signals of 200 and 250.
  • the local transcription level G could be calculated as F(exon1, exon2).
  • F takes the median.
  • it takes the geometric mean, $\sqrt{200 \cdot 250} \approx 223.6$.
  • it takes the geometric median.
  • it performs an arbitrary function on the flanking exon probes.
  • F is the identity function. In another embodiment, F is an arbitrary function.
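The choices for F listed above can be compared on the flanking-exon signals of 200 and 250 from the text. This is a minimal sketch: the function name `local_level` is hypothetical, and only the median, arithmetic mean and geometric mean are shown.

```python
from statistics import geometric_mean, mean, median
from typing import Callable, Sequence


def local_level(flanking: Sequence[float],
                f: Callable[[Sequence[float]], float] = geometric_mean) -> float:
    """Local transcription level G = F(flanking exon signals)."""
    return f(flanking)


flanking = [200.0, 250.0]  # flanking exon probe signals from the text

print(local_level(flanking, median))          # midpoint of the two values
print(local_level(flanking, mean))            # arithmetic mean
print(local_level(flanking, geometric_mean))  # square root of 200 * 250
```

With only two flanking probes the median and the arithmetic mean coincide; the choice of F matters more when a probe has many proximal neighbours or when outliers are present.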
  • a nucleotide array comprises a first indicator polynucleotide (p1) for a gene and a second indicator polynucleotide (p2) for the gene.
  • the first indicator polynucleotide has an expression level of 400 in a first sample and an expression level of 600 in a second sample.
  • the second indicator polynucleotide has an expression level of 800 in the first sample and 1200 in the second sample.
  • the gene has an expression level of 500 in the first sample and an expression level of 1000 in the second sample.
  • $S_1 = 400 \cdot \sqrt{500 \cdot 1000} / 500$
  • S 1 is equal in both samples
  • S 2 is equal in both samples.
  • the normalization adjusts the values based on gene expression level.
  • the remaining differences between the two samples may result from RNA splicing. In this case, there are no remaining differences between the two samples, hence there may not be a difference in RNA splicing.
  • a nucleotide array comprises a first indicator polynucleotide (p3) for a gene and a second indicator polynucleotide (p4) for the gene.
  • the first indicator polynucleotide has an expression level of 400 in a first sample and an expression level of 600 in a second sample.
  • the second indicator polynucleotide has an expression level of 1200 in the first sample and 800 in the second sample.
  • the gene has an expression level of 500 in the first sample and an expression level of 1000 in the second sample.
  • the normalization method normalizes a probe signal using a global or local gene expression level derived from more than two samples; e.g., instead of normalizing the two samples to each other with the value of $\sqrt{500 \cdot 1000}$, one could normalize them using another value of M, such as the geometric mean of gene expression levels in each of a plurality of samples. In another embodiment, the normalization method normalizes a probe signal in a first sample using a gene expression level from a second sample.
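The normalization can be sketched as follows, assuming the normalized signal takes the form S = p * M / G, where p is the probe signal, G is the gene expression level in the same sample, and M is the geometric mean of the gene expression levels across samples. That form is inferred from the worked expression 400 * sqrt(500 * 1000) / 500; the function name `normalize` is hypothetical. The inputs are the p3 and p4 signals from the second example.

```python
from statistics import geometric_mean


def normalize(probe_signal, gene_level, gene_levels_all_samples):
    """Return S = p * M / G, with M the geometric mean of gene levels.

    Assumed form of the normalization; M could be any other function
    of the gene expression levels in two or more samples.
    """
    m = geometric_mean(gene_levels_all_samples)
    return probe_signal * m / gene_level


gene = [500, 1000]  # gene expression level in samples 1 and 2
p3 = [400, 600]     # first indicator polynucleotide, samples 1 and 2
p4 = [1200, 800]    # second indicator polynucleotide, samples 1 and 2

s3 = [normalize(p, g, gene) for p, g in zip(p3, gene)]
s4 = [normalize(p, g, gene) for p, g in zip(p4, gene)]

print([round(v, 1) for v in s3], [round(v, 1) for v in s4])
```

Any difference that remains between the two samples after this step is a candidate splicing difference rather than a gene-level expression change.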
  • the normalization method applies a mathematical equation to expression level data for a plurality of indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more.
  • the normalization method applies a mathematical equation to expression level data for a plurality of exon indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more.
  • the normalization method applies a mathematical equation to expression level data for a plurality of exon-exon junction indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more.
  • the normalization method applies a mathematical equation to expression level data for a plurality of exon indicator polynucleotides and exon-exon junction indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more.
  • the method applies to cases where there are more than two splice variants.
  • expression level data from a database to expression level data from a computer file, or to expression level data obtained from a storage medium, including any of the types mentioned in the provisional applications incorporated by reference, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc.
  • a storage medium may store normalized expression level data derived using the methods above.
  • the storage medium may be any of the types of storage medium already mentioned, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc.
  • a storage medium may store a plurality of normalized expression levels derived using the methods above.
  • the storage medium may comprise data for 10 or more normalized expression levels, or 100 or more normalized expression levels, or 1000 or more normalized expression levels, or 10000 or more normalized expression levels. It may comprise data for normalized expression levels for one or more samples, or 2 or more samples, or 10 or more samples, or 20 or more samples, or 50 or more samples, or 100 or more samples, or 1000 or more samples, or 10000 or more samples.
  • Normalized expression level data calculated using the methods above may be transmitted in various ways: over a network by TCP/IP, FTP, or SMTP; in an email attachment; by courier; by mail; or by copying onto a CD, DVD, or memory device and transferring from one storage medium to another. It may be transferred in any of the ways mentioned in the patent applications incorporated by reference, or in any other way that may occur to one of skill in the arts.
  • the data store and data transmission are aspects of the present invention that are useful in their own right.
  • the methods described above and the analysis described below can be accomplished using a software program written to conduct the described methods or analysis. The software is stored on a suitable storage medium, such as those already mentioned above.

IV. Analysis of splicing patterns
  • a splicing analysis method computes a score for changes in splicing. In an embodiment, it computes a generalized Splicing Index.
  • the Splicing Index can be defined as:
  • $R_{long} = \left[\sum_{i} \log(K_{is} / K_{it})\right] / n - \left[\sum_{i} \log(D_{is} / D_{it})\right] / m$ (7)
  • J is a junction signal for a short splice form (an exon skip, later first exon, earlier last exon, intron splice or truncated exon from alternative donor or acceptor)
  • $K_i$ are signals for n long splice form junctions or exons
  • s and t are samples or computational pools of samples using an equation such as mean, geometric mean, median, or geometric median, perhaps omitting outliers.
  • $R_{long} = (\log 200/400 + \log 200/400 + \log 200/400) / 3 - (\log 600/600 + \log 600/600) / 2$.
  • the generalized Splicing Index omits the signal for the skipped exon. It should be noted that the term for the flanking exons cancels out.
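The long-form term of the generalized Splicing Index can be evaluated with a short sketch. It assumes log base 2 (the base used for fold changes elsewhere in this description) and treats the second sum as running over the flanking exon signals; the function names are hypothetical.

```python
import math


def mean_log_ratio(pairs):
    """Average log2(signal in s / signal in t) over (s, t) signal pairs."""
    return sum(math.log2(s / t) for s, t in pairs) / len(pairs)


def r_long(long_form_pairs, flanking_pairs):
    """Long-form term: mean log ratio of the n long-form probes minus the
    mean log ratio of the m flanking exon probes (assumed meaning of D)."""
    return mean_log_ratio(long_form_pairs) - mean_log_ratio(flanking_pairs)


# Three long-form probes at 200 in sample s and 400 in sample t;
# two flanking exons at 600 in both samples.
value = r_long([(200, 400)] * 3, [(600, 600)] * 2)
print(value)
```

Because the flanking exon signals are unchanged between the samples, their term contributes zero and the result is driven entirely by the long-form probes.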
  • an exon skip but without J1-2.
  • This example without J1-2 addresses the case of an alternative first exon. Either the gene begins with J1-3 or with J2-3.
  • the generalized Splicing Index can be applied as before, except that the term for J1-2 will be omitted.
  • an exon skip but without J2-3.
  • This example without J2-3 addresses the case of an alternative last exon. Either the gene ends with J1-3 or with J1-2.
  • the generalized Splicing Index can be applied as before, except that the term for J2-3 will be omitted.
  • the generalized Splicing Index combines long form probes using a function other than the average log ratio. It applies a function of the form $F(\log(K_{is} / K_{it}))$, where K, i, s and t are defined as above. F might be the mean, minimum, maximum, median, or other function of the values of $K_i$. In an embodiment, F takes the minimum unless two $\log(K_{is} / K_{it})$ pairs have opposite signs, in which case F gives the value of zero (0).
  • the splicing analysis method computes a score for splicing changes using a generalized ASPIRE algorithm:
  • $R_{long} = \min_i \left(\log [K_{is} / K_{it}]\right)$ (10) where the variables have the same meanings as for the generalized Splicing Index, and k is a constant, such as 2. If any $\log [K_{is} / K_{it}]$ pair has opposite signs, the data analysis method sets $R_{long} = 0$.
  • the splicing analysis method sets the sign of the ASPIRE value based on R
  • the generalized ASPIRE equation can address all types of splicing events.
  • the generalized ASPIRE algorithm combines long form probes using a function other than the minimum log ratio. It applies a function of the form $F(\log(K_{is} / K_{it}))$, where K, i, s and t are defined as above. F might be the mean, minimum, maximum, median, or other function of the values of $K_i$.
  • the splicing analysis method computes a score for splicing changes using a Splice Ratio equation that combines elements of the generalized Splicing Index and generalized ASPIRE. It identifies splicing changes that occur in opposite directions, but relative to the global or local gene expression level G rather than in absolute terms, as is the case with ASPIRE:
  • ISplice Ratio 2 * min (
  • the splicing analysis method uses a global gene expression level to normalize probes.
  • the data analysis method uses a local gene expression level to normalize probes.
  • the splicing analysis method applies a mathematical algorithm to determine a splicing score for a splice event with more than two alternatives.
  • the splicing analysis method applies a variant of the generalized Splicing Index, generalized ASPIRE algorithm or Splice Ratio to determine a splicing score for such a splice event.
  • the spliceoforms are S1-3-4, S1-4 and S1-2-3-4.
  • exon probes target each of the four exons (E1, E2, E3 and E4)
  • exon-exon junction probes target each of the five distinct exon-exon junctions (J1-3, J3-4, J1-4, J1-2 and J2-3).
  • the splicing analysis method applies a mathematical algorithm to determine a splicing score for each spliceoform.
  • J1-2, E2 and J2-3 all have signals of 200 in a first sample and 400 in a second sample
  • J1-3 has a signal of 100 in the first and second samples
  • J1-4 has a signal of 300 in the first sample and 600 in the second sample.
  • the probes determine a signal for each of the three spliceoforms.
  • the splicing analysis method applies a mathematical algorithm to the probe signals pertaining to the three spliceoforms.
  • the splicing analysis method sums the signals of each long form probe and treats the sum of signals as if it were a single alternative splice form.
  • $R_{long}$ equals the log of the sum of signals for each long splice form in the first sample minus the log of the sum of signals for each long splice form in the second sample.
  • $R_{long} = F(\log [L_{js} / L_{jt}])$ (12) where F is a function such as the minimum or mean and each $L_j$ is a long splice form in a sample s or t. L is further defined as
  • the splicing analysis method converts a splicing score to a fraction or a percentage. For example, suppose the score is -2 in log base 2; i.e., the fold change is -4. Therefore the signal for the short splice form is 1/4 the signal of the long splice form in a first sample relative to a second sample. Hence, the fraction is 0.2 for the short splice form and 0.8 for the long splice form, and the percentages are 20% and 80% respectively. In an embodiment, the splicing analysis method converts a splicing score for a single sample to a fraction or a percentage.
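The conversion from a log score to fractions can be sketched as below. It assumes a base-2 score that represents the ratio of short-form to long-form signal, and a two-form model in which the fractions sum to one. The function name `score_to_fractions` is hypothetical, and this is an illustration consistent with the worked example rather than the specification's own conversion equations.

```python
def score_to_fractions(log2_score):
    """Convert a log2(short / long) score into (short, long) fractions.

    Assumes exactly two splice forms whose fractions sum to 1.
    """
    ratio = 2.0 ** log2_score        # linear short-to-long ratio
    short = ratio / (1.0 + ratio)    # short form's share of the total
    return short, 1.0 - short


short_frac, long_frac = score_to_fractions(-2)
print(f"{short_frac:.0%} short, {long_frac:.0%} long")
```

A score of -2 gives a ratio of 1/4, so the short form accounts for one part in five of the total signal.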
  • a splicing score can be straightforwardly calculated for the first sample, sample s.
  • a percentage or fraction can be determined. For example, consider the case of an alternative acceptor site.
  • a short splice form might have a signal of 400 and a long splice form might have three probes (a module probe, an exon-exon junction probe and a module-module junction probe) having a geometric mean signal of 600.
  • the Splicing Index could be calculated using equation 1 as log 400 - log 600. The resulting value can be converted to a fraction using the equations.
  • the splicing analysis method converts a splicing score for two spliceoforms to a fraction or a percentage.
  • the splicing analysis method converts a splicing score for three or more spliceoforms in one sample to a fraction or a percentage.
  • Equation 12 provides a way to compute a splicing score using several of the algorithms above.
  • Equations 14 and 15 provide a way to compute a fraction or a percentage for each spliceoform.
  • the splicing analysis method converts a splicing score for two spliceoforms in two samples to a fraction or a percentage.
  • the splicing analysis method converts a splicing score for two spliceoforms in more than two samples, or more than two spliceoforms in two samples, or more than two spliceoforms in more than two samples, to a fraction or a percentage.
  • a spliceoform and a splice event are terms that may be interchangeable depending on the gene model. All of the events above are intended to apply to whole spliceoforms as well as to specific alternatively spliced gene regions.
  • the splicing analysis method applies a mathematical algorithm to determine a splicing score for an indicated polynucleotide such as a single exon, intron, exon-exon junction, exon-intron junction, module or module-module junction (as opposed to a splice event or a spliceoform).
  • the splicing analysis method applies a generalized Splicing Index, generalized ASPIRE or Splice Ratio algorithm to determine a splicing score for an indicated polynucleotide. For a short splice form, this is straightforward, since there will typically be only one exon-exon junction identifying the splice form.
  • the splicing analysis method applies a mathematical equation to determine a signal for the multiple probes. For example, the probe signal intensities might be averaged, log averaged, a median value might be used, outliers might be discarded, the values might be first normalized using a z-score, etc.
  • each indicated polynucleotide can be treated individually. This gives a splicing score for each probe.
  • the change involves treating each indicated polynucleotide as a "short form" in the equations above, and the mutually exclusive splice events or spliceoforms as the "long form".
  • the signal measured for J1-2 is 300.
  • This indicated polynucleotide (sometimes, for simplicity's sake, referred to as a probe, even though in fact this 'probe' may comprise multiple probes) can be treated as a 'short form' in the equations above such as Equation 1.
  • the corresponding 'long forms' relative to this indicated polynucleotide are J1-3 and J1-4.
  • a splicing score can be calculated for E2. Again the 'long forms' relative to the indicated polynucleotide are J1-3 and J1-4.
  • the splicing scores for J1-2, E2 and J1-3 may be different, although in theory, if the gene model comprises only three spliceoforms, the scores should be the same. Differences may arise because of an unexpected additional spliceoform (a surprising result) or because of experimental error.
  • Application of an equation above can determine a splicing score, a fraction or a percent composition for each indicated polynucleotide.
  • the splicing analysis method uses a score for splicing changes to filter data for one or more splice events.
  • the splicing analysis method calculates a score for one or more splice events, and the splice event passes the filter if the score satisfies some filtering criterion, F(score).
  • the splicing analysis method accepts a score if the absolute value exceeds a constant value. For example, suppose there are ten exon skip events detected by probes on a microarray, and a Splicing Index score is calculated for each.
  • the splicing analysis method may filter out any events that have a score with an absolute value less than 1.0 in log base two (i.e., a splice fold change of less than two). Suppose that two of the exon skip events pass the filtering criterion.
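An absolute-value filter of this kind can be sketched in a few lines. The event identifiers and scores below are hypothetical illustration values, not measured data, and `passes_filter` is an assumed name.

```python
def passes_filter(score, min_abs_score=1.0):
    """Keep an event when |score| is at least the threshold (log base 2)."""
    return abs(score) >= min_abs_score


# Hypothetical Splicing Index scores for ten exon skip events.
scores = {
    "skip_01": 0.12, "skip_02": -1.40, "skip_03": 0.55, "skip_04": -0.08,
    "skip_05": 0.30, "skip_06": 1.95, "skip_07": -0.61, "skip_08": 0.02,
    "skip_09": 0.47, "skip_10": -0.93,
}

passing = {event: s for event, s in scores.items() if passes_filter(s)}
print(sorted(passing))
```

With these illustrative values, two of the ten events pass, one up-regulated and one down-regulated.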
  • the splicing analysis method calculates a score for splice events using another algorithm that may be familiar to one skilled in the arts, such as ASAP or genASAP. In another embodiment, the splicing analysis method calculates a score using a machine learning algorithm, a genetic algorithm, a neural network, a simulated annealing algorithm, or another algorithm that may occur to one of skill in the arts.
  • the splicing analysis method processes data for a plurality of splice events in a data set, computes a score, and filters data based on a function of the score. In another embodiment, the splicing analysis method further filters the data based on a minimum signal level. In another embodiment, the data splicing method processes data for 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1 million or more, splice events in a data set. In an embodiment, the splicing analysis method processes data for splice events for one sample, or 2 or more samples, or 10 or more samples, or 20 or more samples, or 50 or more samples, or 100 or more samples.
  • the splicing analysis method processes data for exon skip events. In another embodiment it processes data for alternative first exons. In another embodiment, it processes data for alternative last exons. In another embodiment, it processes data for intron retention events. In another embodiment, it processes data for alternative donor sites. In another embodiment, it processes data for alternative acceptor sites. In another embodiment, it processes data for multiple splice event types. [0092] For example, suppose one had data for 10000 exon skip events,
  • the splicing analysis method may process data for one splice event type only, or a combination. It then filters the data and determines which splice events pass the filtering criteria using an algorithm such as those described above. For example, it might find that 100 exon skip events pass the filtering test.
  • the splicing analysis method filters data by splice event type. For example, it might process data from a high-density microarray and output all data for probes that detect alternative donor sites. In an embodiment, the splicing analysis method filters data to identify data for a single splice event type. In another embodiment, the splicing analysis method filters data to identify multiple splice event types. For example, it might process data and output all data for probes that detect either alternative first exons or alternative last exons. [0094] In an embodiment, the splicing analysis method filters data based on the sign of the score.
  • the splicing analysis method filters data and outputs data for up-regulated splice events or probes. In an embodiment, the splicing analysis method filters data and outputs data for splice events or probes with a positive score. In another embodiment, the splicing analysis method filters data and outputs data for down-regulated splice events or probes. In another embodiment, the data analysis method filters data and outputs data for splice events or probes with a negative score.
  • the splicing analysis method filters data and outputs data for splice events or probes with a score of zero, or of non-zero.
  • the splicing analysis method may filter data using a single criterion or a combination of criteria. For example, it may find data with a minimum signal intensity, a positive score, and a specific splice event type, such as intron retentions.
  • the splicing analysis method stores data for probes or for splice events that pass the filtering test. It stores the data in a storage medium, such as any of those listed above, including computer memory, a disk drive, etc.
  • the data analysis method may transmit the data using any of the methods mentioned above, including network protocols, courier, etc.
  • the storage medium is an aspect of the present invention of utility in its own right.
  • the splicing analysis method finds the intersection of data from two data stores. For example, suppose data was generated for two samples with replicates of each sample. The samples are S1R1, S1R2, S2R1 and S2R2. Suppose intron retention events in the first replicate pair (S1R1 and S2R1) are filtered and stored in a first file. Suppose intron retention events in the second replicate pair (S1R2 and S2R2) are filtered and stored in a second file. The intersection of the two data files may be of special interest, since it will contain only those intron retention events that passed the filtering criteria in both replicate pairs. The splicing analysis method creates a data store for the intersection of two or more data stores. The splicing analysis method transmits a data store for the intersection of two or more data stores. The data store and data transmission may be of any of the types mentioned in this document. The data store and data transmission are aspects of the present invention that are useful in their own right.
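The intersection of two filtered data stores can be sketched with sets. The event identifiers are hypothetical, and in-memory sets stand in for the two files.

```python
def intersect_stores(*stores):
    """Return the events present in every one of two or more data stores."""
    if len(stores) < 2:
        raise ValueError("need at least two data stores")
    result = set(stores[0])
    for store in stores[1:]:
        result &= set(store)
    return result


# Hypothetical intron retention events that passed filtering in each pair.
first_pair = {"IR_0007", "IR_0042", "IR_0113", "IR_0256"}   # S1R1 vs S2R1
second_pair = {"IR_0042", "IR_0099", "IR_0256", "IR_0301"}  # S1R2 vs S2R2

reproducible = intersect_stores(first_pair, second_pair)
print(sorted(reproducible))
```

Only events that pass in both replicate pairs survive, which is what makes the intersection a simple reproducibility check.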
  • the splicing analysis method analyzes data in order to determine statistical significance.
  • the splicing analysis method calculates a standard deviation of data. For example, suppose technical replicates are used as control. The two replicates can be compared using any of the algorithms above. Suppose Splicing Indexes are calculated for each exon skip event. Suppose there are 1000 exon skip events. Hence, there will be 1000 Splicing Index scores. The splicing analysis method calculates the standard deviation of these 1000 scores. In an embodiment, the splicing analysis method calculates a standard error of data. In an embodiment, the splicing analysis method uses empirical statistics to determine a confidence level.
  • the splicing analysis method calculates a confidence interval for a single splice event type. For example, the 1000 exon skip events mentioned above. In an embodiment, the splicing analysis method calculates a confidence interval for splice events that involve the same number of probe measurements. For example, suppose a microarray contains 4 probes per exon skip event (3 exon-exon junction probes plus one exon probe).
  • the microarray also contains 4 probes per intron retention event (1 exon-exon junction probe, 2 exon-intron junction probes, and 1 intron probe).
  • the statistical properties of these two splice events might reasonably be assumed to be similar.
  • the same confidence interval, standard deviation or standard error could be used for both splice event types.
  • alternative donor sites and alternative acceptor sites might be detected by the same number of probes, perhaps 2 exon-exon junction probes for each (one probe for the short form, one for the long form).
  • the statistical properties of donor and acceptor sites might reasonably be assumed to be the same.
  • the splicing analysis method calculates a single confidence interval for multiple splice event types; e.g., exon skips and intron retentions.
  • the splicing analysis method creates a data store for the results of statistical calculation.
  • the data store may be an address in computer memory, a data file, etc.
  • the splicing analysis method uses a statistical confidence interval as a filter. For example, suppose alternative first and last exons have a 99% confidence interval of +/- 0.10 for the Splice Ratio. When comparing data for two samples, a potential alternative first exon may have a Splice Ratio score of 0.15. This value lies outside of the confidence interval, which is centered at zero. Hence, the alternative first exon event passes the filtering criterion and is statistically significant with p < 0.01. (Additional criteria might be applied, of course, and the event would have to pass these additional criteria as well.) In an embodiment, the splicing analysis method uses a confidence interval as a filtering criterion for a single splice event type.
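A confidence-interval filter can be sketched as follows. The half-width estimate assumes that control (technical replicate) scores are roughly normal and centered at zero, which is only one possible choice; an empirical percentile could be used instead. The function names and the control scores are hypothetical.

```python
from statistics import NormalDist, stdev


def half_width_from_controls(control_scores, confidence=0.99):
    """Estimate the interval half-width as z * standard deviation.

    Assumes control scores are approximately normal with mean zero.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * stdev(control_scores)


def outside_interval(score, half_width):
    """True when the score lies outside an interval centered at zero."""
    return abs(score) > half_width


# Hypothetical control scores from a replicate-vs-replicate comparison.
controls = [0.03, -0.05, 0.01, -0.02, 0.04, -0.01, 0.00, 0.02]
print(round(half_width_from_controls(controls), 3))

# The example from the text: interval of +/- 0.10, observed score of 0.15.
print(outside_interval(0.15, 0.10))
```

A score that falls inside the interval is treated as indistinguishable from replicate noise and is filtered out.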
  • the splicing analysis method uses a confidence interval as a filtering criterion for two or more splice event types. E.g., alternative first and alternative last exons. In an embodiment, the splicing analysis method uses a confidence interval as a filtering criterion for all splice event types in a data set. [0099] The splicing analysis method calculates a confidence interval for gene expression. In an embodiment, the gene expression is calculated as described above. The splicing analysis method filters gene expression data derived from a splice variant microarray using the confidence interval. The splicing analysis method creates a data store for the confidence interval and transmits it.
  • the splicing analysis method generates a report of alternative splicing changes in one or more samples.
  • the splicing analysis method computes splicing scores for data for one or more splice types.
  • it creates a data store (a computer file, a database table, a series of rows in a database, a series of addresses in memory, a printed document, a worksheet in a spreadsheet) for each splice type. For example, it creates a tab-delimited computer file containing results from processing alternative first exons, and another file containing results from processing intron retentions.
  • it includes the splicing score in the data store.
  • it includes only data that passes some filtering criterion or criteria as described above.
  • it creates a single data store for data from multiple splice types. For example, it stores data for each splice type in worksheets of a spreadsheet file. As another example, it stores all data in a single database table with a column that defines the splice type.
  • the splicing analysis method creates a summary of the analysis results for one or more splice types.
  • Table 1 provides an example:
  • Table 1 identifies the number of 'instances' of each data type (gene or splice event type), such as alternatively spliced short spliceoforms, or alternative donor sites (short form or long form). It shows the number of instances of each splice event that pass the filtering criterion defined in the columns Min Fold, Min Signal and Evidence.
  • the MCF7 cell line contained 38 exon include events that passed the filtering criterion of a minimum fold change (linearized Splice Ratio) of 1.4 and minimum signal intensity for each probe of 300.
  • the splicing analysis method creates a report for the comparison of two sample groups.
  • the splicing analysis method creates a report for a multi-sample comparison, comparing each sample to all others or to the average of all others.
  • the splicing analysis method creates a report for present or absent spliceoforms for each splice event type in one or more samples. For example, there may be 58 exon skip/include events where the skip form is present in a sample and 21 events where the include form is present in the sample.
  • the splicing analysis method creates a profile of splicing scores for splice events in two or more samples.
  • a profile may comprise a vector of log ratio values for each sample, e.g., (-0.01, 0, 0.006, -0.15, 1.21).
  • the first element in the vector is the score for a first sample vs. a second sample; the second element is the score of a second sample vs. a third sample.
  • the scores may be present/absent scores for samples taken independently. Given two such profiles, the splicing analysis method computes a distance measure. In an embodiment, the splicing analysis method computes a Pearson correlation coefficient between two splicing profiles.
  • the splicing analysis method computes a Euclidean distance between two splicing profiles.
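Both distance measures can be computed with the standard library, as in the sketch below. The first profile is the example vector given above; the second profile is a hypothetical comparison vector, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.dist(x, y)


profile_a = (-0.01, 0, 0.006, -0.15, 1.21)  # example profile from the text
profile_b = (0.02, -0.03, 0.10, -0.20, 0.95)  # hypothetical second profile

print(round(pearson(profile_a, profile_b), 3))
print(round(euclidean(profile_a, profile_b), 3))
```

Pearson correlation compares the shape of two profiles regardless of scale, whereas Euclidean distance also penalizes differences in magnitude.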
  • the splicing analysis method converts splicing profiles into bit strings and computes a Hamming distance. For example, suppose the bit-conversion assigns two bits to each sample: it assigns a bit string of 00 to splicing scores whose absolute value is less than the standard error or that lie within the confidence interval; it assigns a bit string of 01 to positive splicing scores greater than the standard error or outside of the confidence interval; and it assigns a bit string of 10 to negative splicing scores less than minus the standard error or outside of the confidence interval.
  • the Hamming distance between two such bit strings is equal to the number of bits that differ. For example, the Hamming distance between 00 10 01 and 00 01 01 is 2, since the two middle bits differ.
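The bit-string encoding and the Hamming distance can be sketched as below, using a single symmetric threshold in place of the standard error or confidence interval half-width. The function names, scores and threshold are hypothetical.

```python
def encode_score(score, threshold):
    """Two-bit code: 01 for up, 10 for down, 00 for within the threshold."""
    if score > threshold:
        return "01"
    if score < -threshold:
        return "10"
    return "00"


def profile_to_bits(scores, threshold):
    """Concatenate the two-bit codes for every score in a profile."""
    return "".join(encode_score(s, threshold) for s in scores)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


bits_a = profile_to_bits([0.0, -0.5, 0.5], threshold=0.1)  # "001001"
bits_b = profile_to_bits([0.0, 0.5, 0.5], threshold=0.1)   # "000101"
print(bits_a, bits_b, hamming(bits_a, bits_b))
```

Under this encoding a reversal of direction in one sample changes both of its bits, so it counts twice as much as a change between "no change" and either direction.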
  • the splicing analysis method may compute a distance for profiles using another algorithm that may occur to one of skill in the arts.
  • the splicing analysis method creates a matrix of splicing profiles, e.g., each row contains the splicing scores for a splice event, and each column contains the splicing scores for a sample or sample comparison. Alternatively, the matrix could be arranged with splicing events in columns and samples in rows.
  • the splicing analysis method creates a data store for the matrix or collection of splicing profiles.
  • the data store may be a computer file, memory address, or other storage as mentioned above.
  • the data store is an aspect of the present invention and is of use in its own right for storage and transmission.
  • the splicing analysis method transmits the matrix of splicing profiles using a network protocol, postal service or other method mentioned above.
  • the data transmission is an aspect of the present invention and is of use in its own right.
  • the splicing analysis method performs a mathematical algorithm on profiles or matrices of splicing scores. In an embodiment, it performs a principal component analysis. In another embodiment, it clusters the profiles using a greedy clustering algorithm. In another embodiment, it clusters the profiles using self-organizing maps. In another embodiment, it clusters the profiles using a hierarchical clustering algorithm. In another embodiment, it clusters the profiles using k-means. In another embodiment, it clusters the profiles using another algorithm that may occur to one of skill in the arts. One of skill will appreciate that, once splicing scores have been calculated and stored in a matrix, the application of a mathematical algorithm is a straightforward matter that can be performed using statistical analysis software.
  • the splicing analysis method creates a data store for the results of the clustering analysis or principal component analysis.
  • the data store may be of any of the types described in this document.
  • the splicing analysis method transmits the data store using any of the methods described in this document.
  • the data store and transmission are aspects of the present invention that are useful in their own right.
  • the splicing visualization method visually indicates data that passes the filtering test in a computer file, a spreadsheet or another storage medium such as a monitor.
  • the splicing visualization method visually indicates data that passes the filtering test in a scatter plot in two or more dimensions. See the figures below.
  • the scatter plot with visual indication is an aspect of the present invention of utility in its own right, since it enables a scientist to visually determine the extent of alternative splicing in a multi-sample comparison.
  • the splicing visualization method visually indicates data (in a scatter plot or spreadsheet or computer file) that passes the filtering criterion by using a different color from other data points.
  • the splicing visualization method visually indicates data by using a different font: a different font face, font style, font decoration, font size, or other attribute of the type face.
  • the data analysis method visually differentiates between short form and long form probe data. For example, suppose the data contains one exon-exon junction probe for a spliced intron and two exon-intron junctions for a retained intron. The data analysis method visually differentiates between the exon-exon junction probe's data and the two exon-intron junction probes' data. In an embodiment, the data analysis method displays short form and long form data in different colors. For example, red for short form data and orange for long form data. In an embodiment, the data analysis method visually indicates short form data. In an embodiment, the data analysis method visually indicates long form data. In an embodiment, the data analysis method visually indicates both short form and long form data. The data analysis method may use any of the visual cues mentioned above.
  • the data analysis method visually indicates the score for splicing data. In an embodiment, it indicates score by adjusting the hue. For example, a positive score could be red and a negative score green in a spreadsheet or scatter plot. In another embodiment, it visually indicates the score using the color saturation. For example, larger positive scores might be brighter red, larger negative scores brighter green, and scores close to zero nearly black. In another embodiment, it visually indicates the score using the transparency. For example, scores close to zero might be nearly transparent, whereas scores with large absolute values might be more opaque. In another embodiment, it visually indicates the score using the symbol size. For example, larger points in a scatter plot may indicate scores with larger absolute values.
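One way to map a score to a display color is sketched below: positive scores are drawn in red, negative scores in green, and brightness scales with the absolute value, so that scores near zero are nearly black. The function name, the RGB representation and the clamping value `max_abs` are assumptions made for illustration.

```python
def score_to_rgb(score, max_abs=2.0):
    """Map a splicing score to an (R, G, B) tuple with 0-255 channels.

    Brightness grows linearly with |score| and is clamped at max_abs.
    """
    intensity = min(abs(score) / max_abs, 1.0)
    level = round(255 * intensity)
    if score > 0:
        return (level, 0, 0)   # positive score: shades of red
    if score < 0:
        return (0, level, 0)   # negative score: shades of green
    return (0, 0, 0)           # zero score: black


for s in (-2.0, -0.2, 0.0, 0.2, 2.0):
    print(s, score_to_rgb(s))
```

The same intensity value could instead drive transparency or symbol size, as described for the other embodiments.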
  • the data analysis method may visually indicate the score using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
  • the data analysis method visually indicates data of a given splice event type. For example, it may visually indicate exon skip data in one way, and intron retention data in another. In an embodiment, the data analysis method visually indicates data for a given splice event type using a different color. In an embodiment, the data analysis method visually indicates data for a given splice event type using a different shape. The data analysis method may visually indicate the splice event type using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
  • a spreadsheet, scatter plot, line plot, or other image with visually indicated data is an invention in its own right, of use to scientists in interpreting alternative splicing data.
  • the data analysis method may store the visually indicated data in a storage medium or transmit it in any of the ways mentioned or that may occur to one of skill in the arts.
  • the data analysis method links data in a spreadsheet or scatter plot with a gene model viewer (a "splice graph").
  • the user indicates data in a scatter plot and the corresponding part of the gene model is visually indicated.
  • the user may click on a spreadsheet row containing data for an exon-exon junction probe, and a gene model viewer may open, if it is not already open, and that exon-exon junction will be visually indicated.
  • the user may move a mouse over a point in a scatter plot, and the relevant part of the gene model will be visually indicated. For example, an exon portion may be highlighted.
  • the visual indication may employ any of the methods already mentioned, or another.
  • the visual indication might involve changing the color of a portion of the gene model, such as an exon-exon junction or exon or intron or exonic portion or exon-intron junction. Or the visual indication may change the visual attributes of the portion, or underline it, or outline it, or label it with text, or display an icon near it.
  • the data analysis method visually indicates a single region of the gene model, such as an exon, intron, exonic portion, module, exon-exon junction, exon-intron junction, or module junction. In another embodiment, the data analysis method visually indicates multiple regions of the gene model.
  • the user may select rows in a spreadsheet for multiple probes, and the gene model would then highlight all of the gene regions and splice isoform regions targeted by those probes.
  • the data analysis method may visually indicate the portion of the gene model using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
  • the user indicates a portion or portions of a gene model, and the corresponding data in a spreadsheet or scatter plot is visually indicated. For example, the user moves a computer pointer over an exonic portion, or exon, or intron, or exon-exon junction, or exon-intron junction, and the associated data in a spreadsheet or scatter plot or line plot is highlighted.
  • the data analysis method may visually indicate the data in the spreadsheet or scatter plot using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
  • the visually indicated data (embodied in an image on screen, an image file, spreadsheet, storage medium, etc) is an invention of use in its own right, since it can facilitate more intuitive understanding of alternative splicing data.
  • the data analysis method may store the visually indicated data in a storage medium or transmit it in any of the ways mentioned or that may occur to one of skill in the arts.
  • the splicing integration method connects splicing data to software resources such as gene ontology tools, alternative splicing databases, sequence databases, pathway software, chemistry database, etc.
  • the splicing integration method links splicing data to a gene ontology tool.
  • the splicing integration method links splicing data to a sequence database.
  • the splicing integration method links splicing data to a genome browser. In an embodiment, the splicing integration method links splicing data to an alternative splicing database. In an embodiment, the splicing integration method links splicing data to a pathway tool. In an embodiment, the splicing integration method links splicing data to a chemistry database. In an embodiment, the splicing integration method links splicing data to a gene model viewer.
[00117] For example, suppose an exon-exon junction probe is annotated with ACCESSION_ID and GENE_SYMBOL and has genomic coordinates on chromosome 6 of 5000 to 5020 for the first exonic portion and of 6000 to 6020 for the second exonic portion.
  • the probe annotation might be displayed in a spreadsheet, or the probe might be represented as a point in a scatter plot or line plot, or in a region of a gene model viewer.
  • the user might indicate the probe annotation, representation or region by moving a cursor or navigating using the keyboard.
  • the indicated probe might then be highlighted, or a tooltip or popup window or context menu might appear.
  • the user gives a specified cue, such as a mouse cue (left click, right click, center click, mouse wheel, mouse drag, hover) or keyboard cue (key press) or input on a touch screen, or a combination of these or other methods.
  • the software application launches an external tool. For example, a hyperlink might open with a URL to a web-based tool.
  • the splicing integration method links a probe to a genome browser displaying information related to the genomic or chromosomal region targeted by the probe.
  • the University of California Santa Cruz genome database and web browser might open, displaying a base range that includes the probe's genomic locations. The browser might display genomically aligned sequences within that base range.
  • the genome browser displays the nucleotide sequence of the probe.
  • the genome browser displays the nucleotide sequence of the genomic region to which the sequence with the ACCESSION_ID aligns. For example, suppose the probe detects a spliceoform indicated by accession ABC12345. Suppose the sequence with that accession has been aligned to chromosome 6 with a given set of coordinates. The genome browser displays the nucleotide sequence of that coordinate set.
  • the splicing integration method connects splicing data to a resource using a hyperlink. In an embodiment, the splicing integration method connects splicing data to a resource using a menu item in the menu bar or in a context menu. In an embodiment, the splicing integration method connects splicing data to a resource using a toolbar button. In an embodiment, the splicing method connects splicing data to a resource using a keyboard shortcut. In an embodiment, the splicing method connects splicing data to a resource using a mouse cue. In an embodiment, the splicing method connects splicing data to a resource using a mouse cue and a keyboard shortcut. In an embodiment, the splicing method connects splicing data to a resource using another input method or cue.
  • One of skill will appreciate the variety of methods that may be employed to connect splicing data to software and database resources.
  • Figures 1A-1H show scatter plots obtained using indicator polynucleotides corresponding to particular polynucleotide sequences of differentially expressed splice variants with a minimum signal of 200 and a "Splice Fold" (linearized Splice Ratio) score > 2 (a >99.9% confidence interval for all splice types).
  • the indicator polynucleotides were used in a microarray.
  • Plots on the left show the splice variants present in MCF7 cells vs. CaCO2 cells. Plots on the right show technical replicates from HEK293 cells.
  • the indicator polynucleotides used in the microarrays detect exon skip events (panels A and B); alternative first and last exons (C and D), intron retentions (E and F); and alternative acceptor and donor sites (G and H). Different gene isoforms are clearly present in the different cell types.
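As an illustration of the genome browser link described above for the example exon-exon junction probe (chromosome 6, 5000 to 5020 and 6000 to 6020), the following is a minimal Python sketch, not the method of the disclosure. The class `JunctionProbe`, the function `genome_browser_url`, the chromosome label `chr6`, and the default assembly name `hg38` are all assumptions introduced for illustration. The URL follows the publicly documented `hgTracks` pattern of the UCSC genome browser, which takes `db` and `position` parameters.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class JunctionProbe:
    """Hypothetical annotation record for an exon-exon junction probe."""
    accession_id: str
    gene_symbol: str
    chromosome: str
    first_portion: tuple   # (start, end) of the first exonic portion
    second_portion: tuple  # (start, end) of the second exonic portion


def genome_browser_url(probe, assembly="hg38",
                       base="https://genome.ucsc.edu/cgi-bin/hgTracks"):
    """Build a link to a base range that includes both exonic portions.

    The assembly name is an assumption; use whichever assembly the
    probe coordinates were actually mapped to.
    """
    start = min(probe.first_portion[0], probe.second_portion[0])
    end = max(probe.first_portion[1], probe.second_portion[1])
    query = urlencode({"db": assembly,
                       "position": f"{probe.chromosome}:{start}-{end}"})
    return f"{base}?{query}"


# The example probe from the text, with placeholder annotation values.
probe = JunctionProbe("ABC12345", "GENE_SYMBOL", "chr6",
                      (5000, 5020), (6000, 6020))
print(genome_browser_url(probe))
```

A software application could attach such a URL to a spreadsheet row, scatter plot point, or gene model region and open it in response to any of the user cues listed above.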

Abstract

Methods, kits, and software for quantifying overall gene expression levels, or the relative levels of particular splice variants, using a plurality of indicator polynucleotides that correspond to exons, introns, exon-exon junctions, exon-intron junctions, intron-exon junctions, modules, or module-module junctions of a gene are described.

Description

METHODS FOR DETERMINING SPLICE VARIANT TYPES AND AMOUNTS
TECHNICAL FIELD
[0001] The methods and kits of parts relate to the fields of gene expression and to microarray based methods for measuring gene expression, particularly the expression of splice-variants.
BACKGROUND
[0002] Microarrays capable of detecting splice variants may comprise indicator polynucleotides that indicate exons, exon-exon junctions, introns, modules, intron-exon junctions, exon-intron junctions or module-module junctions of a gene. A microarray experiment determines expression levels for indicator polynucleotides in one or more samples. An expression level comprises one or more numerical values for an indicator polynucleotide in a sample.
[0003] However, data analysis of expression levels is non-trivial. Gene expression analysis and transcript expression analysis generally do not consider variations among the expression levels of indicator polynucleotides for the same gene, i.e., they do not in general consider the complexities of alternative splicing, which can lead to multiple products of a single gene. Rather, they tend to treat each indicator polynucleotide as an independent measurement, or aggregate the expression levels of indicator polynucleotides of a gene as a single measurement of overall transcriptional activity. However, mathematical algorithms may quantify the expression of each splice product separately. For example, the inventors have described systems of linear equations and deconvolution algorithms for quantifying the expression levels of splice products of a gene in one or more samples. Other published algorithms include ASAP, Splicing Index, SPLICE, ASPIRE, ANOSVA, Gene-sequence-based splice variant deconvolution, and genASAP. All except for ANOSVA and deconvolution focus primarily on individual exon-skip events occurring in the middle of a gene. Addressing alternative first exons, last exons, donor sites, acceptor sites, exon skips in tandem, and intron retentions requires, at a minimum, generalization of the algorithms, and perhaps new algorithms.
[0004] Even so, the expression levels of splice products considered independently are not the only quantities of biological interest. The relative expression levels of splice variants of a gene in a single sample, and changes in the relative expression levels of splice variants of a gene between and across samples, can yield meaningful insights into splicing regulation that may have biological function specific to disease state, tissue, intracellular localization, population, individual, drug treatment, etc.
[0005] Even the quantification of gene expression levels is complicated when using a splice variant microarray, since indicator polynucleotides for a gene may indicate certain splice variants but not others. It would be desirable to improve the accuracy when quantifying gene expression levels from splice variant microarrays. It would be desirable to measure changes in splicing and to differentiate such changes from changes in overall transcriptional activity. It would be desirable to measure regulated splicing and to have algorithms and representations appropriate for quantifying, grouping, ranking, and understanding splicing of genes in samples. Finally, it would be desirable to place the results of such algorithms within a data configuration, and to transmit the data configuration from one location to another.
REFERENCES
Stamm S, R.J., Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, Barbosa-Morais NL, Thanaraj TA., ASD: a bioinformatics resource on alternative splicing. Nucleic Acids Res., 2006. 34(1): p. D46-D55.
Kim N, S.S., Lee S., ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res., 2005. 15(4): p. 566-576.
Kim P, K.N., Lee Y, Kim B, Shin Y, Lee S., ECgene: genome annotation for alternative splicing. Nucleic Acids Res., 2005. 33(1): p. D75-79.
Lee C, A.L., Modrek B, Xing Y., ASAP: the Alternative Splicing Annotation Project. Nucleic Acids Res., 2003. 31(1): p. 101-105.
Kim N, A.A., Roy M, Lee C, The ASAP II database: analysis and comparative genomics of alternative splicing in 15 animal species. Nucleic Acids Res., 2007. 35(Database Issue): p. D93-8.
Takeda J, et al., H-DBAS: alternative splicing database of completely sequenced and manually annotated full-length cDNAs based on H-Invitational. Nucleic Acids Res., 2007. 35(Database Issue): p. D104-9.
Pospisil H, et al., EASED: Extended Alternatively Spliced EST Database. Nucleic Acids Res., 2004. 32(Database Issue): p. D70-4.
Zheng CL, K.Y., Li HR, Zhang K, Coutinho-Mansfield G, Yang C, Nair TM, Gribskov M, Fu XD., MAASE: an alternative splicing database designed for supporting splicing microarray applications. RNA., 2005. 11(12): p. 1767-76.
Holste D, et al., HOLLYWOOD: a comparative relational database of alternative splicing. Nucleic Acids Res., 2006. 34(Database Issue): p. D56-62.
Huang HD, H.J., Lin FM, Chang YC, Huang CC, SpliceInfo: an information repository for mRNA alternative splicing in human genome. Nucleic Acids Res., 2005. 33(1): p. D80-D85.
Clark TA, S.C., Ares M Jr., Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science, 2002. 296(5569): p. 907-910.
Johnson JM, C.J., Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD., Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science, 2003. 302(5653): p. 2141-2144.
Li C, et al., Cell type and culture condition-dependent alternative splicing in human breast cancer cells revealed by splicing-sensitive microarrays. Cancer Research, 2006. 66(4): p. 1990-99.
Pan Q, SA, Kim YK, Misquitta C, Shai O, Maquat LE, Frey BJ, Blencowe BJ., Quantitative microarray profiling provides evidence against widespread coupling of alternative splicing with nonsense-mediated mRNA decay to control gene expression. Genes Dev., 2006. 20(2): p. 153-8.
Pan Q, et al., Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Mol Cell., 2004. 16(6): p. 929-41.
Sugnet CW, et al., Unusual intron conservation near tissue-regulated exons found by splicing microarrays. PLoS Comput Biol., 2006. 2(1): p. e4.
Le K, M.K., Roy M, Wang Q, Xu Q, Nelson SF, Lee C, Detecting tissue-specific regulation of alternative splicing as a qualitative change in microarray data. Nucleic Acids Res., 2004. 32(22): p. e180.
Srinivasan K, S.L., Hayes JD, Centers R, Fitzwater S, Loewen R, Edmondson LR, Bryant J, Smith M, Rommelfanger C, Welch V, Clark TA, Sugnet CW, Howe KJ, Mandel-Gutfreund Y, Ares M Jr., Detection and measurement of alternative splicing using splicing-sensitive microarrays. Methods, 2005. 37(4): p. 345-59.
Hu GK, M.S., Moldover B, Jatkoe T, Balaban D, Thomas J, Wang Y., Predicting splice variant from DNA chip expression data. Genome Res., 2001. 11(7): p. 1237-1245.
Cline MS, et al., ANOSVA: a statistical method for detecting splice variation from expression data. Bioinformatics., 2005. 21(Suppl 1): p. i107-15.
Wang H, et al., Gene structure-based splice variant deconvolution using a microarray platform. Bioinformatics., 2003. 19(Suppl 1): p. i315-22.
Cuperlovic-Culf M, B.N., Culf AS, Ouellette RJ., Data analysis of alternative splicing microarrays. Drug Discov Today., 2006. 11(21-22): p. 983-90.
Wang BB, B.V., Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci U S A., 2006. 103(18): p. 7175-7180.
Nuwaysir EF, H.W., Albert TJ, Singh J, Nuwaysir K, Pitas A, Richmond T, Gorski T, Berg JP, Ballin J, McCormick M, Norton J, Pollock T, Sumwalt T, Butcher L, Porter D, Molla M, Hall C, Blattner F, Sussman MR, Wallace RL, Cerrina F, Green RD., Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res., 2002. 12(11): p. 1749-55.
Each of these references, as well as other references cited in the text, is herein incorporated by reference.
SUMMARY
[0006] The following aspects and embodiments thereof described and illustrated below are meant to be exemplary and illustrative, not limiting in scope.
[0007] In one aspect, a hybridization method for measuring the levels of alternatively-spliced forms of a gene is provided, the method comprising:
(a) providing two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences of the gene selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene;
(b) incubating a sample comprising alternatively-spliced forms of the gene in the presence of the two or more indicator polynucleotides;
(c) measuring a hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to each of the two or more indicator polynucleotides;
(d) applying a mathematical algorithm to calculate the relative expression levels of alternatively-spliced forms of the gene.
[0008] In some embodiments, the mutually exclusive indicator polynucleotides are non-overlapping. In some embodiments, the mutually exclusive indicator polynucleotides are overlapping.
[0009] In some embodiments, at least one mutually exclusive indicator polynucleotide corresponds to a polynucleotide that is constitutively present in alternatively spliced forms of the gene. In some embodiments, at least one mutually exclusive indicator polynucleotide corresponds to a polynucleotide that is not constitutively present in alternatively spliced forms of the gene.
[0010] In some embodiments, an overall level of expression of alternatively-spliced forms of a gene is calculated by summing the amount of hybridization signal corresponding to the relative amounts of hybridization to each of the mutually exclusive indicator polynucleotides.
[0011] In some embodiments, the overall level of gene expression (G) is calculated using the equations:
$$G = \sum I = \left(\prod B\right)^{1/n} \qquad (1)$$
$$B_{exon} = p + p_c \qquad (2)$$
$$B_{junc} = \sqrt{(p_5 + p_{5c}) \cdot (p_3 + p_{3c})} \qquad (3)$$
wherein
$G$ is the overall gene expression level; each $I$ is an alternatively-spliced form of the gene; $n$ is the number of indicator polynucleotides; and each $B$ is the sum of expression levels of all alternatively-spliced forms of the gene; wherein when at least one indicator polynucleotide corresponds to an exon or intron, $B_{exon}$ is equal to the sum of $p + p_c$, wherein $p$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to the exon or intron, and $p_c$ is the sum of the amounts of hybridization from each indicator polynucleotide, and
[0012] When at least one indicator polynucleotide corresponds to an exon-exon or exon-intron junction, $B_{junc}$ is $\sqrt{(p_5 + p_{5c}) \cdot (p_3 + p_{3c})}$, wherein $p_5$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to a 5' portion of the junction, and $p_3$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to a 3' portion of the junction.
[0013] In some embodiments, background levels of hybridization signal are subtracted from the overall expression level.
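The calculation in equations (1) to (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, equation (1) is read as the geometric mean of the n values of B, and the quantities with a "c" suffix (pc, p5c, p3c) are assumed to be the summed signals of the corresponding mutually exclusive indicator polynucleotides, which the text above does not spell out in full. The signal values are invented for the example.

```python
import math


def b_exon(p, p_c):
    """Equation (2): signal of an exon or intron indicator plus the
    summed signal of its complementary indicators (assumed meaning of pc)."""
    return p + p_c


def b_junction(p5, p5c, p3, p3c):
    """Equation (3): geometric mean of the totals for the 5' and 3'
    portions of a junction."""
    return math.sqrt((p5 + p5c) * (p3 + p3c))


def overall_expression(b_values):
    """Equation (1), read as the n-th root of the product of the B values.

    Computed through logarithms to avoid overflow for large signals;
    this requires every B to be strictly positive.
    """
    if not b_values:
        raise ValueError("at least one B value is required")
    if any(b <= 0 for b in b_values):
        raise ValueError("B values must be positive")
    return math.exp(sum(math.log(b) for b in b_values) / len(b_values))


# Hypothetical signals for one exon indicator and one junction indicator.
b_values = [b_exon(300.0, 100.0), b_junction(150.0, 50.0, 600.0, 200.0)]
print(overall_expression(b_values))
```

Background subtraction, where used, could be applied to the individual signals before calling these functions or to the returned value; the sketch leaves that choice open.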
[0014] In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an intron. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to a module.
[0015] In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon-exon junction. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon-intron junction. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an intron-exon junction. In some embodiments, at least one of the two or more mutually exclusive indicator polynucleotides corresponds to a module-module junction.
[0016] In some embodiments, the indicator polynucleotides are in a microarray.
[0017] In some embodiments, the alternatively-spliced forms of a gene are mRNAs, and the indicator polynucleotides are complementary to the mRNAs. In some embodiments, the alternatively-spliced forms of a gene are cDNAs, and the indicator polynucleotides are complementary to the cDNAs.
[0018] In another aspect, software for performing the calculations described herein is provided. In some embodiments, the software is for determining the amounts of different gene splice variants using data obtained in a microarray having two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences of the gene, selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene, wherein the software applies a mathematical algorithm to calculate the relative expression levels of different gene splice variants.
[0019] In some embodiments, the software performs at least one calculation selected from calculations (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13), (14), and (15), below, or any combination or subset thereof.
[0020] In another aspect, a kit of parts for measuring the levels of alternatively-spliced forms of a gene is provided, the kit comprising:
(a) two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene;
(b) mathematical algorithms for calculating the total and relative levels of alternatively-spliced forms of a gene using hybridization signals corresponding to the amount of hybridization of the alternatively-spliced forms of the gene to each of the indicator polynucleotides; and
(c) instructions for using the indicator polynucleotides and mathematical algorithms.
[0021] In some embodiments, the indicator polynucleotides are in a microarray.
[0022] In some embodiments, the mathematical algorithms are provided in an executable computer application.
[0023] In yet another aspect, a nucleic acid probe set for use in, e.g., a nucleic acid microarray format, is provided. The probe set comprises two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene.
[0024] In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the following descriptions, example, and figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] Figure 1 shows a series of scatter plots depicting splice variants with a minimum signal of 200 and a "Splice Fold" (i.e., linearized Splice Ratio) score of > 2 (>99.9% confidence for all splice types). Plots on the left (i.e., plots A, C, E, and G) show MCF7 cells vs. CaCO2 cells; plots on the right (i.e., B, D, F, and H) show technical replicates of HEK293 cells. The indicator polynucleotides identified the following types of splice variants: A and B, exon skips and includes; C and D, alternative first and last exons; E and F, alternative donor and acceptor sites; and G and H, intron retentions and splices.
[0026] Figure 2 is a graph showing the distribution of differentially expressed splice types. The splice types and relative amounts of each splice type are indicated. These amounts were determined using the SEHS, Splicing Index, ASPIRE, and Splice Ratio methods described herein.
DETAILED DESCRIPTION
[0027] Unless otherwise specified, all scientific terms and expressions are as used in the art. Singular terms can include the plural, such that "a," "an," and "the" can intend both singular and plural.
I. Indicator polynucleotides for detecting splice variants
[0028] The methods and kits are for quantifying overall gene expression levels (or transcriptional activity), or the relative levels of particular splice variants, using a plurality of indicator polynucleotides that correspond to the polynucleotide sequences of exons, introns, exon-exon junctions, exon-intron junctions, intron-exon junctions, modules, or module-module junctions of a gene.
[0029] An exon, exon-exon junction, module, exon-intron junction, intron-exon junction or module-module junction is a "constitutive" polynucleotide if all expected splice variants comprise that exon, exon-exon junction, module, exon-intron junction, intron-exon junction or module-module junction. Otherwise, it is "non-constitutive".
[0030] A first indicator polynucleotide is considered "mutually exclusive" with a second indicator polynucleotide if the first indicator polynucleotide indicates a first exon, module or junction of a first splice variant and the second indicates a second exon, module or junction of a second splice variant, wherein the first splice variant does not comprise the second exon, module or junction and the second splice variant does not comprise the first exon, module or junction.
[0031] A pair of mutually exclusive indicator polynucleotides comprises a first indicator polynucleotide and a second indicator polynucleotide. The pair is "overlapping" if the first indicator polynucleotide comprises a polynucleotide sequence comprised by the second indicator polynucleotide. The pair is "non-overlapping" if the first does not comprise a polynucleotide sequence comprised by the second. This will be clear to one of ordinary skill in the arts from the following examples. A polynucleotide sequence is preferably at least 1, at least 5, at least 10, or even at least 100 nucleotides.
Examples of mutually exclusive indicator polynucleotides:
Example 1
[0032] A gene comprises at least a first exon, an intron, and a second exon, and the second exon has a short form and a long form that differ at the 5' end. The gene has a first splice variant that comprises the first exon and the long form of the second exon, and a second splice variant that comprises the first exon and the short form of the second exon.
(a) An indicator polynucleotide for the exon-exon junction between the first exon and the long form of the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the first exon and the short form of the second exon.
(b) An indicator polynucleotide for the module at the 5' end of the second exon that is part of the long form but not the short form is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the first exon and the short form of the second exon.
(c) An indicator polynucleotide for the intron-exon junction between the intron and the short form of the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the first exon and the short form of the second exon.
Example 2
[0033] A gene comprises at least a first exon, an intron, and a second exon, and the first exon has a short form and a long form that differ at the 3' end. The gene has a first splice variant that comprises the long form of the first exon and the second exon, and a second splice variant that comprises the short form of the first exon and the second exon.
(a) An indicator polynucleotide for the exon-exon junction between the long form of the first exon and the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the short form of the first exon and the second exon.
(b) An indicator polynucleotide for the module at the 3' end of the first exon that is part of the long form but not the short form is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the short form of the first exon and the second exon.
(c) An indicator polynucleotide for the exon-intron junction between the short form of the first exon and the intron is mutually exclusive (overlapping) with an indicator polynucleotide for the exon-exon junction between the short form of the first exon and the second exon.
Example 3
[0034] A gene comprises at least a first exon, a second exon and a third exon. A first splice variant comprises the first exon, the second exon and the third exon. A second splice variant comprises the first exon and the third exon.
(a) An indicator polynucleotide for the junction between the first exon and the second exon is mutually exclusive (overlapping) with an indicator polynucleotide for the junction between the first exon and the third exon.
(b) An indicator polynucleotide for the junction between the second exon and the third exon is mutually exclusive with an indicator polynucleotide for the junction between the first exon and the third exon.
(c) An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the second exon.
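The definitions of "mutually exclusive" and "overlapping" can be checked mechanically for the exon-skip gene of Example 3. The following Python sketch is a simplified model offered under assumptions, not the procedure of the disclosure: each splice variant is represented as a set of feature labels (exons and junctions), and each indicator polynucleotide is represented by the set of exonic portions it spans, so that two indicators are treated as overlapping when they share a portion. The labels (`E1`, `J1-2`, and so on) and function names are hypothetical.

```python
from itertools import permutations

# Features (exons and junctions) present in each splice variant of Example 3.
VARIANTS = {
    "inclusion": {"E1", "E2", "E3", "J1-2", "J2-3"},
    "skip": {"E1", "E3", "J1-3"},
}

# Exonic portions spanned by each indicator polynucleotide (simplified model).
PROBE_PORTIONS = {
    "E2": {"E2"},
    "J1-2": {"E1", "E2"},
    "J2-3": {"E2", "E3"},
    "J1-3": {"E1", "E3"},
}


def mutually_exclusive(a, b, variants=VARIANTS):
    """True if some variant has feature a but not b while another
    variant has feature b but not a."""
    return any(
        a in first and b in second and b not in first and a not in second
        for first, second in permutations(variants.values(), 2)
    )


def overlapping(a, b, portions=PROBE_PORTIONS):
    """True if the two indicators share at least one exonic portion."""
    return bool(portions[a] & portions[b])


for a, b in [("J1-2", "J1-3"), ("J2-3", "J1-3"), ("J1-3", "E2")]:
    kind = "overlapping" if overlapping(a, b) else "non-overlapping"
    print(a, b, mutually_exclusive(a, b), kind)
```

Under this model the junction pairs of items (a) and (b) share an exonic portion, while the pair of item (c) does not, which agrees with the overlapping and non-overlapping labels given in items (a) and (c).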
Example 4
[0035] A gene comprises at least a first exon, a second exon, a third exon and a fourth exon. A first splice variant comprises the first exon and the fourth exon. A second splice variant comprises the first exon, the second exon, the third exon and the fourth exon.
(a) An indicator polynucleotide for the junction between the first exon and the fourth exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the junction between the second exon and the third exon.
Example 5
[0036] A gene comprises at least a first exon, a second exon, a third exon and a fourth exon. A first splice variant comprises the first exon and the third exon. A second splice variant comprises the second exon and the fourth exon.
(a) An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the junction between the second exon and the fourth exon.
(b) An indicator polynucleotide for the second exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the third exon if the gene comprises no splice variant that comprises both the second exon and the third exon.
Example 6
[0037] A gene comprises at least a first exon, a second exon and a third exon. A first splice variant comprises the first exon and the third exon and starts with the first exon. A second splice variant comprises the second exon and the third exon and starts with the second exon.
(a) An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (overlapping) with an indicator polynucleotide for the junction between the second exon and the third exon.
(b) An indicator polynucleotide for the first exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the second exon if the gene comprises no splice variant that comprises both the first exon and the second exon.
Example 7
[0038] A gene comprises at least a first exon, a second exon and a third exon. A first splice variant comprises the first exon and the third exon and ends with the third exon. A second splice variant comprises the first exon and the second exon and ends with the second exon.
(a) An indicator polynucleotide for the junction between the first exon and the third exon is mutually exclusive (overlapping) with an indicator polynucleotide for the junction between the first exon and the second exon.
(b) An indicator polynucleotide for the second exon is mutually exclusive (non-overlapping) with an indicator polynucleotide for the third exon if the gene comprises no splice variant that comprises both the first exon and the second exon.
[0039] These examples, as well as other examples not listed, may be combined in various ways to address a wide range of alternative splicing events and alternative splice forms of a gene. The gene expression technique applies to two or more indicator polynucleotides for exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, or module-module junctions of a gene. The gene expression technique applies to a plurality of genes, such as a set of genes for a gene/protein family, a pathway, or an organism's entire genome. It applies to a plurality of splice variants of genes. It applies to one or more samples. Exemplary samples are cell lines, tissue samples, patient samples, pools of samples, and computational pools.
II. Mathematical algorithms for calculating total and relative levels of splice variants
[0040] In some embodiments, the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate constitutive polynucleotides of a gene. In particular embodiments, the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate non-constitutive polynucleotides of a gene. In other embodiments, the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate mutually exclusive (overlapping) polynucleotides of a gene. In other embodiments, the gene expression technique applies a mathematical algorithm to expression levels for a plurality of indicator polynucleotides that indicate mutually exclusive (non-overlapping) polynucleotides of a gene. In other embodiments, the gene expression technique applies a mathematical algorithm to expression levels for indicator polynucleotides that indicate constitutive, non-constitutive, or mutually exclusive (overlapping or non-overlapping) polynucleotides of a gene.
[0041] In some embodiments, the gene expression technique applies a mathematical algorithm to expression levels of indicator polynucleotides for a gene. In some embodiments, the gene expression technique determines a gene expression level in a sample by summing (i.e., adding) the expression levels of mutually exclusive indicator polynucleotides. In another embodiment, the gene expression technique determines a gene expression level in a sample by summing the background-subtracted expression levels of mutually exclusive indicator polynucleotides.
[0042] For example, the gene expression technique can assign a gene expression level to the gene in Example 1 (above) by averaging (or log averaging, or taking the geometric mean of, or taking a median of, or otherwise applying a mathematical algorithm to) one or more of the following expression levels: (1) an expression level for an indicator polynucleotide for the first exon; (2) the sum of expression levels for the mutually exclusive indicator polynucleotides in (a); (3) the sum of expression levels for the mutually exclusive indicator polynucleotides in (b); (4) the sum of expression levels for the mutually exclusive indicator polynucleotides in (c); and (5) an expression level for an indicator polynucleotide for the second exon. For example, the gene expression technique may apply a mathematical algorithm to the expression levels in (1) and (2), or (2) only, or (2) and (3), etc. In another embodiment, the gene expression technique sums background-subtracted expression levels in (2) and (3). In another embodiment, the gene expression technique considers background readings for the indicator polynucleotides. In an embodiment, it adds a mathematical function of the background readings, such as a sum or log average or geometric mean, to the sum of background-subtracted expression levels for the indicator polynucleotides.
III. Methods for calculating levels of gene expression and levels of splice variants
[0043] What follows are methods of calculating a gene expression level from data from a nucleotide array comprising indicator polynucleotide probes for exon-exon junctions, exon-intron junctions, exons, modules, module-module junctions or introns.
[0044] Suppose a gene comprises three polynucleotide splice modules: a first module, a second module and a third module. In one embodiment, the three modules are exons. In another embodiment, the first module is an exon, the second module is an intron, and the third module is an exon. In another embodiment, the first module is an exon, the second module is an extension of an exon, and the third module is an exon. Suppose the gene has a first splice isoform comprising the first module, the second module and the third module. Suppose the gene has a second splice isoform comprising the first module and the third module. Now consider the following indicator polynucleotides: an indicator polynucleotide for the first module (M1); an indicator polynucleotide for the junction between the first module and the second module (J1-2); an indicator polynucleotide for the second module (M2); an indicator polynucleotide for the junction between the second module and the third module (J2-3); an indicator polynucleotide for the junction between the first module and the third module (J1-3); and an indicator polynucleotide for the third module (M3). A nucleotide array comprises a set of the indicator polynucleotides for the gene. In an embodiment, the nucleotide array comprises the indicator polynucleotides for the three junctions and for the three modules. In another embodiment, the nucleotide array comprises the indicator polynucleotides for the three junctions. In another embodiment, the nucleotide array comprises the indicator polynucleotide for the junction between the first module and the second module, and the indicator polynucleotide for the junction between the first module and the third module. In another embodiment, the nucleotide array comprises the indicator polynucleotide for the junction between the first module and the third module, and the indicator polynucleotide for the junction between the second module and the third module. In an embodiment, the nucleotide array comprises a combination or subset of the six types of indicator polynucleotides. In an embodiment, the nucleotide array comprises multiple indicator polynucleotides for a junction between a first module and a second module. In an embodiment, the nucleotide array comprises multiple indicator polynucleotides for a module.
[0045] Now consider the following expression levels for the indicator polynucleotides:
M1 = 300, M2 = 200, M3 = 300
J1-2 = 200, J1-3 = 100, J2-3 = 200
[0046] The gene expression method calculates a gene expression level G using the equations:
$$G = \sum I = \left( \prod B \right)^{1/n} \qquad (1)$$
$$B_{exon} = p + p_c \qquad (2)$$
$$B_{junc} = \sqrt{(p_5 + p_{5c}) \cdot (p_3 + p_{3c})} \qquad (3)$$
where G is the gene expression level; each I is a splice isoform; each B is the sum of expression levels of all splice isoforms at a base range targeted by one or more probes; n is the number of probes for the gene; $B_{exon}$ covers the case of an exon or intron probe, with $p$ equal to a probe signal and $p_c$ equal to the sum of signals of probes exclusive with (or complementary to) $p$; and $B_{junc}$ addresses the case of an exon-exon or exon-intron junction probe, with $p_5$ being the 5' portion of the junction and $p_3$ being the 3' portion, each portion having its own complement.
[0047] $p_5$ is the 5' portion of an exon-exon or exon-intron junction probe $p$. $p_3$ is the 3' portion of an exon-exon or exon-intron junction probe $p$. $p_c$ is the "complement", meaning the set of probes that are complementary to $p$. A complementary probe $p_c$ is mutually exclusive with a probe $p$, meaning that the same splice isoform cannot contain both the polynucleotide indicated by $p$ and the polynucleotide indicated by $p_c$. In this manner, $p_{5c}$ is the complement of the 5' portion of $p$ and $p_{3c}$ is the complement of the 3' portion of $p$.
[0048] For example, suppose there is a first splice isoform containing exon-exon junction J1-3 and a second splice isoform containing J2-3. For the first splice isoform, $p_5$ is the signal for the portion of J1-3 for exon 1; $p_{5c}$ is the complementary signal from the second splice isoform, J2-3. For the second splice isoform, $p_5$ is the signal for the portion of J2-3 for exon 2; $p_{5c}$ is the complementary signal from J1-3. The two splice isoforms have mutually exclusive splicing patterns, so each is a complement of the other. In this example, the two probes share the same 3' portion, so in both cases, $p_3$ is the signal for the portion that detects exon 3. For the first splice isoform, J2-3 is the complement. For the second splice isoform, J1-3 is the complement.
[0049] Returning to the example using the above-identified expression levels for the indicator polynucleotides,
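$$
\begin{aligned}
G &= \left( B_{\text{M1}} \cdot B_{\text{M2}} \cdot B_{\text{M3}} \cdot B_{\text{J1-2}} \cdot B_{\text{J1-3}} \cdot B_{\text{J2-3}} \right)^{1/6} \\
&= \Big( (\text{M1} + 0) \cdot (\text{M2} + \text{J1-3}) \cdot (\text{M3} + 0) \\
&\qquad \cdot \sqrt{(\text{J1-2} + \text{J1-3}) \cdot (\text{J1-2} + \text{J1-3})} \\
&\qquad \cdot \sqrt{(\text{J1-3} + \text{J1-2}) \cdot (\text{J1-3} + \text{J2-3})} \\
&\qquad \cdot \sqrt{(\text{J2-3} + \text{J1-3}) \cdot (\text{J2-3} + \text{J1-3})} \Big)^{1/6} \\
&= \Big( 300 \cdot (200 + 100) \cdot 300 \\
&\qquad \cdot \sqrt{(200 + 100) \cdot (200 + 100)} \\
&\qquad \cdot \sqrt{(100 + 200) \cdot (100 + 200)} \\
&\qquad \cdot \sqrt{(200 + 100) \cdot (200 + 100)} \Big)^{1/6} \\
&= \left( 300^6 \right)^{1/6} = 300.
\end{aligned}
$$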
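The calculation above can be reproduced with a short script. This is a minimal sketch of equations (1) to (3), not the applicants' software: the function names `b_exon`, `b_junction` and `gene_expression` are hypothetical, and the sketch assumes, as the worked example does, that both the 5' and 3' portions of a junction probe take the junction probe's own signal.

```python
import math

def b_exon(p, complements=()):
    # Equation (2): probe signal plus the signals of all probes exclusive with it.
    return p + sum(complements)

def b_junction(p, complements_5=(), complements_3=()):
    # Equation (3): geometric mean of the 5' total and the 3' total.
    # Assumption: both portions use the junction probe's own signal p.
    return math.sqrt((p + sum(complements_5)) * (p + sum(complements_3)))

def gene_expression(b_values):
    # Equation (1): n-th root of the product of the per-probe totals B.
    return math.prod(b_values) ** (1.0 / len(b_values))

# Expression levels from the example.
signals = {"M1": 300, "M2": 200, "M3": 300, "J1-2": 200, "J1-3": 100, "J2-3": 200}

b_values = [
    b_exon(signals["M1"]),                      # constitutive: no complement
    b_exon(signals["M2"], [signals["J1-3"]]),   # M2 is exclusive with J1-3
    b_exon(signals["M3"]),                      # constitutive: no complement
    b_junction(signals["J1-2"], [signals["J1-3"]], [signals["J1-3"]]),
    b_junction(signals["J1-3"], [signals["J1-2"]], [signals["J2-3"]]),
    b_junction(signals["J2-3"], [signals["J1-3"]], [signals["J1-3"]]),
]

G = gene_expression(b_values)
print(round(G, 6))  # approximately 300, matching the hand calculation
```

Every per-probe total B evaluates to 300 here, which is why the geometric mean returns the same value regardless of which subset of probes is used.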
[0050] In an embodiment the gene expression method takes background levels into account. Suppose each of the 6 indicator polynucleotides has a background level (k) equal to 100. The background level can be subtracted from indicator polynucleotides when solving the equation. It can be subtracted from all indicator polynucleotides. Alternatively, it can be subtracted from the 'complementary' indicator polynucleotides in the equation. E.g.,
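$$
\begin{aligned}
G &= \left( B_{\text{M1}} \cdot B_{\text{M2}} \cdot B_{\text{M3}} \cdot B_{\text{J1-2}} \cdot B_{\text{J1-3}} \cdot B_{\text{J2-3}} \right)^{1/6} \\
&= \Big( (\text{M1} + 0) \cdot (\text{M2} + \text{J1-3} - k) \cdot (\text{M3} + 0) \\
&\qquad \cdot \sqrt{(\text{J1-2} + \text{J1-3} - k) \cdot (\text{J1-2} + \text{J1-3} - k)} \\
&\qquad \cdot \sqrt{(\text{J1-3} + \text{J1-2} - k) \cdot (\text{J1-3} + \text{J2-3} - k)} \\
&\qquad \cdot \sqrt{(\text{J2-3} + \text{J1-3} - k) \cdot (\text{J2-3} + \text{J1-3} - k)} \Big)^{1/6} \\
&= \Big( 300 \cdot (200 + 100 - 100) \cdot 300 \\
&\qquad \cdot \sqrt{(200 + 100 - 100) \cdot (200 + 100 - 100)} \\
&\qquad \cdot \sqrt{(100 + 200 - 100) \cdot (100 + 200 - 100)} \\
&\qquad \cdot \sqrt{(200 + 100 - 100) \cdot (200 + 100 - 100)} \Big)^{1/6} \\
&= \Big( 300 \cdot 200 \cdot 300 \cdot \sqrt{200 \cdot 200} \cdot \sqrt{200 \cdot 200} \cdot \sqrt{200 \cdot 200} \Big)^{1/6} \\
&= \left( 300^2 \cdot 200^4 \right)^{1/6}
\end{aligned}
$$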
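The background-subtracted variant can be checked numerically in the same way. The sketch below is illustrative only; the helper names `total` and `junction_total` are hypothetical, and it follows the reading used in the example, in which the background level k is subtracted once wherever a complementary signal is added and probes without a complement are left unchanged.

```python
import math

K = 100  # background level assumed for every indicator polynucleotide

def total(p, complement=None, k=K):
    # Probe signal plus its background-subtracted complement. A probe with
    # no complement (the "+ 0" terms in the text) is returned unchanged.
    return p if complement is None else p + complement - k

def junction_total(p, comp5, comp3, k=K):
    # Geometric mean of the 5' and 3' background-subtracted totals.
    return math.sqrt(total(p, comp5, k) * total(p, comp3, k))

M1, M2, M3 = 300, 200, 300
J12, J13, J23 = 200, 100, 200

b_values = [
    total(M1),
    total(M2, J13),
    total(M3),
    junction_total(J12, J13, J13),
    junction_total(J13, J12, J23),
    junction_total(J23, J13, J13),
]

G_bg = math.prod(b_values) ** (1.0 / len(b_values))
print(round(G_bg, 2))  # (300^2 * 200^4)^(1/6), roughly 228.9
```

With these inputs the two constitutive module probes keep a total of 300 while the other four totals drop to 200, so the result falls between the two values.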
[0051] In an embodiment the gene expression method uses a different background level for each indicator polynucleotide.
[0052] In an embodiment, the gene expression method calculates a gene expression level from the expression levels of the junction indicator polynucleotides:
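$$
\begin{aligned}
G &= \left( B_{\text{J1-2}} \cdot B_{\text{J1-3}} \cdot B_{\text{J2-3}} \right)^{1/3} \\
&= \Big( \sqrt{(\text{J1-2} + \text{J1-3}) \cdot (\text{J1-2} + \text{J1-3})} \\
&\qquad \cdot \sqrt{(\text{J1-3} + \text{J1-2}) \cdot (\text{J1-3} + \text{J2-3})} \\
&\qquad \cdot \sqrt{(\text{J2-3} + \text{J1-3}) \cdot (\text{J2-3} + \text{J1-3})} \Big)^{1/3} \\
&= \Big( \sqrt{(200 + 100) \cdot (200 + 100)} \\
&\qquad \cdot \sqrt{(100 + 200) \cdot (100 + 200)} \\
&\qquad \cdot \sqrt{(200 + 100) \cdot (200 + 100)} \Big)^{1/3} \\
&= \left( 300^3 \right)^{1/3} = 300
\end{aligned}
$$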
[0053] In some embodiments, the gene expression calculation method subtracts background levels when calculating a gene expression level from junction indicator polynucleotides.
[0054] In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes. In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes and exon-intron junction probes. In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes and exon probes. In an embodiment, the gene expression method calculates a gene expression level using exon-exon junction probes, exon-intron junction probes and exon probes. In an embodiment, the gene expression method calculates a gene expression level using exon probes.
[0055] The gene expression method generalizes to cases where there are more than two splice variants. It applies to cases where there are 3 or more splice isoforms, or 5 or more splice isoforms, or 10 or more splice isoforms. It applies to cases where a splice isoform is predicted, or where multiple splice isoforms are predicted. It applies to cases where there are more than three modules, or more than 5 modules, or more than 10 modules, or more than 20 modules, or more than 30 modules. It can be used for a set of indicator polynucleotides for all exon-exon junctions of a splice variant of a gene, or all exon-exon junctions of two or more splice variants of a gene, or all exon-exon junctions of three or more, or five or more, or ten or more splice variants of a gene. It can be used for a set of indicator polynucleotides for all exon-exon junctions and all exons of a splice variant of a gene, or for two or more, or three or more, or five or more, or ten or more splice variants of a gene. It can be used for genes that undergo alternative splicing leading to exon skips, exon inclusions, exon trims, exon extensions, intron retentions, intron splicing, alternative first exons, alternative last exons, different promoters, or different lengths of 3' UTR. It can be applied to a single gene, or to a plurality of genes, or to all genes on an array, or to all genes in an organism's genome, or to a plurality of genes from multiple organisms. It applies to expression level data from a database, to expression level data from a computer file, or to expression level data obtained from a storage medium, including any of the types mentioned in the provisional applications incorporated by reference, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc.
[0056] A storage medium may store a gene expression level derived using the methods above. The storage medium may be any of the types of storage medium already mentioned, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc. A storage medium may store a plurality of gene expression levels derived using the methods above. The storage medium may comprise data for 10 or more gene expression levels, or 100 or more gene expression levels, or 1000 or more gene expression levels, or 10000 or more gene expression levels. It may comprise data for gene expression levels for one or more samples, or 2 or more samples, or 10 or more samples, or 20 or more samples, or 50 or more samples, or 100 or more samples, or 1000 or more samples, or 10000 or more samples.
[0057] Gene expression level data calculated using the methods above may be transmitted in various ways: over a network by TCP/IP, FTP, SMTP, in an email attachment, by courier, by mail, by copying onto a CD or DVD or memory device and transferring from one storage medium to another. It may be transferred in any of the ways mentioned in the patent applications incorporated by reference, or in any other way that may occur to one of skill in the art. The data store and data transmission are aspects of the present invention that are useful in their own right.
[0058] A normalization method uses a gene expression level to normalize expression levels for indicator polynucleotides for a gene. A normalization method applies a mathematical equation to normalize expression levels for indicator polynucleotides for a gene. In an embodiment, it applies an equation of the form:
$$S = p \cdot M / G \qquad (4)$$
wherein S is the gene-normalized signal of a probe in a sample, p is the probe signal, G is the gene expression level, and M is a scalar, such as the geometric mean of the gene expression levels in the sample and at least one other sample.
[0059] In an embodiment, M's value is a function of the gene expression level relative to the probe in one or more samples (the "local gene expression level"). In an embodiment, M's value is a function of the local gene expression level in the sample. In another embodiment, M's value is a function of the local gene expression level in two or more samples. In an embodiment, M's value is the average of the local gene expression level in two or more samples (e.g., if the local gene expression levels in two samples are 500 and 1000, M's value is the average, which is 750). In another embodiment, M equals the geometric mean of the local gene expression levels in two or more samples. In another embodiment, M equals the median of the local gene expression levels in two or more samples. In another embodiment, M equals the geometric median of the local gene expression levels in two or more samples. In another embodiment, M equals an arbitrary function of the local gene expression levels in two or more samples. For example, M might discard outliers and take a function, such as the geometric mean, of the remaining values.
[0060] In an embodiment, G is calculated using the equation above. In another embodiment, G is calculated by combining proximal probes relative to the probe for which p is the signal, such as the flanking exon signals for an exon-exon junction probe for an exon skip, or the one flanking exon signal for an alternative first or last exon, or the flanking exon signals for a probe for a retained intron, or the flanking exon signals for a probe for an alternative acceptor or donor site.
[0061] Examples for calculation of G as a "global gene expression level" appear above. For "local gene expression levels", here are some examples. Suppose a short splice form probe (an exon-exon junction probe for a skipped exon, a spliced intron, or an alternative donor or acceptor site that truncates an exon) has two flanking exons with probe signals of 200 and 250. The local transcription level G could be calculated as F(exon1, exon2). In an embodiment, F takes the average of the two exon signals, so in our example, the value is average(200, 250) = 225. In another embodiment, F takes the median. In another, it takes the geometric mean = square root (200 * 250). In another embodiment, it takes the geometric median. In another embodiment, it performs an arbitrary function on the flanking exon probes. Or suppose the probe is an exon-exon junction probe or an exon probe for an alternative first exon. F(common_exon) computes a signal value for the first nearest common exon between the splice forms. For example, suppose a gene has a first spliceoform that contains J1-3 and a second spliceoform that contains J2-3. Probes for these two junctions can be normalized using the transcriptional level G computed as F(common_exon) = F(E3). In an embodiment, F is the identity function. In another embodiment, F is an arbitrary function.
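As an illustration of the local gene expression level, the following sketch evaluates F for the flanking exon signals of 200 and 250 under three of the choices named above. The function name `local_expression` and its `how` parameter are hypothetical, and the signal of 320 used for the common exon E3 is an invented placeholder, not a value from the specification.

```python
import math
import statistics

def local_expression(flanking_signals, how="mean"):
    # F applied to the signals of the exons that flank (or are common to)
    # the alternative splice site. Only three choices of F are sketched.
    if how == "mean":
        return statistics.mean(flanking_signals)
    if how == "median":
        return statistics.median(flanking_signals)
    if how == "geometric_mean":
        return math.prod(flanking_signals) ** (1.0 / len(flanking_signals))
    raise ValueError(f"unsupported choice of F: {how}")

flanking = [200, 250]
g_mean = local_expression(flanking, "mean")            # average of the two exons
g_geo = local_expression(flanking, "geometric_mean")   # square root of 200 * 250

# Alternative first exon: a single common exon (E3), with F as the identity.
e3_signal = 320  # placeholder value for illustration only
g_common = local_expression([e3_signal], "mean")

print(g_mean, round(g_geo, 3), g_common)
```

For a single common exon, the mean of a one-element list reduces to the identity function, which corresponds to the embodiment in which F is the identity.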
[0062] Here follows an example of applying the normalization of equation 4. Suppose a nucleotide array comprises a first indicator polynucleotide (p1) for a gene and a second indicator polynucleotide (p2) for the gene. Suppose the first indicator polynucleotide has an expression level of 400 in a first sample and an expression level of 800 in a second sample. Suppose the second indicator polynucleotide has an expression level of 600 in the first sample and 1200 in the second sample. Suppose the gene has an expression level of 500 in the first sample and an expression level of 1000 in the second sample. Applying the equation with M equal to the geometric mean of the two gene expression levels, we get, in the first sample:
$$S_1 = 400 \cdot \sqrt{500 \cdot 1000} \, / \, 500 \approx 565.7$$
$$S_2 = 600 \cdot \sqrt{500 \cdot 1000} \, / \, 500 \approx 848.5$$
And in the second sample:
$$S_1 = 800 \cdot \sqrt{500 \cdot 1000} \, / \, 1000 \approx 565.7$$
$$S_2 = 1200 \cdot \sqrt{500 \cdot 1000} \, / \, 1000 \approx 848.5$$
[0063] Note that S1 is equal in both samples, and S2 is equal in both samples. The normalization adjusts the values based on gene expression level. The remaining differences between the two samples may result from RNA splicing. In this case, there are no remaining differences between the two samples, hence there may not be a difference in RNA splicing.
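The normalization in this example can be sketched as follows. The code uses the probe signals as they appear in the equations above (400 and 600 in the first sample, 800 and 1200 in the second) and takes M as the geometric mean of the two gene expression levels; the function name `normalize` and the dictionary layout are hypothetical.

```python
import math

def normalize(p, g, m):
    # Equation (4): S = p * M / G
    return p * m / g

gene_levels = {"sample1": 500, "sample2": 1000}

# M: geometric mean of the gene expression levels across the two samples.
M = math.sqrt(gene_levels["sample1"] * gene_levels["sample2"])

probe_signals = {
    "sample1": {"p1": 400, "p2": 600},
    "sample2": {"p1": 800, "p2": 1200},
}

normalized = {
    sample: {name: normalize(p, gene_levels[sample], M) for name, p in probes.items()}
    for sample, probes in probe_signals.items()
}

for sample, values in normalized.items():
    print(sample, {name: round(s, 1) for name, s in values.items()})
```

Because each probe doubles between samples exactly as the gene level does, the normalized signals agree across samples, which is the situation the text describes as showing no evidence of a splicing difference.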
[0064] Consider a second example: suppose a nucleotide array comprises a first indicator polynucleotide (p3) for a gene and a second indicator polynucleotide (p4) for the gene. Suppose the first indicator polynucleotide has an expression level of 400 in a first sample and an expression level of 1200 in a second sample. Suppose the second indicator polynucleotide has an expression level of 600 in the first sample and 800 in the second sample. Suppose the gene has an expression level of 500 in the first sample and an expression level of 1000 in the second sample. Applying the equation, we get, in the first sample:
$$S_3 = 400 \cdot \sqrt{500 \cdot 1000} \, / \, 500 \approx 565.7$$
$$S_4 = 600 \cdot \sqrt{500 \cdot 1000} \, / \, 500 \approx 848.5$$
And in the second sample:
$$S_3 = 1200 \cdot \sqrt{500 \cdot 1000} \, / \, 1000 \approx 848.5$$
$$S_4 = 800 \cdot \sqrt{500 \cdot 1000} \, / \, 1000 \approx 565.7$$
[0065] Note that now S3 in the second sample is larger than S3 in the first sample, while S4 in the second sample is smaller than S4 in the first sample. The difference may suggest a difference in RNA splicing between the two samples.
[0066] In an embodiment, the normalization method normalizes a probe signal using a global or local gene expression level derived from more than two samples; e.g., instead of normalizing the two samples to each other with the value of √(500 * 1000), one could normalize them using another value of M, such as the geometric mean of gene expression levels in each of a plurality of samples. In another embodiment, the normalization method normalizes a probe signal in a first sample using a gene expression level from a second sample. E.g., instead of normalizing the two samples above to each other with the value of √(500 * 1000), one could normalize using either of the gene expression levels (500 or 1000).
[0067] In an embodiment, the normalization method applies a mathematical equation to expression level data for a plurality of indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more. In an embodiment, the normalization method applies a mathematical equation to expression level data for a plurality of exon indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more. In an embodiment, the normalization method applies a mathematical equation to expression level data for a plurality of exon-exon junction indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more. In an embodiment, the normalization method applies a mathematical equation to expression level data for a plurality of exon indicator polynucleotides and exon-exon junction indicator polynucleotides: 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1000000 or more. The method applies to cases where there are more than two splice variants. It applies to cases where there are 3 or more splice isoforms, or 5 or more splice isoforms, or 10 or more splice isoforms. It applies to cases where a splice isoform is predicted, or where multiple splice isoforms are predicted. It applies to cases where there are more than three modules, or more than 5 modules, or more than 10 modules, or more than 20 modules, or more than 30 modules. It can be used for a set of indicator polynucleotides for all exon-exon junctions of a splice variant of a gene, or all exon-exon junctions of two or more splice variants of a gene, or all exon-exon junctions of three or more, or five or more, or ten or more splice variants of a gene. It can be used for a set of indicator polynucleotides for all exon-exon junctions and all exons of a splice variant of a gene, or for two or more, or three or more, or five or more, or ten or more splice variants of a gene. It can be used for genes that undergo alternative splicing leading to exon skips, exon inclusions, exon trims, exon extensions, intron retentions, intron splicing, alternative first exons, alternative last exons, different promoters, or different lengths of 3' UTR. It can be applied to a single gene, or to a plurality of genes, or to all genes on an array, or to all genes in an organism's genome, or to a plurality of genes from multiple organisms. It applies to expression level data from a database, to expression level data from a computer file, or to expression level data obtained from a storage medium, including any of the types mentioned in the provisional applications incorporated by reference, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc.
[0068] A storage medium may store normalized expression level data derived using the methods above. The storage medium may be any of the types of storage medium already mentioned, such as computer memory, RAM, an ASCII file, a binary file, a CSV file, a tab-delimited file, an XML file, a database, a hard drive, flash memory, USB memory, a CD, a DVD, a network store, a tape drive, etc. A storage medium may store a plurality of normalized expression levels derived using the methods above. The storage medium may comprise data for 10 or more normalized expression levels, or 100 or more normalized expression levels, or 1000 or more normalized expression levels, or 10000 or more normalized expression levels. It may comprise data for normalized expression levels for one or more samples, or 2 or more samples, or 10 or more samples, or 20 or more samples, or 50 or more samples, or 100 or more samples, or 1000 or more samples, or 10000 or more samples.
[0069] Normalized expression level data calculated using the methods above may be transmitted in various ways: over a network by TCP/IP, FTP, SMTP, in an email attachment, by courier, by mail, by copying onto a CD or DVD or memory device and transferring from one storage medium to another. It may be transferred in any of the ways mentioned in the patent applications incorporated by reference, or in any other way that may occur to one of skill in the art. The data store and data transmission are aspects of the present invention that are useful in their own right.
[0070] It will be appreciated that the methods described above, and the analysis described below, can be accomplished using a software program written to conduct the described methods or analysis. The software is stored on a suitable storage medium, such as those already mentioned above.
IV. Analysis of splicing patterns
[0071] A splicing analysis method computes a score for changes in splicing. In an embodiment, it computes a generalized Splicing Index. The Splicing Index can be defined as:
$$\text{Splicing Index} = R_{short} - R_{long} \qquad (5)$$
$$R_{short} = \log(J_s / J_t) - \Big[ \sum_i \log(D_{is} / D_{it}) \Big] / m \qquad (6)$$
$$R_{long} = \Big[ \sum_i \log(K_{is} / K_{it}) \Big] / n - \Big[ \sum_i \log(D_{is} / D_{it}) \Big] / m \qquad (7)$$
where J is a junction signal for a short splice form (an exon skip, later first exon, earlier last exon, intron splice or truncated exon from alternative donor or acceptor); $D_i$ are signals for m exons that flank the alternative splice site (m=1 for an alternative first or last exon, m=2 for all other splice events); $K_i$ are signals for n long splice form junctions or exons; and s and t are samples or computational pools of samples using an equation such as mean, geometric mean, median, or geometric median, perhaps omitting outliers.
[0072] As an example of the generalized Splicing Index, consider a single-exon skip event with long splice form probes for J1-2, E2 and J2-3 which include the exon, and a short splice form probe for J1-3 which skips the exon. Probes for the flanking E1 and E3 may be included as well.
[0073] Suppose the signals in sample s are as follows:
J1-2 = E2 = J2-3 = 200
J1-3 = 400
E1 = E3 = 600
and in sample t they are
J1-2 = E2 = J2-3 = 400
J1-3 = 200
E1 = E3 = 600
Hence,
$$R_{short} = \log(400/200) - \big( \log(600/600) + \log(600/600) \big) / 2 = \log 2$$
$$R_{long} = \big( \log(200/400) + \log(200/400) + \log(200/400) \big) / 3 - \big( \log(600/600) + \log(600/600) \big) / 2 = \log(1/2)$$
$$\text{Splicing Index} = \log 2 - \log(1/2) = 2 \log 2$$
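The exon skip example can be verified with a few lines of code. The sketch below implements equations (5) to (7) directly; the function names are hypothetical, and base-2 logarithms are an assumption made here for readability, since the equations do not fix the base.

```python
import math

def mean_log_ratio(values_s, values_t):
    # Average of log2(x_s / x_t) over paired probe signals.
    ratios = [math.log2(a / b) for a, b in zip(values_s, values_t)]
    return sum(ratios) / len(ratios)

def splicing_index(j_s, j_t, k_s, k_t, d_s, d_t):
    # Equations (6) and (7) share the flanking-exon term.
    flank = mean_log_ratio(d_s, d_t)
    r_short = math.log2(j_s / j_t) - flank
    r_long = mean_log_ratio(k_s, k_t) - flank
    # Equation (5)
    return r_short, r_long, r_short - r_long

r_short, r_long, index = splicing_index(
    j_s=400, j_t=200,                 # short form junction J1-3
    k_s=[200, 200, 200],              # long form probes J1-2, E2, J2-3 in sample s
    k_t=[400, 400, 400],              # the same probes in sample t
    d_s=[600, 600], d_t=[600, 600],   # flanking exons E1 and E3
)
print(r_short, r_long, index)
```

In base 2 the short form term is 1, the long form term is -1 and the index is 2, that is, twice log 2. The flanking-exon term is zero here and would cancel in the difference in any case.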
[0074] In one embodiment, the generalized Splicing Index omits the signal for the skipped exon. It should be noted that the term for the flanking exons cancels out. Consider the above example of an exon skip, but without J1-2. This example without J1-2 addresses the case of an alternative first exon. Either the gene begins with J1-3 or with J2-3. The generalized Splicing Index can be applied as before, except that the term for J1-2 will be omitted. Consider the above example of an exon skip, but without J2-3. This example without J2-3 addresses the case of an alternative last exon. Either the gene ends with J1-3 or with J1-2. The generalized Splicing Index can be applied as before, except that the term for J2-3 will be omitted.
[0075] In an embodiment, the generalized Splicing Index combines long form probes using a function other than the average log ratio. It applies a function of the form $F(\log(K_{is}/K_{it}))$, where K, i, s and t are defined as above. F might be the mean, minimum, maximum, median, or other function of the values of $K_i$. In an embodiment, F takes the minimum unless two $\log(K_{is}/K_{it})$ pairs have opposite signs, in which case F gives the value of zero (0).
[0076] The splicing analysis method computes a score for splicing changes using a generalized ASPIRE algorithm:
$$|\text{ASPIRE}| = k \cdot \min(|R_{short}|, |R_{long}|); \quad R_{short} \cdot R_{long} < 0 \qquad (8)$$
$$R_{short} = \log(J_s / J_t) \qquad (9)$$
$$R_{long} = \min_i \big( \log(K_{is} / K_{it}) \big) \qquad (10)$$
where the variables have the same meanings as for the generalized Splicing Index, and k is a constant, such as 2. If any $\log(K_{is}/K_{it})$ pair has opposite signs, the data analysis method sets $R_{long} = 0$.
[0077] For example, take the example data from the generalized Splicing Index for an exon skip. Let k = 2. Then:
$$|\text{ASPIRE}| = 2 \cdot \min(|\log 2|, |\log(1/2)|) = 2 \log 2.$$
[0078] In this case, the generalized ASPIRE value and the generalized Splicing Index value are the same. Now suppose $R_{short} = \log 4$.
$$|\text{ASPIRE}| = 2 \cdot \min(|\log 4|, |\log(1/2)|) = 2 \cdot |\log(1/2)|.$$
[0079] In an embodiment, the splicing analysis method sets the sign of the ASPIRE value based on $R_{short}$. In that case, for these two examples, the ASPIRE value is positive. In another embodiment, the splicing analysis method sets the sign of the ASPIRE value based on $R_{long}$. In that case, for these two examples, the ASPIRE value is negative.
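A compact sketch of the generalized ASPIRE score, covering both worked examples and both sign conventions, is given below. The function name `aspire` and its parameters are hypothetical. Returning zero when the short and long ratios do not have opposite signs is an assumption about how the condition in equation (8) is applied, and base-2 logarithms are assumed.

```python
import math

def aspire(r_short, long_log_ratios, k=2.0, sign_from="short"):
    # Equation (10): R_long is the minimum long-form log ratio, or zero
    # when the long-form log ratios disagree in sign.
    has_pos = any(r > 0 for r in long_log_ratios)
    has_neg = any(r < 0 for r in long_log_ratios)
    r_long = 0.0 if (has_pos and has_neg) else min(long_log_ratios)

    # Equation (8) requires the short and long forms to move in opposite
    # directions. Assumption: the score is zero otherwise.
    if r_short * r_long >= 0:
        return 0.0

    magnitude = k * min(abs(r_short), abs(r_long))
    reference = r_short if sign_from == "short" else r_long
    return math.copysign(magnitude, reference)

long_ratios = [math.log2(200 / 400)] * 3   # J1-2, E2, J2-3 from the exon skip example

first = aspire(math.log2(400 / 200), long_ratios)                   # R_short = log 2
second = aspire(math.log2(4), long_ratios)                          # R_short = log 4
negative = aspire(math.log2(4), long_ratios, sign_from="long")      # sign from R_long
print(first, second, negative)
```

Both examples give a magnitude of twice log 2 (that is, 2 in base 2) because the minimum is limited by the long form ratio in the second case.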
[0080] As with the generalized Splicing Index, the generalized ASPIRE equation can address all types of splicing events.
[0081] In an embodiment, the generalized ASPIRE algorithm combines long form probes using a function other than the minimum log ratio. It applies a function of the form $F(\log(K_{is}/K_{it}))$, where K, i, s and t are defined as above. F might be the mean, minimum, maximum, median, or other function of the values of $K_i$.
[0082] The splicing analysis method computes a score for splicing changes using a Splice Ratio equation that combines elements of the generalized Splicing Index and generalized ASPIRE. It identifies splicing changes that occur in opposite directions, but relative to the global or local gene expression level G rather than in absolute terms, as is the case with ASPIRE:
ISplice Ratio] = 2 * min (|R'short |, |R'ιOng|); R'short * R'bng < 0 (11 ) where R' is the ratio computed after normalizing all probes by the gene expression level using equation 4. In an embodiment, the splicing analysis method uses a global gene expression level to normalize probes. In another embodiment, the data analysis method uses a local gene expression level to normalize probes. [0083] In an embodiment, the splicing analysis method applies a mathematical algorithm to determine a splicing score for a splice event with more than two alternatives. . In an embodiment, the splicing analysis method applies a variant of the generalized Splicing Index, generalized ASPIRE algorithm or Splice Ratio to determine a splicing score for such a splice event. For example, suppose there are three spliceoforms for a gene with four possible exons. The spliceoforms are S1-3-4, S1-4 and S1-2-3-4. Suppose exon probes target each of the four exons (E1 , E2, E3 and E4), and exon-exon junction probes target each of the five distinct exon-exon junctions (J1-3, J3-4, J1-4, J1-2 and J2-3). In an embodiment, the splicing analysis method applies a mathematical algorithm to determine a splicing score for each spliceoform. Suppose J1-2, E2 and J2-3 all have signals of 200 in a first sample and 400 in a second sample, J 1-3 has a signal of 100 in the first and second samples, and J1-4 has a signal of 300 in the first sample and 600 in the second samples. In this case, the probes determine a signal for each of the three spliceoforms. The splicing analysis method applies a mathematical algorithm to the probe signals pertaining to the three spliceoforms. For example, the algorithm could assign a score for S1 -2-3-4 = log (200 / 400) - log ((100 + 300) / (100 + 600)). In an embodiment, the splicing analysis method sums the signals of each long form probe and treats the sum of signals as if it were a single alternative splice form. 
In other words, $R_{long}$ = the log of the sum of signals for each long splice form in the first sample minus the log of the sum of signals for each long splice form in the second sample.
$$R_{long} = F\left(\log\left[L_{js} / L_{jt}\right]\right) \qquad (12)$$
where F is a function such as the minimum or mean and each $L_j$ is a long splice form in a sample s or t. L is further defined as
$$L = G(K_{is}) \qquad (13)$$
where G is a function such as the sum or geometric mean and each $K_i$ is a probe detecting the long splice form. In an embodiment, this value for $R_{long}$ is substituted into the generalized Splicing Index, generalized ASPIRE or Splice Ratio equations above.
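The three-spliceoform example and the Splice Ratio of equation 11 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, base-2 logarithms are assumed (the example above does not state a base), and returning `None` when the two ratios do not have opposite signs is an assumption, since equation 11 only defines the score under that condition.

```python
import math


def pooled_score(form_s, form_t, alternatives_s, alternatives_t):
    """Score for one spliceoform against the summed signals of its alternatives.

    Mirrors: log(form_s / form_t) - log(sum(alternatives_s) / sum(alternatives_t)).
    Base-2 logs are an assumption.
    """
    form_ratio = math.log2(form_s / form_t)
    pooled_ratio = math.log2(sum(alternatives_s) / sum(alternatives_t))
    return form_ratio - pooled_ratio


def splice_ratio(r_short, r_long):
    """Equation 11: 2 * min(|R'short|, |R'long|), defined only when the
    normalized ratios change in opposite directions (assumed None otherwise)."""
    if r_short * r_long >= 0:
        return None
    return 2 * min(abs(r_short), abs(r_long))


# S1-2-3-4 has signal 200 -> 400; the alternatives J1-3 and J1-4 have
# signals (100, 300) in the first sample and (100, 600) in the second.
score_s1234 = pooled_score(200, 400, [100, 300], [100, 600])
print(f"score for S1-2-3-4: {score_s1234:.4f}")
print("splice ratio:", splice_ratio(0.5, -0.3))
```

Summing the alternative signals before taking the log corresponds to choosing G as the sum in equation 13; a geometric mean could be substituted without changing the rest of the sketch.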
[0084] In an embodiment, the splicing analysis method converts a splicing score to a fraction or a percentage. For example, suppose the score is -2 in log2; i.e., the fold change is -4. Therefore the signal for the short splice form is 1/4 the signal of the long splice form in a first sample relative to a second sample. Hence, the fraction is 0.2 for the short splice form and 0.8 for the long splice form, and the percentages are 20% and 80% respectively. In an embodiment, the splicing analysis method converts a splicing score for a single sample to a fraction or a percentage. For example, replace the variables for the second sample, sample t, with one ('1') in an equation above for generalized Splicing Index, generalized ASPIRE or Splice Ratio. A splicing score can then be calculated straightforwardly for the first sample, sample s. For each spliceoform, a percentage or fraction can be determined. For example, consider the case of an alternative acceptor site. A short splice form might have a signal of 400 and a long splice form might have three probes (a module probe, an exon-exon junction probe and a module-module junction probe) having a geometric mean signal of 600. The Splicing Index could be calculated using equation 1 as log 400 - log 600. The resulting value can be converted to a fraction using the equations:
$$F + F_c = 1 \qquad (14)$$
$$\log (F / F_c) = \log (400 / 600) \qquad (15)$$
where F is a fraction of gene expression deriving from a first spliceoform and $F_c$ is a fraction of gene expression deriving from a second spliceoform or two or more alternative spliceoforms mutually exclusive with the first spliceoform. Multiplying F by 100 gives a percentage.
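Solving equations 14 and 15 together gives F = r / (1 + r), where r = F / Fc is the linear ratio recovered from the log score. The following Python sketch works through both numeric examples above; the function name and the default log base of 2 are assumptions made for illustration.

```python
import math


def fractions_from_score(score, base=2.0):
    """Convert a log-scale splicing score into (F, Fc) with F + Fc = 1.

    score is log_base(F / Fc); the base is an assumed parameter.
    """
    ratio = base ** score          # F / Fc
    f = ratio / (1.0 + ratio)      # from F + Fc = 1
    return f, 1.0 - f


# Alternative acceptor example: short form 400, long form 600.
f_short, f_long = fractions_from_score(math.log2(400 / 600))
print(f"short: {100 * f_short:.0f}%, long: {100 * f_long:.0f}%")

# Score of -2 in log2: the short form is one quarter of the long form.
f_short2, f_long2 = fractions_from_score(-2.0)
print(f"short: {100 * f_short2:.0f}%, long: {100 * f_long2:.0f}%")
```

The choice of base does not affect the resulting fractions as long as the same base is used to compute and to invert the score.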
[0085] In an embodiment, the splicing analysis method converts a splicing score for two spliceoforms to a fraction or a percentage. The above examples address this case. In another embodiment, the splicing analysis method converts a splicing score for three or more spliceoforms in one sample to a fraction or a percentage. Equation 12 provides a way to compute a splicing score using several of the algorithms above. Equations 14 and 15 provide a way to compute a fraction or a percentage for each spliceoform. In an embodiment the splicing analysis method converts a splicing score for two spliceoforms in two samples to a fraction or a percentage. In another embodiment the splicing analysis method converts a splicing score for two spliceoforms in more than two samples, or more than two spliceoforms in two samples, or more than two spliceoforms in more than two samples, to a fraction or a percentage.
[0086] One of skill in the art will appreciate that a spliceoform and a splice event (an alternatively spliced region of a gene) are terminologies that may be interchangeable depending on the gene model. All of the events above are intended to apply to whole spliceoforms as well as to specific alternatively spliced gene regions.
[0087] In an embodiment, the splicing analysis method applies a mathematical algorithm to determine a splicing score for an indicated polynucleotide such as a single exon, intron, exon-exon junction, exon-intron junction, module or module-module junction (as opposed to a splice event or a spliceoform). In an embodiment, the splicing analysis method applies a generalized Splicing Index, generalized ASPIRE or Splice Ratio algorithm to determine a splicing score for an indicated polynucleotide. For a short splice form, this is straightforward, since there will typically be only one exon-exon junction identifying the splice form. (There may be multiple measurements for the exon-exon junction, of course. In the case of multiple probes for an indicated polynucleotide (one of the set of an exon-exon junction, an exon, an exon-intron junction, a module, a module-module junction and an intron), the splicing analysis method applies a mathematical equation to determine a signal for the multiple probes. For example, the probe signal intensities might be averaged, log averaged, a median value might be used, outliers might be discarded, the values might be first normalized using a z-score, etc. The resulting value derived from multiple probes can be substituted for J (a short form junction) or K (long form junctions, exons, introns or modules) in the equations above.) For a long splice form, each indicated polynucleotide can be treated individually. This gives a splicing score for each probe. Conceptually, the change involves treating each indicated polynucleotide as a "short form" in the equations above, and the mutually exclusive splice events or spliceoforms as the "long form".
[0088] An example will make this clear. Take the three spliceoforms described above: S1-3-4, S1-2-3-4, and S1-4. A splicing score can be calculated for J1-2. Suppose the signal measured for J1-2 is 300.
This indicated polynucleotide (sometimes, for simplicity's sake, referred to as a probe, even though in fact this 'probe' may comprise multiple probes) can be treated as a 'short form' in the equations above such as Equation 1. The corresponding 'long forms' relative to this indicated polynucleotide are J1-3 and J1-4. Similarly, a splicing score can be calculated for E2. Again the 'long forms' relative to the indicated polynucleotide are J1-3 and J1-4. It is interesting to note that the splicing scores for J1-2, E2 and J2-3 may be different, although in theory, if the gene model comprises only three spliceoforms, the scores should be the same, since all three are specific to S1-2-3-4. Differences may arise because of an unexpected additional spliceoform (a surprising result) or because of experimental error. Application of an equation above can determine a splicing score, a fraction or a percent composition for each indicated polynucleotide.
[0089] The splicing analysis method uses a score for splicing changes to filter data for one or more splice events. In other words, the splicing analysis method calculates a score for one or more splice events, and the splice event passes the filter if the score satisfies some filtering criterion, F(score). In an embodiment, the splicing analysis method accepts a score if the absolute value exceeds a constant value. For example, suppose there are ten exon skip events detected by probes on a microarray, and a Splicing Index score is calculated for each. The splicing analysis method may filter out any events that have a score with an absolute value less than 1.0 in log base two (i.e., a splice fold change of less than two). Suppose that two of the exon skip events pass the filtering criterion.
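The absolute-value filter in this example can be sketched as follows. The event identifiers and score values are invented for illustration (ten hypothetical exon skip events, two of which pass), and the function name is hypothetical.

```python
def filter_by_abs_score(scores, min_abs_score=1.0):
    """Keep events whose |score| is at least min_abs_score (log2 units)."""
    return {event: s for event, s in scores.items() if abs(s) >= min_abs_score}


# Hypothetical Splicing Index scores (log2) for ten exon skip events.
scores = {
    "skip_01": 0.20, "skip_02": -1.40, "skip_03": 0.05, "skip_04": -0.70,
    "skip_05": 0.90, "skip_06": -0.10, "skip_07": 1.10, "skip_08": 0.30,
    "skip_09": -0.45, "skip_10": 0.00,
}

passing = filter_by_abs_score(scores)
print(sorted(passing))
```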
[0090] In an embodiment, the splicing analysis method calculates a score for splice events using another algorithm that may be familiar to one skilled in the arts, such as ASAP or genASAP. In another embodiment, the splicing analysis method calculates a score using a machine learning algorithm, a genetic algorithm, a neural network, a simulated annealing algorithm, or another algorithm that may occur to one of skill in the arts.
[0091] In an embodiment, the splicing analysis method processes data for a plurality of splice events in a data set, computes a score, and filters data based on a function of the score. In another embodiment, the splicing analysis method further filters the data based on a minimum signal level. In another embodiment, the splicing analysis method processes data for 10 or more, or 100 or more, or 1000 or more, or 10000 or more, or 100000 or more, or 1 million or more, splice events in a data set. In an embodiment, the splicing analysis method processes data for splice events for one sample, or 2 or more samples, or 10 or more samples, or 20 or more samples, or 50 or more samples, or 100 or more samples. In an embodiment, the splicing analysis method processes data for exon skip events. In another embodiment, it processes data for alternative first exons. In another embodiment, it processes data for alternative last exons. In another embodiment, it processes data for intron retention events. In another embodiment, it processes data for alternative donor sites. In another embodiment, it processes data for alternative acceptor sites. In another embodiment, it processes data for multiple splice event types.
[0092] For example, suppose one had data for 10000 exon skip events, 5000 alternative first exons, 2000 alternative last exons, 5000 alternative donor sites, 7000 alternative acceptor sites, and 6000 intron retention events. The splicing analysis method may process data for one splice event type only, or a combination. It then filters the data and determines which splice events pass the filtering criteria using an algorithm such as those described above. For example, it might find that 100 exon skip events pass the filtering test.
[0093] In an embodiment, the splicing analysis method filters data by splice event type. For example, it might process data from a high-density microarray and output all data for probes that detect alternative donor sites. In an embodiment, the splicing analysis method filters data to identify data for a single splice event type. In another embodiment, the splicing analysis method filters data to identify multiple splice event types. For example, it might process data and output all data for probes that detect either alternative first exons or alternative last exons.
[0094] In an embodiment, the splicing analysis method filters data based on the sign of the score. For example, it may filter exon skip events and output data for all probes or events that are up-regulated in a sample relative to one or more others. In an embodiment, the splicing analysis method filters data and outputs data for up-regulated splice events or probes. In an embodiment, the splicing analysis method filters data and outputs data for splice events or probes with a positive score. In another embodiment, the splicing analysis method filters data and outputs data for down-regulated splice events or probes. In another embodiment, the data analysis method filters data and outputs data for splice events or probes with a negative score. In another embodiment, the splicing analysis method filters data and outputs data for splice events or probes with a score of zero, or of non-zero. The splicing analysis method may filter data using a single criterion or a combination of criteria. For example, it may find data with a minimum signal intensity, a positive score, and a specific splice event type, such as intron retentions.
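A combination of criteria such as the one just described (minimum signal intensity, positive score, a specific splice event type) might be expressed as below. This is a minimal sketch: the `SpliceEvent` record, its field names, and the sample values are all hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass
class SpliceEvent:
    event_id: str
    event_type: str    # e.g. "intron_retention", "exon_skip" (assumed labels)
    score: float       # signed splicing score
    min_signal: float  # lowest probe signal contributing to the event


def filter_events(events, event_types=None, sign=None, min_signal=0.0):
    """Return events matching every criterion that is supplied.

    event_types: optional set of event type labels to keep.
    sign: optional "positive" or "negative".
    min_signal: events whose lowest probe signal is below this are dropped.
    """
    kept = []
    for ev in events:
        if event_types is not None and ev.event_type not in event_types:
            continue
        if sign == "positive" and not ev.score > 0:
            continue
        if sign == "negative" and not ev.score < 0:
            continue
        if ev.min_signal < min_signal:
            continue
        kept.append(ev)
    return kept


events = [
    SpliceEvent("IR_1", "intron_retention", 0.8, 450.0),
    SpliceEvent("IR_2", "intron_retention", -0.6, 500.0),
    SpliceEvent("IR_3", "intron_retention", 1.2, 120.0),
    SpliceEvent("ES_1", "exon_skip", 0.9, 700.0),
]

selected = filter_events(
    events, event_types={"intron_retention"}, sign="positive", min_signal=300.0
)
print([ev.event_id for ev in selected])
```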
[0095] The splicing analysis method stores data for probes or for splice events that pass the filtering test. It stores the data in a storage medium, such as any of those listed above, including computer memory, a disk drive, etc. The data analysis method may transmit the data using any of the methods mentioned above, including network protocols, courier, etc. The storage medium is an aspect of the present invention of utility in its own right.
[0096] The splicing analysis method finds the intersection of data from two data stores. For example, suppose data was generated for two samples with replicates of each sample. The samples are S1R1, S1R2, S2R1 and S2R2. Suppose intron retention events in the first replicate pair (S1R1 and S2R1) are filtered and stored in a first file. Suppose intron retention events in the second replicate pair (S1R2 and S2R2) are filtered and stored in a second file. The intersection of the two data files may be of special interest, since it will contain only those intron retention events that passed the filtering criteria in both replicate pairs. The splicing analysis method creates a data store for the intersection of two or more data stores. The splicing analysis method transmits a data store for the intersection of two or more data stores. The data store and data transmission may be of any of the types mentioned in this document. The data store and data transmission are aspects of the present invention that are useful in their own rights.
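The intersection of two filtered data stores can be illustrated with Python sets. The event identifiers below are invented, and in practice the two collections would be read from the stored files; only the intersection step is shown.

```python
def intersect_event_ids(*stores):
    """Return the sorted event IDs present in every supplied data store."""
    if not stores:
        return []
    common = set(stores[0])
    for store in stores[1:]:
        common &= set(store)
    return sorted(common)


# Hypothetical intron retention events passing the filter in each replicate pair.
first_pair = ["IR_003", "IR_017", "IR_042", "IR_101"]   # S1R1 vs S2R1
second_pair = ["IR_017", "IR_055", "IR_101", "IR_230"]  # S1R2 vs S2R2

in_both = intersect_event_ids(first_pair, second_pair)
print(in_both)
```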
[0097] The splicing analysis method analyzes data in order to determine statistical significance. In an embodiment, the splicing analysis method calculates a standard deviation of data. For example, suppose technical replicates are used as a control. The two replicates can be compared using any of the algorithms above. Suppose Splicing Indexes are calculated for each exon skip event. Suppose there are 1000 exon skip events. Hence, there will be 1000 Splicing Index scores. The splicing analysis method calculates the standard deviation of these 1000 scores. In an embodiment, the splicing analysis method calculates a standard error of data. In an embodiment, the splicing analysis method uses empirical statistics to determine a confidence level. For example, the cutoff of the 10 scores (out of 1000) with the highest absolute values could be used as the 99% confidence level. For the present purposes, the standard deviation, standard error and confidence interval will all be referred to simply as 'confidence intervals'. In an embodiment, the splicing analysis method calculates a confidence interval for a single splice event type, for example, the 1000 exon skip events mentioned above. In an embodiment, the splicing analysis method calculates a confidence interval for splice events that involve the same number of probe measurements. For example, suppose a microarray contains 4 probes per exon skip event (3 exon-exon junction probes plus one exon probe). Suppose the microarray also contains 4 probes per intron retention event (1 exon-exon junction probe, 2 exon-intron junction probes, and 1 intron probe). The statistical properties of these two splice events might reasonably be assumed to be similar. The same confidence interval, standard deviation or standard error could be used for both splice event types.
Similarly, alternative donor sites and alternative acceptor sites might be detected by the same number of probes, perhaps 2 exon-exon junction probes for each (one probe for the short form, one for the long form). The statistical properties of donor and acceptor sites might reasonably be assumed to be the same. In an embodiment, the splicing analysis method calculates a single confidence interval for multiple splice event types; e.g., exon skips and intron retentions. The splicing analysis method creates a data store for the results of statistical calculation. The data store may be an address in computer memory, a data file, etc.
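The standard deviation and the empirical 99% cutoff described for the technical-replicate comparison can be sketched with the Python standard library. The function name is hypothetical, and the 1000 scores below are synthetic, evenly spaced values used only so the arithmetic is easy to follow; real replicate scores would come from the comparison itself.

```python
import statistics


def empirical_cutoff(scores, confidence=0.99):
    """Absolute-score cutoff such that roughly (1 - confidence) of the
    control scores lie at or beyond it (e.g. the 10th largest of 1000)."""
    ranked = sorted((abs(s) for s in scores), reverse=True)
    n_tail = max(1, round(len(ranked) * (1.0 - confidence)))
    return ranked[n_tail - 1]


# Synthetic stand-in for 1000 Splicing Index scores from technical replicates.
control_scores = [i / 1000 for i in range(-500, 500)]

spread = statistics.stdev(control_scores)
cutoff = empirical_cutoff(control_scores, confidence=0.99)
print(f"standard deviation: {spread:.4f}, 99% cutoff: +/- {cutoff:.3f}")
```

The cutoff is taken on absolute values, which matches an interval centered at zero; an asymmetric interval would need separate upper and lower percentiles.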
[0098] The splicing analysis method uses a statistical confidence interval as a filter. For example, suppose alternative first and last exons have a 99% confidence interval of +/- 0.10 for the Splice Ratio. When comparing data for two samples, a potential alternative first exon may have a Splice Ratio score of 0.15. This value lies outside of the confidence interval, which is centered at zero. Hence, the alternative first exon event passes the filtering criterion and is statistically significant with p < 0.01. (Additional criteria might be applied, of course, and the event would have to pass these additional criteria as well.) In an embodiment, the splicing analysis method uses a confidence interval as a filtering criterion for a single splice event type. In an embodiment, the splicing analysis method uses a confidence interval as a filtering criterion for two or more splice event types. E.g., alternative first and alternative last exons. In an embodiment, the splicing analysis method uses a confidence interval as a filtering criterion for all splice event types in a data set. [0099] The splicing analysis method calculates a confidence interval for gene expression. In an embodiment, the gene expression is calculated as described above. The splicing analysis method filters gene expression data derived from a splice variant microarray using the confidence interval. The splicing analysis method creates a data store for the confidence interval and transmits it. [00100] The splicing analysis method generates a report of alternative splicing changes in one or more samples. In an embodiment, the splicing analysis method computes splicing scores for data for one or more splice types. In an embodiment, it creates a data store (a computer file, a database table, a series of rows in a database, a series of addresses in memory, a printed document, a worksheet in a spreadsheet) for each splice type. 
For example, it creates a tab-delimited computer file containing results from processing alternative first exons, and another file containing results from processing intron retentions. In an embodiment, it includes the splicing score in the data store. In an embodiment, it includes only data that passes some filtering criterion or criteria as described above. In an embodiment, it creates a single data store for data from multiple splice types. For example, it stores data for each splice type in worksheets of a spreadsheet file. As another example, it stores all data in a single database table with a column that defines the splice type.
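A tab-delimited report with a column defining the splice type could be produced as sketched here with Python's `csv` module. The column names, row values and function name are assumptions for illustration; an in-memory buffer stands in for a file on disk.

```python
import csv
import io


def write_splice_report(rows, handle):
    """Write one tab-delimited line per splice event, with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["splice_type", "event_id", "score"])
    for row in rows:
        writer.writerow([row["splice_type"], row["event_id"], f"{row['score']:.2f}"])


# Hypothetical filtered results for two splice types.
rows = [
    {"splice_type": "intron_retention", "event_id": "IR_001", "score": 1.25},
    {"splice_type": "alt_first_exon", "event_id": "AFE_007", "score": -0.8},
]

buffer = io.StringIO()
write_splice_report(rows, buffer)
report_text = buffer.getvalue()
print(report_text)
```

Writing one such file per splice type, as in the first example above, only requires grouping the rows by `splice_type` before calling the function.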
[00101] In an embodiment, the splicing analysis method creates a summary of the analysis results for one or more splice types. Table 1 provides an example:
Table 1: Summary of analysis of splice types

| Data Type | Instances | MCF7 | MCF10A | Min Fold | Min Signal | Evidence |
|---|---|---|---|---|---|---|
| Genes Up | 2679 | 247 | 313 | 2 | 300 | All |
| Genes Down | 2679 | 313 | 247 | 2 | 300 | All |
| Alt Spliced: Short | 16247 | 426 | 456 | 1.4 | 300 | All |
| Alt Spliced: Long | 16247 | 456 | 426 | 1.4 | 300 | All |
| Exon Skips | 2217 | 30 | 38 | 1.4 | 300 | All |
| Exon Includes | 2217 | 38 | 30 | 1.4 | 300 | All |
| Alt First Exons: 3' | 1155 | 27 | 27 | 1.4 | 300 | All |
| Alt First Exons: 5' | 1155 | 27 | 27 | 1.4 | 300 | All |
| Alt Last Exons: 5' | 530 | 23 | 13 | 1.4 | 300 | All |
| Alt Last Exons: 3' | 530 | 13 | 23 | 1.4 | 300 | All |
| Alt 5' Donor: Short | 712 | 24 | 33 | 1.4 | 300 | All |
| Alt 5' Donor: Long | 712 | 33 | 24 | 1.4 | 300 | All |
| Alt 3' Acceptor: Short | 1071 | 58 | 27 | 1.4 | 300 | All |
| Alt 3' Acceptor: Long | 1071 | 27 | 58 | 1.4 | 300 | All |
| Intron Splices | 708 | 14 | 7 | 1.4 | 300 | All |
| Intron Retentions | 708 | 7 | 14 | 1.4 | 300 | All |
[00102] Table 1 identifies the number of 'instances' of each data type (gene or splice event type), such as alternatively spliced short spliceoforms, or alternative donor sites (short form or long form). It shows the number of instances of each splice event that pass the filtering criterion defined in the columns Min Fold, Min Signal and Evidence. For example, the MCF7 cell line contained 38 exon include events that passed the filtering criterion of a minimum fold change (linearized Splice Ratio) of 1.4 and minimum signal intensity for each probe of 300. The changes were observed in all sample comparisons between the sample group labeled MCF7 and the sample group labeled MCF10A, e.g., if there were only one sample or replicate for each cell line, there would be one pairwise comparison. If there were replicates, there would be two samples compared to two other samples. [00103] In an embodiment, the splicing analysis method creates a report for the comparison of two sample groups. In another embodiment, the splicing analysis method creates a report for a multi-sample comparison, comparing each sample to all others or to the average of all others. In another embodiment, the splicing analysis method creates a report for present or absent spliceoforms for each splice event type in one or more samples. For example, there may be 58 exon skip/include events where the skip form is present in a sample and 21 events where the include form is present in the sample.
[00104] In an embodiment, the splicing analysis method creates a profile of splicing scores for splice events in two or more samples. For example, a profile may comprise a vector of log ratio values for each sample, e.g., (-0.01, 0, 0.006, -0.15, 1.21). The first element in the vector is the score for a first sample vs. a second sample; the second element is the score of a second sample vs. a third sample. As another example, the scores may be present/absent scores for samples taken independently. Given two such profiles, the splicing analysis method computes a distance measure. In an embodiment, the splicing analysis method computes a Pearson correlation coefficient between two splicing profiles. In another embodiment, the splicing analysis method computes a Euclidean distance between two splicing profiles. In another embodiment, the splicing analysis method converts splicing profiles into bit strings and computes a Hamming distance. For example, suppose the bit-conversion assigns two bits to each sample and assigns a bit string of 00 for splicing scores within plus or minus the standard error or confidence interval; it assigns a bit string of 01 for positive splicing scores greater than the standard error or outside of the confidence interval, and it assigns a bit string of 10 to negative splicing scores less than minus the standard error or outside of the confidence interval. The Hamming distance between two such bit strings is equal to the number of bits that differ. For example, the Hamming distance between 00 10 01 and 00 01 01 is 2, since the two middle bits differ. The splicing analysis method may compute a distance for profiles using another algorithm that may occur to one of skill in the arts.
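The three distance measures named above can be sketched in plain Python. The function names and the threshold of 0.1 are illustrative assumptions; the two-bit encoding follows the convention described in the text (00 within the interval, 01 above it, 10 below it).

```python
import math


def to_bits(profile, threshold):
    """Encode each score as two bits: 00 within +/- threshold,
    01 above +threshold, 10 below -threshold."""
    codes = []
    for score in profile:
        if score > threshold:
            codes.append("01")
        elif score < -threshold:
            codes.append("10")
        else:
            codes.append("00")
    return "".join(codes)


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit strings must have equal length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def pearson(p, q):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((x - mean_p) * (y - mean_q) for x, y in zip(p, q))
    var_p = sum((x - mean_p) ** 2 for x in p)
    var_q = sum((y - mean_q) ** 2 for y in q)
    return cov / math.sqrt(var_p * var_q)


profile_a = [0.02, -0.50, 0.40]   # encodes as 00 10 01
profile_b = [0.00, 0.60, 0.35]    # encodes as 00 01 01
bits_a = to_bits(profile_a, threshold=0.1)
bits_b = to_bits(profile_b, threshold=0.1)
print(bits_a, bits_b, hamming(bits_a, bits_b))
print(round(euclidean(profile_a, profile_b), 3), round(pearson(profile_a, profile_b), 3))
```

Note that the Pearson coefficient is a similarity rather than a distance; one minus the coefficient is a common way to turn it into a distance, though the text does not specify this.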
[00105] The splicing analysis method creates a matrix of splicing profiles, e.g., each row contains the splicing scores for a splice event, and each column contains the splicing scores for a sample or sample comparison. Alternatively, the matrix could be arranged with splicing events in columns and samples in rows. The splicing analysis method creates a data store for the matrix or collection of splicing profiles. The data store may be a computer file, memory address, or other storage as mentioned above. The data store is an aspect of the present invention and is of use in its own right for storage and transmission. The splicing analysis method transmits the matrix of splicing profiles using a network protocol, postal service or other method mentioned above. The data transmission is an aspect of the present invention and is of use in its own right.
[00106] In some embodiments, the splicing analysis method applies a mathematical algorithm to profiles or matrices of splicing scores. In an embodiment, it performs a principal component analysis. In another embodiment, it clusters the profiles using a greedy clustering algorithm. In another embodiment, it clusters the profiles using self-organizing maps. In another embodiment, it clusters the profiles using a hierarchical clustering algorithm. In another embodiment, it clusters the profiles using k-means. In another embodiment, it clusters the profiles using another algorithm that may occur to one of skill in the arts. One of skill will appreciate that, once splicing scores have been calculated and stored in a matrix, the application of a mathematical algorithm is a straightforward matter that can be performed using statistical analysis software.
[00107] The splicing analysis method creates a data store for the results of the clustering analysis or principal component analysis. The data store may be of any of the types described in this document. The splicing analysis method transmits the data store using any of the methods described in this document. The data store and transmission are aspects of the present invention that are useful in their own right.
[00108] The splicing visualization method visually indicates data that passes the filtering test in a computer file, a spreadsheet or another medium such as a monitor. The splicing visualization method visually indicates data that passes the filtering test in a scatter plot in two or more dimensions. See the figures below. The scatter plot with visual indication is an aspect of the present invention of utility in its own right, since it enables a scientist to visually determine the extent of alternative splicing in a multi-sample comparison. In an embodiment, the splicing visualization method visually indicates data (in a scatter plot or spreadsheet or computer file) that passes the filtering criterion by using a different color from other data points. In an embodiment, it visually indicates data by changing the size of the symbol used to indicate the data, for example by using a larger circle or larger square. In an embodiment, it visually indicates data by changing the symbol shape, for example by displaying highlighted data with squares and non-highlighted data with circles. In another embodiment, it visually indicates data by outlining it, or making it blink, or giving it an animation effect. In another embodiment, the splicing visualization method visually indicates data by adjusting its opacity, saturation, hue, brightness, transparency, or other visual attribute. In another embodiment, the splicing visualization method visually indicates data by displaying a label, a popup, a tooltip, or a text message. In another embodiment, the splicing visualization method visually indicates data by using a different font: a different font face, font style, font decoration, font size, or other attribute of the typeface.
[00109] In some embodiments, the data analysis method visually differentiates between short form and long form probe data.
For example, suppose the data contains one exon-exon junction probe for a spliced intron and two exon-intron junction probes for a retained intron. The data analysis method visually differentiates between the exon-exon junction probe's data and the two exon-intron junction probes' data. In an embodiment, the data analysis method displays short form and long form data in different colors, for example, red for short form data and orange for long form data. In an embodiment, the data analysis method visually indicates short form data. In an embodiment, the data analysis method visually indicates long form data. In an embodiment, the data analysis method visually indicates both short form and long form data. The data analysis method may use any of the visual cues mentioned above.
[00110] In an embodiment, the data analysis method visually indicates the score for splicing data. In an embodiment, it indicates score by adjusting the hue. For example, a positive score could be red and a negative score green in a spreadsheet or scatter plot. In another embodiment, it visually indicates the score using the color saturation. For example, larger positive scores might be brighter red, larger negative scores brighter green, and scores close to zero nearly black. In another embodiment, it visually indicates the score using the transparency. For example, scores close to zero might be nearly transparent, whereas scores with large absolute values might be more opaque. In another embodiment, it visually indicates the score using the symbol size. For example, larger points in a scatter plot may indicate scores with larger absolute values. The data analysis method may visually indicate the score using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
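One way to realize the red/green/black coloring described above is a simple mapping from score to an RGB triple. This is a sketch under assumptions: the function name, the 8-bit RGB representation, and the saturation point `max_abs_score` are illustrative choices, not details given in the text.

```python
def score_to_rgb(score, max_abs_score=2.0):
    """Map a splicing score to an (R, G, B) triple with 0-255 channels.

    Positive scores shade toward bright red, negative scores toward bright
    green, and scores near zero are nearly black. Scores beyond
    max_abs_score are clipped to full brightness.
    """
    intensity = min(abs(score) / max_abs_score, 1.0)
    level = round(255 * intensity)
    if score > 0:
        return (level, 0, 0)
    if score < 0:
        return (0, level, 0)
    return (0, 0, 0)


for s in (-2.0, -0.2, 0.0, 0.2, 2.0, 5.0):
    print(s, score_to_rgb(s))
```

The same intensity value could instead drive opacity or symbol size, as in the other embodiments listed.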
[00111] The data analysis method visually indicates data of a given splice event type. For example, it may visually indicate exon skip data in one way, and intron retention data in another. In an embodiment, the data analysis method visually indicates data for a given splice event type using a different color. In an embodiment, the data analysis method visually indicates data for a given splice event type using a different shape. The data analysis method may visually indicate the splice event type using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
[00112] A spreadsheet, scatter plot, line plot, or other image with visually indicated data (for splice events, for data that passes a filter, for up- or down-regulation, for a splicing score, etc.) is an invention in its own right, of use to scientists in interpreting alternative splicing data. The data analysis method may store the visually indicated data in a storage medium or transmit it in any of the ways mentioned or that may occur to one of skill in the arts.
[00113] The data analysis method links data in a spreadsheet or scatter plot with a gene model viewer (a "splice graph"). In an embodiment, the user indicates data in a scatter plot and the corresponding part of the gene model is visually indicated. For example, the user may click on a spreadsheet row containing data for an exon-exon junction probe, and a gene model viewer may open, if it is not already open, and that exon-exon junction will be visually indicated. Or the user may move a mouse over a point in a scatter plot, and the relevant part of the gene model will be visually indicated. For example, an exon portion may be highlighted. The visual indication may employ any of the methods already mentioned, or another. For example, the visual indication might involve changing the color of a portion of the gene model, such as an exon-exon junction or exon or intron or exonic portion or exon-intron junction. Or the visual indication may change the visual attributes of the portion, or underline it, or outline it, or label it with text, or display an icon near it. In an embodiment, the data analysis method visually indicates a single region of the gene model, such as an exon, intron, exonic portion, module, exon-exon junction, exon-intron junction, or module junction. In another embodiment, the data analysis method visually indicates multiple regions of the gene model. For example, the user may select rows in a spreadsheet for multiple probes, and the gene model would then highlight all of the gene regions and splice isoform regions, targeted by those probes. The data analysis method may visually indicate the portion of the gene model using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
[00114] In an embodiment, the user indicates a portion or portions of a gene model, and the corresponding data in a spreadsheet or scatter plot is visually indicated. For example, the user moves a computer pointer over an exonic portion, or exon, or intron, or exon-exon junction, or exon-intron junction, and the associated data in a spreadsheet or scatter plot or line plot is highlighted. The data analysis method may visually indicate the data in the spreadsheet or scatter plot using any of the methods suggested above, or using another method that may occur to one of skill in the arts.
[00115] The visually indicated data (embodied in an image on screen, an image file, spreadsheet, storage medium, etc.) is an invention of use in its own right, since it can facilitate more intuitive understanding of alternative splicing data. The data analysis method may store the visually indicated data in a storage medium or transmit it in any of the ways mentioned or that may occur to one of skill in the arts.
[00116] The splicing integration method connects splicing data to software resources such as gene ontology tools, alternative splicing databases, sequence databases, pathway software, chemistry databases, etc. In an embodiment, the splicing integration method links splicing data to a gene ontology tool. In an embodiment, the splicing integration method links splicing data to a sequence database. In an embodiment, the splicing integration method links splicing data to a genome browser. In an embodiment, the splicing integration method links splicing data to an alternative splicing database. In an embodiment, the splicing integration method links splicing data to a pathway tool. In an embodiment, the splicing integration method links splicing data to a chemistry database. In an embodiment, the splicing integration method links splicing data to a gene model viewer.
[00117] For example, suppose an exon-exon junction probe is annotated with ACCESSION_ID and GENE_SYMBOL and has genomic coordinates on chromosome 6 of 5000 to 5020 for the first exonic portion and of 6000 to 6020 for the second exonic portion. In a software application, the probe annotation might be displayed in a spreadsheet, or the probe might be represented as a point in a scatter plot or line plot, or in a region of a gene model viewer. The user might indicate the probe annotation, representation or region by moving a cursor or navigating using the keyboard. The indicated probe might then be highlighted, or a tooltip or popup window or context menu might appear.
The user then gives a specified cue, such as a mouse cue (left click, right click, center click, mouse wheel, mouse drag, hover) or keyboard cue (key press) or input on a touch screen, or a combination of these or other methods. Afterward, the software application launches an external tool. For example, a hyperlink might open with a URL to a web-based tool. In an embodiment, the splicing integration method links a probe to a genome browser displaying information related to the genomic or chromosomal region targeted by the probe. For example, the University of California Santa Cruz genome database and web browser might open, displaying a base range that includes the probe's genomic locations. The browser might display genomically aligned sequences within that base range. As another example, the genome browser displays the nucleotide sequence of the probe. As another example, the genome browser displays the nucleotide sequence of the genomic region to which the sequence with the ACCESSION_ID aligns. For example, suppose the probe detects a spliceoform indicated by accession ABC12345. Suppose the sequence with that accession has been aligned to chromosome 6 with a given set of coordinates. The genome browser displays the nucleotide sequence of that coordinate set.
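The genome browser link described in paragraph [00117] might be built as sketched below. This is an assumption-laden illustration rather than the application's method: the helper name `probe_browser_url` and the `flank` parameter are hypothetical, and the genome assembly passed as `db` (here `hg38`) is an assumed value that the text does not specify. The URL uses the publicly documented `hgTracks` entry point of the UCSC Genome Browser with its `db` and `position` parameters.

```python
from urllib.parse import urlencode

# Public entry point of the UCSC Genome Browser track display.
UCSC_HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def probe_browser_url(chrom, portions, db="hg38", flank=0):
    """Build a browser URL whose base range covers every exonic portion of a probe.

    portions: list of (start, end) genomic coordinates, one per exonic portion.
    db:       genome assembly name (an assumption; not stated in the text).
    flank:    optional number of extra bases shown on each side.
    """
    start = max(1, min(s for s, _ in portions) - flank)
    end = max(e for _, e in portions) + flank
    # safe=":" keeps the chrom:start-end form readable in the query string.
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"}, safe=":")
    return f"{UCSC_HGTRACKS}?{query}"


# Junction probe from the example: 5000-5020 and 6000-6020 on chromosome 6.
url = probe_browser_url("chr6", [(5000, 5020), (6000, 6020)])
```

An application could pass the resulting string to the standard-library `webbrowser.open` call, or attach it to a hyperlink, menu item, or toolbar button, when the user gives the specified cue.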
[00118] In an embodiment, the splicing integration method connects splicing data to a resource using a hyperlink. In an embodiment, the splicing integration method connects splicing data to a resource using a menu item in the menu bar or in a context menu. In an embodiment, the splicing integration method connects splicing data to a resource using a toolbar button. In an embodiment, the splicing integration method connects splicing data to a resource using a keyboard shortcut. In an embodiment, the splicing integration method connects splicing data to a resource using a mouse cue. In an embodiment, the splicing integration method connects splicing data to a resource using a mouse cue and a keyboard shortcut. In an embodiment, the splicing integration method connects splicing data to a resource using another input method or cue. One of skill will appreciate the variety of methods that may be employed to connect splicing data to software and database resources.
EXAMPLE
[00119] The following example is provided to illustrate the methods. Additional embodiments will be apparent to one skilled in the art without departing from the scope of the invention.
Example: Splice variants in different cells
[00120] Figures 1A-1H show scatter plots obtained using indicator polynucleotides corresponding to particular polynucleotide sequences of differentially expressed splice variants with a minimum signal of 200 and a "Splice Fold" (linearized Splice Ratio) score > 2 (a >99.9% confidence interval for all splice types). The indicator polynucleotides were used in a microarray.
[00121] Plots on the left show the splice variants present in MCF7 cells vs. CaCO2 cells. Plots on the right show technical replicates from HEK293 cells. The indicator polynucleotides used in the microarrays detect exon skip events (panels A and B); alternative first and last exons (C and D); intron retentions (E and F); and alternative acceptor and donor sites (G and H). Different gene isoforms are clearly present in the different cell types.
[00122] The data were analyzed using a database of alternative splicing in human (i.e., the SpliceExpress Human Spliceome database, or SEHS), or the Splicing Index, ASPIRE, and Splice Ratio methods described herein. A comparison of the results is shown in Figure 2.
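The selection thresholds stated for Figures 1A-1H (a minimum signal of 200 and a Splice Fold score greater than 2) could be applied as in the following sketch. Several points are assumptions rather than statements from the example: the Splice Fold score is taken as already computed, "minimum signal" is read as requiring that at least one of the two compared samples reaches the threshold, and the record field names and values are hypothetical.

```python
def passes_filter(signal_a, signal_b, splice_fold, min_signal=200.0, min_fold=2.0):
    """Return True when a splice event meets both thresholds.

    Assumption: the signal criterion is met if either sample reaches min_signal.
    splice_fold is supplied by the caller; it is not derived here.
    """
    return max(signal_a, signal_b) >= min_signal and splice_fold > min_fold


# Illustrative records only (assumed values, not data from the example).
events = [
    {"id": "ev1", "signal_a": 850.0, "signal_b": 150.0, "splice_fold": 3.1},
    {"id": "ev2", "signal_a": 120.0, "signal_b": 90.0, "splice_fold": 4.0},
    {"id": "ev3", "signal_a": 900.0, "signal_b": 700.0, "splice_fold": 1.4},
]

kept = [
    e["id"]
    for e in events
    if passes_filter(e["signal_a"], e["signal_b"], e["splice_fold"])
]
```

Here only `ev1` is retained: `ev2` fails the signal threshold and `ev3` fails the Splice Fold threshold.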

Claims

CLAIMS
What is claimed is:
1. A hybridization method for measuring the levels of alternatively-spliced forms of a gene, the method comprising:
(a) providing two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences of the gene selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene;
(b) incubating a sample comprising alternatively-spliced forms of the gene in the presence of the two or more indicator polynucleotides;
(c) measuring a hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to each of the two or more indicator polynucleotides;
(d) applying a mathematical algorithm to calculate the relative expression levels of alternatively-spliced forms of the gene.
2. The method of claim 1, wherein the mutually exclusive indicator polynucleotides are non-overlapping.
3. The method of claim 1, wherein the mutually exclusive indicator polynucleotides are overlapping.
4. The method of claim 1, wherein at least one mutually exclusive indicator polynucleotide corresponds to a polynucleotide that is constitutively present in alternatively spliced forms of the gene.
5. The method of claim 1, wherein at least one mutually exclusive indicator polynucleotide corresponds to a polynucleotide that is not constitutively present in alternatively spliced forms of the gene.
6. The method of claim 1, wherein an overall level of expression of alternatively-spliced forms of a gene is calculated by summing the amount of hybridization signal corresponding to the relative amounts of hybridization to each of the mutually exclusive indicator polynucleotides.
7. The method of claim 6, wherein the overall level of gene expression (G) is calculated using the equations:
$$G = \sum I = \left(\prod B\right)^{1/n} \qquad (1)$$
$$B_{\text{exon}} = p + p_c \qquad (2)$$
$$B_{\text{junc}} = \sqrt{(p_5 + p_{5c})(p_3 + p_{3c})} \qquad (3)$$
wherein
$G$ is the overall gene expression level; each $I$ is an alternatively-spliced form of the gene; $n$ is the number of indicator polynucleotides; and each $B$ is the sum of expression levels of all alternatively-spliced forms of the gene; wherein when at least one indicator polynucleotide corresponds to an exon or intron, $B_{\text{exon}}$ is equal to the sum $p + p_c$, wherein $p$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to the exon or intron, and $p_c$ is the sum of the amounts of hybridization from each indicator polynucleotide, and when at least one indicator polynucleotide corresponds to an exon-exon or exon-intron junction, $B_{\text{junc}}$ is $\sqrt{(p_5 + p_{5c})(p_3 + p_{3c})}$, wherein $p_5$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to a 5' portion of the junction, and $p_3$ is the amount of hybridization signal corresponding to the amount of hybridization of the alternatively-spliced forms of a gene to the indicator polynucleotide corresponding to a 3' portion of the junction.
8. The method of claim 7, wherein background levels of hybridization signal are subtracted from the overall expression level.
9. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon.
10. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an intron.
11. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to a module.
12. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon-exon junction.
13. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an exon-intron junction.
14. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to an intron-exon junction.
15. The method of claim 1, wherein at least one of the two or more mutually exclusive indicator polynucleotides corresponds to a module-module junction.
16. The method of claim 1, wherein the indicator polynucleotides are in a microarray.
17. The method of claim 1, wherein the alternatively-spliced forms of a gene are mRNAs, and the indicator polynucleotides are complementary to the mRNA.
18. The method of claim 1, wherein the alternatively-spliced forms of a gene are cDNAs, and the indicator polynucleotides are complementary to the cDNA.
19. Software for performing the calculations of claim 7.
20. Software for determining the amounts of different gene splice variants using data obtained in a microarray having two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences of the gene, selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene, wherein the software applies a mathematical algorithm to calculate the relative expression levels of different gene splice variants.
21. A kit of parts for measuring the levels of alternatively-spliced forms of a gene, the kit comprising:
(a) two or more mutually exclusive indicator polynucleotides corresponding to polynucleotide sequences selected from exons, introns, modules, exon-exon junctions, exon-intron junctions, intron-exon junctions, and module-module junctions, of alternatively-spliced forms of a gene;
(b) mathematical algorithms for calculating the total and relative levels of alternatively-spliced forms of a gene using hybridization signals corresponding to the amount of hybridization of the alternatively-spliced forms of the gene to each of the indicator polynucleotides; and
(c) instructions for using the indicator polynucleotides and mathematical algorithms.
22. The kit of parts of claim 21, wherein the indicator polynucleotides are in a microarray.
23. The kit of parts of claim 22, wherein the mathematical algorithms are provided in an executable computer application.
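The overall-expression calculation recited in equations (1) to (3) of claim 7 can be sketched as below, reading equation (1) as the geometric mean of the per-indicator B values. This is an illustrative sketch under stated assumptions, not the software of claim 19: the function names are hypothetical, the inputs `p5c` and `p3c` are used exactly as they appear in equation (3) without further interpretation, and the numeric values are invented for the example.

```python
import math


def b_exon(p, pc):
    """Equation (2): B for an exon or intron indicator, p + pc."""
    return p + pc


def b_junction(p5, p5c, p3, p3c):
    """Equation (3): B for a junction indicator, sqrt((p5 + p5c) * (p3 + p3c))."""
    return math.sqrt((p5 + p5c) * (p3 + p3c))


def overall_expression(b_values):
    """Equation (1): G as the n-th root of the product of the n B values."""
    if not b_values:
        raise ValueError("at least one B value is required")
    if any(b <= 0 for b in b_values):
        raise ValueError("B values must be positive for a geometric mean")
    # Summing logarithms avoids overflow when many large signals are multiplied.
    return math.exp(sum(math.log(b) for b in b_values) / len(b_values))


# Invented signals for one exon indicator and one junction indicator.
b1 = b_exon(300.0, 100.0)                 # 300 + 100
b2 = b_junction(20.0, 30.0, 120.0, 80.0)  # sqrt(50 * 200)
g = overall_expression([b1, b2])          # sqrt(b1 * b2)
```

Background subtraction as in claim 8 would be a separate step; it is not shown here because the claim does not state how the background level is estimated.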
PCT/US2008/001682 2007-02-08 2008-02-08 Methods for determining splice variant types and amounts WO2008097632A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US90067907P 2007-02-08 2007-02-08
US60/900,679 2007-02-08

Publications (3)

Publication Number Publication Date
WO2008097632A2 true WO2008097632A2 (en) 2008-08-14
WO2008097632A3 WO2008097632A3 (en) 2008-11-13
WO2008097632A9 WO2008097632A9 (en) 2008-12-24

Family

ID=39682324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/001682 WO2008097632A2 (en) 2007-02-08 2008-02-08 Methods for determining splice variant types and amounts

Country Status (1)

Country Link
WO (1) WO2008097632A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6664046B1 (en) * 1999-12-16 2003-12-16 Roche Molecular Systems, Inc. Quantitation of hTERT mRNA expression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU G.K.: 'Predicting Splice Variant from DNA Chip Expression Data' vol. 11, 2001, pages 1237 - 1245 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370055A (en) * 2020-03-05 2020-07-03 中南大学 Intron retention prediction model establishing method and prediction method thereof
CN116469456A (en) * 2022-12-30 2023-07-21 浙江安诺优达生物科技有限公司 Training method and prediction method for machine learning model of variable shear event prediction and application
CN116469456B (en) * 2022-12-30 2023-12-15 浙江安诺优达生物科技有限公司 Training method and prediction method for machine learning model of variable shear event prediction and application

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08725328

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08725328

Country of ref document: EP

Kind code of ref document: A2