WO2003050264A2 - Method for the analysis of differential gene expression - Google Patents

Method for the analysis of differential gene expression

Publication number
WO2003050264A2
Authority
WO
WIPO (PCT)
Prior art keywords
sample
stepper
data
signature
samples
Prior art date
Application number
PCT/US2002/039650
Other languages
French (fr)
Other versions
WO2003050264A3 (en
Inventor
Jing-Zhong Lin
Christian D. Haudenschild
Ramesh V. Naire
Benjamin A. Bowen
Original Assignee
Lynx Therapeutics, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lynx Therapeutics, Inc.
Priority to AU2002360560A (AU2002360560A1)
Publication of WO2003050264A2
Publication of WO2003050264A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers

Definitions

  • the present invention relates to methods, computer programs and systems for the analysis of differential gene expression.
  • short sequences are sometimes called signature sequences, e.g., as in MPSS, or tags, e.g., as in SAGE. Determination of this short sequence allows identification of an individual mRNA and, possibly, the identification of the corresponding gene.
  • the frequency of the specific short sequence in one sample can be compared with the frequency of the specific short sequence in another sample. A comparison of the two different samples can yield information regarding the expression pattern. For example, previous comparisons have examined only up to about 50,000 mRNA molecules in a cell (sample) using SAGE data.
  • MPSS can generate expression data on greater than about 150,000 mRNA molecules in a sample (cell). This type of data allows for detection of genes that are expressed at low levels within the cell. Analysis procedures are lacking for the quantity of data generated by such procedures from a single sample.
  • the present invention uses methods, computer program products and systems to statistically analyze differential gene expression from data generated from greater than 150,000 mRNA molecules in a single sample.
  • the present invention statistically analyzes differential gene expression from varying samples, wherein each sample generates greater than, e.g., about 150,000 mRNA molecules.
  • Methods, computer program products and systems are provided.
  • the methods include collecting data from a plurality of samples, each sample comprising a population of signature sequences that represents greater than about 150,000 mRNA molecules; importing the data into a database, e.g., a relational database; analyzing the data for changes in gene expression using a pairwise comparison of a first sample and a second sample selected from the plurality of samples using a statistical test of significance; thereby generating a probability value of a differential expression for each signature sequence; and, providing the probability of the differential expression value to a user or an automated system.
  • the statistical tests include a two-tailed normal approximation test, a chi-square test, a Fisher Exact test, a generalized linear model, Bayesian methods and/or the like.
  • the statistical test of significance is a normal approximation test, which includes a two-tailed test using a first equation:
  • n1 is a number of mRNA molecules or cloned cDNA copies sequenced that represents the population of signature sequences for the first sample
  • n2 is a number of mRNA molecules or cloned cDNA copies sequenced that represents the population of signature sequences for the second sample
  • abundance of an individual signature sequence in the first sample is represented by x1 and the abundance of the individual signature sequence in the second sample is x2
  • p1 and p2 are represented by a second equation and a third equation:
  • n1 and n2 are large, e.g., greater than about 150,000 mRNA molecules or cloned cDNA copies, e.g., greater than about 200,000 mRNA molecules or cloned cDNA copies, e.g., greater than about 300,000 mRNA molecules or cloned cDNA copies, and typically, about 300,000 to about 10,000,000 mRNA molecules or cloned cDNA copies.
  • the difference of p1 and p2 is distributed according to a first expression:
  • the level of significance of the test is determined by calculating the probability of observing a larger Z by chance. This probability is calculated by using a sixth equation:
  • x is the value of the concerned variable under a standard normal distribution.
  • when Pr(x>Z) is equal to or smaller than a predetermined level of significance, e.g., 0.05, or e.g., 0.01, or e.g., 0.001, then a significant differential expression between a first sample and a second sample is declared for that particular significance level.
  • the methods of analysis include analyzing data from a sample, where the sample includes a large number of sequenced nucleic acids, e.g., mRNAs, cloned cDNA copies and/or the like.
  • sequenced nucleic acid can be a partially or fully sequenced mRNA, cloned cDNA copy and/or the like.
  • the methods include where the population of signature sequences in a sample represent, e.g., greater than about 150,000 mRNA molecules, e.g., greater than about 200,000 mRNA molecules, e.g., greater than about 300,000 mRNA molecules, e.g., greater than about 500,000 mRNA molecules, e.g., greater than about 750,000 mRNA molecules, or e.g., greater than about 1,000,000 mRNA molecules.
  • the population of signature sequences can be generated from a variety of techniques, e.g., massively parallel signature sequencing (MPSS), e.g., microarrays, e.g., SAGE data and/or the like.
  • one or more abundance estimates for an individual signature sequence from the population of signature sequences are generated from one or more independent sampling procedures, e.g., separate MPSS "stepper" datasets.
  • one stepper dataset is generated by a 2-stepper procedure ("2-stepper") and another stepper dataset is generated by a 4-stepper procedure ("4-stepper").
  • the methods include choosing the independent sampling procedure, e.g., the stepper dataset, which produces the highest value for the individual signature sequence.
  • the abundance estimate for an individual signature sequence from the population of signature sequences is an average from more than one 2-stepper dataset or from more than one 4-stepper dataset.
  • a standard deviation for the average can be calculated.
  • Methods of the present invention include generating tables for the data.
  • a table can include information regarding one sample, e.g., a list of unique signature sequences from the population of signature sequences correlated with an abundance estimate or level for each of the unique signature sequences found in the one sample, and/or a list of the probability value of differential expression for the signature sequence.
  • the table is generated from information for a plurality of samples.
  • the methods of the present invention can optionally include filtering the data.
  • the filtering step occurs before the analyzing the data step.
  • the data can be filtered by removing sequences from the population of signature sequences that contain more than one ambiguous nucleotide, and/or that do not match to a known sequence in a known genome.
  • Data can also be filtered by removing values that fail a minimum abundance test.
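The filtering steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `filter_signatures`, the use of a copies-per-million threshold for the minimum abundance test, and the representation of a genome as a set of known signatures are all assumptions made for the example.

```python
def filter_signatures(counts, total, known_signatures=None,
                      max_ambiguous=1, min_per_million=0.0):
    """Return the subset of {signature: count} that passes the filters.

    counts           -- dict mapping signature sequence to its raw count
    total            -- total number of signatures in the dataset (before filtering)
    known_signatures -- optional set of signatures found in a known genome (hypothetical input)
    max_ambiguous    -- signatures with more ambiguous bases than this are removed
    min_per_million  -- assumed form of the minimum abundance test
    """
    kept = {}
    for sig, count in counts.items():
        # Any base other than A, C, G or T is treated as ambiguous (e.g. N).
        ambiguous = sum(1 for base in sig.upper() if base not in "ACGT")
        if ambiguous > max_ambiguous:
            continue
        # Optional genome-match filter.
        if known_signatures is not None and sig not in known_signatures:
            continue
        # Minimum abundance test, expressed here in copies per million.
        if count / total * 1_000_000 < min_per_million:
            continue
        kept[sig] = count
    return kept
```

Filtering before the statistical analysis, as the text describes, keeps the pairwise comparison from being run on signatures that would be discarded anyway.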
  • the methods use a pairwise comparison of two samples, e.g., a first sample and a second sample.
  • One or more datasets for each sample can be used, e.g., that are replicates, or e.g., that are obtained via independent sampling procedures.
  • the methods can be performed with anything that generates at least two different samples.
  • Examples include, but are not limited to, a first cell type and a second cell type, a first normal sample and a second other than normal sample, a first species sample and a second species sample, a first recombinant inbred line and a second recombinant inbred line, a first state sample and a second state sample, a first treated sample and a second untreated sample, a first infected sample and a second uninfected sample, a first disease-state sample and a second non-disease state sample, a first injured state sample and a second non-injured state sample, a first developmental stage sample and a second different developmental stage sample, at least two disparate species, at least two disparate organs and the like.
  • a variety of parameters can be associated with a signature sequence.
  • the methods of the present invention can further include a comparison test of p1 and p2 to assign a set of parameters to the samples being analyzed.
  • the methods can be used to compare datasets that have been generated from, e.g., replicate samples, or e.g., independent sampling procedures.
  • with independent sampling procedures for estimating gene expression abundance, each procedure can have its own systematic biases that can influence the determination of the abundance values in a procedure-specific manner.
  • 2-stepper and 4-stepper data are generated via independent sampling procedures.
  • one or more abundance estimates for an individual signature sequence for the population of signature sequences are generated from an independent sampling procedure.
  • an estimate of abundance is determined via both procedures, e.g., which can be similar or dissimilar, but for some signatures, e.g., those with palindromes, only one estimate is determined with one procedure or the other.
  • a comparison test can be done.
  • the comparison test includes the following steps: (1) choosing either the data from a first independent sampling procedure or the data from a second independent sampling procedure based on the procedure that gave a higher total value, where the higher total value is p1 plus p2, thereby providing the chosen data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both the first and second procedures.
  • the steps can include: (1) choosing either the data from the first and the second sample generated by a 2-stepper or the data from the first and the second sample generated by a 4-stepper based on the procedure that gave a higher total stepper value, where the higher total stepper value is p1 plus p2, thereby providing the chosen stepper data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen stepper data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both stepper procedures.
  • a first parameter is assigned to the chosen stepper procedure, wherein the first parameter is a T when the 2-stepper is chosen in the comparison test or is an F when the 4-stepper is chosen in the comparison test.
  • This first parameter provides a stepper value.
  • the first parameter is optionally stored in a database.
  • a second parameter is assigned to each signature sequence from the population of signature sequences, where the second parameter is 1, 2, 4, 14, or 24.
  • a 1 is assigned when a downward change is found in the comparison test.
  • a 2 is assigned when an upward change is found in the comparison test.
  • a 4 is assigned when an inconsistent change is found in the comparison test.
  • a 14 is assigned when a downward change is found in the comparison test and only one dataset is used, e.g., either from a 2-stepper procedure or a 4-stepper procedure.
  • a 24 is assigned when an upward change is found in the comparison test and only one dataset is used, e.g., either from a 2-stepper procedure or a 4-stepper procedure.
  • a 1 is assigned when a downward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the downward change.
  • a 2 is assigned when an upward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the upward change.
  • a 4 is assigned when the direction of change is inconsistent.
  • a 14 is assigned when the downward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used.
  • a 24 is assigned when the upward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper can be used.
  • a 1 is assigned when a downward change is found in the comparison test, and more than one independent sampling method, e.g., 2-stepper and 4-stepper, exhibit a downward change in the same direction.
  • a 2 is assigned when an upward change is found in the comparison test, and more than one independent sampling method, e.g. 2-stepper and 4-stepper, exhibit an upward change in the same direction.
  • a 4 is assigned when a change is found in the comparison test, and when the change with an independent sampling method, e.g. 2-stepper and 4-stepper, is inconsistent.
  • a 14 is assigned when a downward change is found in the comparison test and when data from only one independent sampling procedure, e.g.
  • This second parameter provides a direction value and, typically, can provide a way of identifying the gene expression differences that are consistently observed with independent sampling methods.
  • the second parameter is optionally stored in a database.
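The assignment of the first parameter (stepper value, T or F) and the second parameter (direction value, 1, 2, 4, 14 or 24) can be sketched as below. Several points are assumptions made for illustration and are not stated in the text: "upward" is taken to mean p2 greater than p1, a tie in the stepper totals selects the 4-stepper, a tie between p1 and p2 is treated as inconsistent when both datasets exist and left unassigned (`None`) when only one exists, and the function names are hypothetical.

```python
def direction(p1, p2):
    """Direction of change from the first sample (p1) to the second sample (p2)."""
    if p2 > p1:
        return "up"
    if p2 < p1:
        return "down"
    return "none"  # tie: not covered by the text


def assign_parameters(two_stepper=None, four_stepper=None):
    """Return (stepper_value, direction_value).

    Each argument is a (p1, p2) tuple of proportions for one signature,
    or None when that stepper procedure gave no usable estimate.
    """
    if two_stepper is None and four_stepper is None:
        raise ValueError("at least one stepper dataset is required")

    if two_stepper is not None and four_stepper is not None:
        # Choose the procedure with the higher total p1 + p2.
        use_two = sum(two_stepper) > sum(four_stepper)
        chosen, other = ((two_stepper, four_stepper) if use_two
                         else (four_stepper, two_stepper))
        d_chosen, d_other = direction(*chosen), direction(*other)
        if d_chosen == d_other == "down":
            code = 1      # both procedures agree: downward
        elif d_chosen == d_other == "up":
            code = 2      # both procedures agree: upward
        else:
            code = 4      # inconsistent between procedures
    else:
        # Only one procedure produced data for this signature.
        use_two = two_stepper is not None
        chosen = two_stepper if use_two else four_stepper
        code = {"down": 14, "up": 24}.get(direction(*chosen))

    return ("T" if use_two else "F"), code
```

For example, `assign_parameters((100, 40), (80, 30))` selects the 2-stepper (total 140 versus 110) and, because both datasets show a decrease, yields `("T", 1)`.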
  • the methods of the present invention also include methods of gathering information on a signature sequence.
  • the method includes collecting data from a plurality of samples, each sample comprising a population of signature sequences that is generated from greater than about 150,000 mRNA molecules; importing the data into a database; gathering information on at least one unique signature sequence from the population of signature sequences from at least one other database or at least one web page or a combination of both. This generates unique information for the at least one unique signature sequence; and provides the unique information for the at least one unique signature sequence to a user or an automated system.
  • the other database can be a variety of databases, e.g., a sequence database, a publication database, a gene annotation database and the like.
  • the population of signature sequences can be batch-blasted against a variety of databases, e.g., the various NCBI databases, e.g., NR, UniGene, EST and the like, using a network client, e.g., NCBI's network BLAST client 'blastcl3,' or using a local client, e.g., local client 'blastall,' to generate individual BLAST files of hits against each of the variety of databases.
  • a hit is a match of the signature sequence to a sequence in a database based on desired criteria, e.g., number of nucleotides matching, score based on homology, the location of the match in the sequence in the database and the like.
  • the hits are then optionally ordered, flagged based on desired criteria, sorted by the flag(s) and associated with gene identifications from the database(s).
  • Other information of relevance associated with genes e.g. functional information, cellular localization or references to associated literature can be collected from one or more databases, e.g. those available on the Internet, and compiled in a local database associated with any selected population of signature sequences.
  • the present invention can be understood in the context of importing data over a communication media.
  • An important application for the present invention, and an independent embodiment, is in the field of gathering information on a signature sequence over the Internet, optionally using Internet media protocols and formats, such as HTTP, RTTP, XML, HTML, dHTML, VRML, as well as image, audio, or video formats etc.
  • Methods and apparatus of the present invention can also be used in other related situations where users access content over a communication channel, such as modem access systems, institution network systems, wireless systems, etc.
  • the methods of the present invention are optionally performed in a computer.
  • the data are collected from a central database and the data are stored on a local system.
  • values calculated, e.g., the probability of differential expression value, parameters assigned, e.g., the first and/or second parameter, information gathered, tables generated, are stored in a storage medium.
  • Computer program products are also aspects of the present invention.
  • the present invention includes a computer program product for analysis of differential gene expression.
  • the program includes code that receives as input a plurality of signature sequences from a plurality of samples; code that selects samples from the plurality of samples to be analyzed; and code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples using a statistical test of significance, where the codes are stored on or in a tangible medium.
  • the statistical test of significance is a two-tailed normal approximation test.
  • the program includes a code that removes signature sequences that fail to meet a minimum abundance test and, optionally, code that removes signature sequences with ambiguous nucleotide assignments. Code is optionally provided that removes signature sequences that do not match a known genome.
  • the program includes code that receives as input a plurality of signature sequences from a plurality of samples, code that gathers available information on at least one signature sequence from at least one different database, code that matches the specific signature sequence with the information gathered, and code that stores the matched information in a tangible medium.
  • the program includes code that compares p1 and p2 to generate a first parameter and/or a second parameter as described above.
  • code is provided that generates tables of the signature sequences and the frequency of the signature sequence in a sample, and/or the probability of differential expression value for the signature sequence.
  • the systems include a processor; and a computer readable medium coupled to the processor, said computer readable medium storing a computer program.
  • the computer program includes code for any of the methods used in the present invention.
  • the code can include a code that receives as input a plurality of signature sequences from a plurality of samples; a code that selects samples from the plurality of samples to be analyzed; and a code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples.
  • Other codes are described above.
  • Figure 1 is a flow-chart illustrating a method of differential gene expression analysis.
  • Figure 2 illustrates an inverse relationship between the level of gene expression and fold differences that can be detected by a normal approximation test at three levels of significance.
  • FIG. 3 Panels A and B is a flow-chart illustrating a method of differential gene expression analysis using a normal approximation test for the statistical test of significance.
  • Figure 4 is a flow-chart illustrating a method of assigning parameter 1, the stepper value.
  • Figure 5 is a flow-chart illustrating a method of assigning parameter 2, the direction value.
  • Figure 6 illustrates a method of parsing multi-database hits, e.g., BLAST hits, in a desired hit order, flagging hits based on desired criteria, sorting the hits by flag and appending gene identifications from a database, e.g., UCSC GoldenPath, in parallel.
  • Figure 7 illustrates a sample of gene identifications for 10 signature sequences.
  • FIG. 8 Panels A, B, C and D is a flow-chart illustrating a differential gene expression system according to the specific embodiments of the present invention.
  • Panel 8A illustrates steps to select non-ambiguous signatures.
  • Panel 8B illustrates steps to remove a signature that does not match a known sequence in a known genome.
  • Panel 8C illustrates steps to analyze the samples by using a normal approximation test.
  • Panel 8D illustrates steps to remove signatures that fail to meet a minimum abundance test.
  • Figure 9 Panels A and B illustrates the correlation between replicate sampling methods of 4-stepper data versus 4-stepper data using MPSS, as in Panel A, and illustrates the correlation between independent sampling methods of 4-stepper data versus 2-stepper data using MPSS as in Panel B.
  • the term "signature sequence" refers to any identification method that uniquely identifies a specific nucleic acid, e.g., mRNA molecule, cDNA, and the like. Signature sequences are also called "signatures." Examples of signature sequences include nucleotide sequences of 17 or more bases generated, e.g., from MPSS.
  • Signature sequences also include nucleotide sequence tags generated from SAGE techniques.
  • the term "abundance estimate" refers to the frequency of a single type of mRNA molecule relative to all the mRNA molecules in a sample. For example, the abundance estimate of any single gene in a dataset is calculated by dividing the number of signature sequences from that gene by the total number of signature sequences for all mRNAs present in the dataset.
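The abundance estimate defined above is a simple ratio, which the following sketch computes for every signature in a dataset. The function names and the list-of-strings input format are assumptions for illustration; the copies-per-million conversion is included only because the text later quotes expression levels in those units.

```python
from collections import Counter


def abundance_estimates(signatures):
    """Map each unique signature to (its count) / (total signatures in the dataset)."""
    counts = Counter(signatures)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


def per_million(fraction):
    """Express an abundance estimate as copies per million."""
    return fraction * 1_000_000
```

A dataset in which one signature accounts for 35 of 1,000,000 sequenced signatures would give that signature an abundance estimate of 0.000035, i.e., 35 copies per million.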
  • the term "dataset" refers to data that is generated by a method for gene expression analysis using a sample from an organism. As described in further detail below, any of a variety of gene expression analyses can be used to generate the dataset.
  • a dataset includes a set of sequences, or a "population of sequences" representing the genes that are expressed in the sample.
  • a dataset also includes "associated values," e.g., representing the level of expression of each gene/sequence.
  • the term "independent sampling procedure(s)" refers to one or more procedures for estimating the abundance of expressed gene products, e.g., cDNA or polypeptides.
  • For example, with MPSS, 2-stepper and 4-stepper datasets are generated via independent sampling procedures. Similarly, MPSS and SAGE are independent procedures for sampling cDNA abundance.
  • Gene expression data produced with different kinds of microarrays, e.g. those from Affymetrix vs. spotted cDNA microarrays, are also generated via independent sampling procedures.
  • the present invention provides methods, computer program products and systems to analyze expression of large sets of genes in parallel. Specifically, a pairwise comparison is performed with data generated from two samples using a statistical test of significance.
  • FIG. 1 is a flow-chart illustrating a method of differential gene expression analysis. This figure illustrates a general embodiment that analyzes the data for changes in gene expression using a pairwise comparison of a first sample and a second sample using a statistical test of significance.
  • the steps include collecting data, e.g., via one or more independent sampling procedures with or without replicates, from a plurality of samples (Step 100), e.g., collecting data can include accessing data that is currently being generated, or e.g., accessing data that is stored, e.g., in a database; importing the collected data (Step 102), and analyzing the data for differential gene expression (Step 104) using a statistical test of significance, e.g., a normal approximation, a chi-square test, a Fisher Exact test, a generalized linear model, Bayesian methods and/or the like.
  • the statistical test of significance is a normal approximation test, which includes a two-tailed test using a first equation:
  • n1 is a number of mRNA molecules or cloned cDNA copies sequenced that represents (generates) the population of signature sequences for the first sample
  • n2 is a number of mRNA molecules or cloned cDNA copies sequenced that represents (generates) the population of signature sequences for the second sample
  • abundance of an individual signature sequence in the first sample is represented by x1 and the abundance of the individual signature sequence in the second sample is x2
  • p1 and p2 are represented by a second equation and a third equation:
  • n1 and n2 are large, e.g., greater than about 150,000, e.g., greater than about 200,000, e.g., greater than about 300,000 mRNA molecules or cloned cDNA copies, and typically, about 300,000 to about 10,000,000 mRNA molecules or cloned cDNA copies.
  • x1 and x2 represent the abundance of a specific signature in samples 1 and 2, respectively, and n1 and n2 represent the total number of signatures generated for all mRNAs in samples 1 and 2.
  • the level of significance of the test is determined by calculating the probability of observing a larger Z by chance. This probability is calculated by using the following equation:
  • x is the value of the concerned variable under a standard normal distribution.
  • when Pr(x>Z) is equal to or smaller than a predetermined level of significance, e.g., 0.05, e.g., 0.01, or e.g., 0.001, a significant differential expression between a first sample and a second sample is declared for that particular significance level.
  • This test is also known as a Z-test. See, e.g., Kal et al., (1999), Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources, Mol. Biol. Cell 10:1859-1872.
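The equations referred to above are not reproduced in this text, so the following is only a sketch of a two-tailed normal approximation (Z) test under stated assumptions: p1 = x1/n1 and p2 = x2/n2, a pooled proportion p0 = (x1 + x2)/(n1 + n2) is used for the standard error (the common two-proportion Z-test form), and the two-tailed probability is taken as 2·Pr(x > |Z|) under a standard normal distribution. The function name is hypothetical.

```python
import math


def normal_approximation_test(x1, n1, x2, n2):
    """Return (Z, two-tailed probability) for x1 of n1 versus x2 of n2 signatures."""
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis of no differential expression
    # (an assumption; the patent's own equation is not shown in this text).
    p0 = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p0 * (1.0 - p0) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Signature absent from (or saturating) both samples: no evidence of change.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # For a standard normal variable, 2 * Pr(x > |z|) equals erfc(|z| / sqrt(2)).
    probability = math.erfc(abs(z) / math.sqrt(2.0))
    return z, probability
```

A signature would then be declared differentially expressed at a chosen level, e.g., 0.05, 0.01 or 0.001, when the returned probability is equal to or smaller than that level.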
  • An inverse relationship exists between the level of expression and the size of the difference that can be evaluated between samples.
  • For example, a 2-fold change for a gene that is expressed at a level of about 30-40 copies per million can be detected.
  • For genes that are expressed at a higher abundance, a substantially smaller difference can be detected.
  • An about 40% difference can be ascertained for genes that are expressed at about 200 copies per million. See, e.g., Figure 2 which illustrates the inverse relationship between the level of gene expression and fold differences that can be detected by the normal approximation test at three levels of significance.
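The inverse relationship between expression level and detectable fold difference can be explored numerically. The sketch below is not a reproduction of Figure 2: it assumes equal sample sizes of one million signatures, a pooled-proportion Z-test, a search over fold increases in steps of 0.01, and a base abundance greater than zero. All names and default values are hypothetical.

```python
import math


def two_tailed_p(p1, p2, n1, n2):
    """Two-tailed normal approximation probability for two proportions (pooled form)."""
    p0 = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p0 * (1.0 - p0) * (1.0 / n1 + 1.0 / n2))
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2.0))


def min_detectable_fold(base_per_million, n=1_000_000, alpha=0.001):
    """Smallest fold increase (to 0.01) that reaches significance level alpha.

    base_per_million -- abundance in the first sample, copies per million (> 0)
    n                -- signatures sequenced per sample (assumed equal for both)
    """
    p1 = base_per_million / 1_000_000
    for k in range(101, 10001):          # folds 1.01, 1.02, ..., 100.00
        fold = k / 100
        if two_tailed_p(p1, p1 * fold, n, n) <= alpha:
            return fold
    return None                          # not detectable within the search range
```

Calling `min_detectable_fold` at increasing base abundances shows the required fold difference shrinking, which is the trend the text describes.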
  • FIG. 3 Panels A and B is a flow chart illustrating a method of differential gene expression analysis of MPSS data using a normal approximation test as the statistical test of significance as described in Steps 300-312.
  • a significance level is chosen, e.g., 0.05, 0.01, or 0.001.
  • proportions p1 and p2 are calculated from x1 and x2 for each stepper dataset, e.g., a 2-stepper and a 4-stepper.
  • the sum of the proportions p1 and p2 for a 2-stepper and the sum of the proportions p1 and p2 for a 4-stepper are compared in Step 304.
  • If the sum of proportions of the 2-stepper is greater than the sum of proportions of the 4-stepper, then the statement is true and the 2-stepper dataset is used, as in Step 305. If the sum of proportions of the 2-stepper is less than the sum of proportions of the 4-stepper, then the statement is false and the 4-stepper dataset is used, as in Step 306.
  • In Panel B, Step 308, the probability of differential expression value is calculated. If this value is greater than the significance level, the statement is true, as in Step 311, which indicates no differential expression. If this value is less than the significance level, the statement is false, as in Step 312, which indicates differential expression.
  • other statistical tests of significance can also be used, e.g., a chi-square test, e.g., a Fisher Exact test, e.g., a generalized linear model, e.g., Audic and Claverie's Bayesian method and the like.
  • a chi-square test calculates a probability value for the relationship between two dichotomous variables. It is an estimated value of the true probability value.
  • a chi-squared test is used to statistically analyze the data when the sample size (n) is greater than 5. Chi-square tests can also be extended to cover situations where there are more than two categories of possible outcome. Information on statistical tests can be found in a variety of places, such as textbooks, papers and the World Wide Web. For example, see Fisher and van Belle, (1993) Biostatistics: A Methodology for the Health Sciences.
  • a Fisher Exact test can also be used.
  • the Fisher Exact test calculates an exact probability value for the relationship between two dichotomous variables. It works in a similar manner as a chi-squared test, except that a Fisher Exact test calculates an exact probability rather than an estimate of the true probability value generated from a chi-squared test. Typically, the Fisher Exact test is used when the sample size is small.
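As an illustration of how an exact probability differs from an approximation, the sketch below computes a two-sided Fisher Exact probability for the 2x2 table formed by one signature in two samples (x1 of n1 versus x2 of n2). It uses the standard hypergeometric definition and the common convention of summing all tables no more probable than the observed one; that convention, the function names and the use of log-gamma for numerical stability are choices made for this example, not details from the text.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided exact probability for counts x1 of n1 versus x2 of n2."""
    total_hits = x1 + x2
    lo = max(0, total_hits - n2)         # smallest feasible count in sample 1
    hi = min(total_hits, n1)             # largest feasible count in sample 1
    log_denom = _log_comb(n1 + n2, total_hits)

    def log_prob(a):
        # Hypergeometric probability of seeing a of the hits in sample 1.
        return _log_comb(n1, a) + _log_comb(n2, total_hits - a) - log_denom

    observed = log_prob(x1)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance absorbs floating-point ties.
    p = sum(math.exp(log_prob(a)) for a in range(lo, hi + 1)
            if log_prob(a) <= observed + 1e-7)
    return min(1.0, p)
```

The loop runs over at most x1 + x2 + 1 tables, so the cost depends on the signature's count rather than on the sample size.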
  • the data can be categorical data, e.g., data generated from MPSS.
  • Data are categorical data when a signature sequence in a particular dataset is either present at a certain proportion or absent.
  • Other types of data optionally used in the present invention include, e.g., continuous data, normalized data, analog data, e.g., data generated from microarrays (where the results are represented by a ratio of the fluorescent levels of two probes) and the like.
  • Data can be collected from several different samples and a plurality of sets of pairwise comparisons can be imported into a database, e.g., a relational database, and the data can then be mined for biologically meaningful changes in gene expression.
  • the data are collected from a central database and the data are stored on a local system.
  • values generated e.g., the probability of differential expression value, first parameter, the second parameter, and/or the like, are stored in a storage medium.
  • Comparison of the two different samples can yield information regarding the expression pattern, e.g., attributable to change in environment, attributable to differences in a parent and a child, attributable to a state of differentiation, attributable to a treatment with an agent(s), attributable to a disease state, attributable to an injured state, attributable to differences in species, attributable to different cell types, attributable to different times, attributable to disparate species, attributable to disparate organs and the like.
  • any two samples that differ from one another can be compared.
  • Examples include but are not limited to a first cell type and a second cell type, a first normal sample and a second other than normal sample, a first species sample and a second species sample, a first recombinant inbred line and a second recombinant inbred line, a first state sample and a second state sample, a first treated sample and a second untreated sample, a first infected sample and a second uninfected sample, a first disease-state sample and a second non-disease state sample, a first injured state sample and a second non-injured state sample, a first developmental stage sample and a second different developmental stage sample, and the like.
  • a variety of techniques can be used to generate the data in the present invention.
  • massively parallel signature sequencing is used to generate the data used in differential gene expression.
  • Other examples include SAGE data, microarrays and cDNA fragment profiling methods. See, e.g., Brenner et al., (2000), Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nature Biotech., 18:630-634; Tyagi, (2000), Taking a census of mRNA populations with microbeads, Nature Biotech.
  • Massively parallel signature sequencing is one technology available for in-depth quantitative expression profiling. It is designed for large-scale counting of individual mRNA molecules in a sample. MPSS provides data for all genes in a tissue or cell sample, not just those that have been previously identified and characterized. No prior knowledge of a gene's sequence is required for MPSS; thus, gene expression datasets can be generated from any organism. In addition, MPSS has a high sensitivity level. Anywhere from about 100,000 to about ten million molecules are typically counted in any given sample, so that even genes that are expressed at low levels can be quantified with high accuracy. Typically, an MPSS dataset involves, e.g., from greater than about 100,000 signature sequences to about 750,000 signature sequences.
  • MPSS is a "digital" gene expression tool that counts all mRNA molecules simultaneously. Counting mRNAs with MPSS is based on the ability to uniquely identify every mRNA in a sample.
  • MPSS signatures for mRNAs in a sample are generated by sequencing double-stranded cDNA fragments cloned onto microbeads using the Lynx Megaclone technology.
  • a clone refers to a single microbead from which 17 or more bases have been sequenced to create a signature sequence tag from an individual cDNA molecule that has been cloned into the Megaclone library. Fragments from 100,000-10,000,000 individual cDNA molecules from a sample are cloned on to 100,000-10,000,000 separate microbeads using, e.g., the procedure described in Brenner et al., supra, PNAS, thereby making a Megaclone library of cloned cDNA fragments.
  • the DNA is sequenced through an automated series of adaptor ligations and enzymatic steps.
  • Two procedures typically used, e.g., as independent sampling procedures, involve either a 4-stepper or a 2-stepper, which differ by using two alternative reading-frame adaptors.
  • In a 4-stepper, the process is initiated by ligating an adaptor molecule to the GATC (DpnII) single-stranded overhangs, and then digesting the samples with BbvI, which is a type IIS restriction enzyme that cuts the DNA at a position 9-13 nucleotides away from the recognition sequence. This produces molecules with a 4-base single-stranded overhang immediately adjacent to the DpnII recognition sequence.
  • Another set of adaptors, called encoded adaptors, is hybridized and ligated to the 4-base overhangs on each molecule.
  • the encoded adaptors contain a 4 base single stranded overhang with all possible nucleotide combinations at one end, and a single stranded coded sequence at the other end.
  • One member of the encoded adaptor set will find a partner on the DNA molecules attached to the beads in the flow cell.
  • the exact sequence of each encoded adaptor that hybridizes to the DNA on a microbead is decoded through 16 different sequential hybridization reactions with a set of fluorescent decoder probes. This process yields the first 4 nucleotides at the end of each molecule.
  • the encoded adaptor from the first round is removed by digestion with BbvI, and the process is repeated several times. In the end, a signature sequence of 17 or more bases is generated for each bead in the flow cell.
  • In a 2-stepper, the process is the same as described above for a 4-stepper, but sequence is obtained in a different reading frame that is staggered by two bases compared to the 4-stepper.
  • the recognition site for the type IIS restriction enzyme, e.g., BbvI, used to expose the first four nucleotides to identify the signature sequence, is located 11 nucleotides from the GATC site at the end of the adaptor.
  • the recognition site for the type IIS restriction enzyme, e.g., BbvI, used to expose the first four nucleotides to identify the signature sequence, is located 9 nucleotides from the GATC site at the end of the adaptor.
  • the difference between the 2-stepper protocol and the 4-stepper protocol allows the choice of what overhang will be produced after the first restriction enzyme, e.g., BbvI, digestion.
  • the datasets generated with the two different adaptors are different, because a different set of four base-pair overhangs will be generated for each signature sequence depending on whether a 2-stepper or 4-stepper protocol is used.
  • Each exposed four base pair can potentially contain a palindromic structure, e.g., 16 of 256 different possible four base-pair overhangs.
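The figure of 16 palindromic overhangs out of 256 can be checked by enumeration. In this short Python sketch, an overhang is counted as palindromic when it equals its own reverse complement; the function names are illustrative assumptions.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def palindromic_overhangs(length=4):
    """All overhangs of the given length that equal their own reverse complement."""
    candidates = ("".join(bases) for bases in product("ACGT", repeat=length))
    return [seq for seq in candidates if seq == reverse_complement(seq)]

all_overhangs = 4 ** 4                  # 256 possible four-base overhangs
palindromes = palindromic_overhangs()   # 16 of them are palindromic, e.g. GATC
```

The count is 16 because the first two bases determine the last two, giving 4 x 4 combinations.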
  • the dataset generated and the biases make the 2-stepper and 4-stepper protocols independent sampling methods. See Figure 9, Panels A and B. Similar caveats can be encountered when, e.g., comparing SAGE to MPSS or Affymetrix chips to conventionally spotted cDNA arrays.
  • the data analysis of the present invention takes into consideration the inconsistencies that can exist in large datasets obtained with independent sampling procedures.
  • an individual signature sequence from the population of signature sequences is generated from one or more stepper procedures, e.g., a 2-stepper or a 4-stepper, as described above.
  • the methods include choosing the stepper, e.g., 2-stepper or 4-stepper, dataset that produces a highest value of the individual signature sequence.
  • an individual signature sequence from the population of signature sequences is an average from more than one 2-steppers or from more than one 4-steppers.
  • a standard deviation for the average can be calculated.
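The averaging and selection described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: replicate runs of the same stepper type are averaged, the sample standard deviation (n - 1 denominator) is used, and the stepper type with the highest average is chosen. The function names and the abundance values in the usage note are hypothetical.

```python
from statistics import mean, stdev

def summarize_runs(abundances):
    """Average and sample standard deviation of replicate runs of one stepper type."""
    avg = mean(abundances)
    # A standard deviation needs at least two runs; report 0.0 for a single run.
    sd = stdev(abundances) if len(abundances) > 1 else 0.0
    return avg, sd

def pick_highest_stepper(runs_by_stepper):
    """Return (stepper name, average, sd) for the stepper type with the highest average."""
    summaries = {name: summarize_runs(values) for name, values in runs_by_stepper.items()}
    best = max(summaries, key=lambda name: summaries[name][0])
    return (best, *summaries[best])
```

For example, `pick_highest_stepper({"2-stepper": [10.0, 14.0], "4-stepper": [20.0, 22.0]})` selects the 4-stepper data, with an average of 21.0 for that signature sequence.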
  • SAGE is another transcript counting technique that generates a tag sequence for each mRNA. It also generates a digital gene expression profile. SAGE is based on the principles that a short sequence tag derived from a defined position in an mRNA can uniquely identify the transcript, and that concatenation of the tags allows for high-throughput sequencing. The length of the SAGE tag is about 10 to about 14 nucleotides. The tag sequence is determined using conventional sequencing technologies. See the following publications and references cited within: Velculescu et al., (1995), Serial analysis of gene expression, Science, 270:484-487; and Zhang et al., (1997), Gene expression profiles in normal and cancer cells, Science, 276:1268-1272.
  • the frequency of a sequence tag derived from the corresponding mRNA transcript is measured.
  • adjustments to consider bias and normalization are optionally included in the present invention. See, e.g., Margulies et al., (2001) Identification and prevention of a GC content bias in SAGE libraries, Nucleic Acids Res., 29(12):E60-0.
  • Microarrays are also technologies that can be used in the present invention.
  • a microarray is a solid support that contains a variety of genes.
  • the mRNAs from the sample are then allowed to hybridize to the microarray.
  • Microarrays have the advantage of high throughput analysis of multiple samples.
  • some or all of a variety of variables should be considered. These variables include, e.g., that the desired genes are represented on a given array.
  • a microarray exists for the organism of interest.
  • the detection sensitivity is optimized to achieve detection of low expressed genes.
  • a sample is compared with a control sample to compensate for several sources of bias and noise in the intensity results. Typically, the experiment is replicated several times to provide a more reliable dataset.
  • the techniques described above are examples of techniques that can be used to generate the data used in the present invention, where the methods include analyzing greater than about 150,000 mRNA molecules in a sample.
  • the sequenced mRNA can be a partial or full sequenced mRNA.
  • the methods include, where the population of signature sequences represent, e.g., greater than about 150,000 mRNA molecules, e.g., greater than about 200,000 mRNA molecules, e.g., greater than about 300,000 mRNA molecules, e.g., greater than about 500,000 mRNA molecules, e.g., greater than about 750,000 mRNA molecules, or e.g., greater than about 1,000,000 mRNA molecules, in a sample.
  • one or more abundance estimates for an individual signature sequence from the population of signature sequences are generated from one or more independent sampling procedures, e.g., MPSS separate "stepper" datasets.
  • one stepper dataset is generated by a 2-stepper procedure ("2-stepper") and another stepper dataset is generated by a 4-stepper procedure ("4-stepper").
  • the methods include choosing the independent sampling procedure, e.g., the stepper dataset, which produces a highest value of the individual signature sequence.
  • an individual signature sequence from the population of signature sequences is an average from more than one 2-steppers or from more than one 4-steppers.
  • a standard deviation for the average can be calculated.
  • the methods of the present invention can further include a comparison test of the p1 and p2 to assign a set of parameters to the samples being analyzed.
  • the comparison test includes the following steps: (1) choosing either the data from a first independent sampling procedure or the data from a second independent sampling procedure based on the procedure that gave a higher total value, where the higher total value is the p1 plus the p2, thereby providing the chosen data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both the first and second procedures.
  • the steps can include: (1) choosing either the data from the first and the second sample generated by a 2-stepper or the data from the first and the second sample generated by a 4-stepper based on the procedure that gave a higher total stepper value, where the higher total stepper value is the p1 plus the p2, thereby providing the chosen stepper data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen stepper data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both stepper procedures.
  • a first parameter is assigned to the chosen stepper data, where the first parameter is a T when the 2-stepper is used for the comparison test or is an F when the 4-stepper is used for the comparison test.
  • This first parameter provides a stepper value.
  • the first parameter is optionally stored in a database. This is illustrated in the flow chart in Figure 4.
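Step (1) of the comparison test and the assignment of the first parameter can be sketched in Python as below. Each stepper dataset is represented as a (p1, p2) pair for one signature sequence. The function name is hypothetical, and resolving a tie in favor of the 2-stepper is an assumption, since the text does not say how ties are handled.

```python
def choose_stepper(two_stepper, four_stepper):
    """Pick the stepper data with the higher total value (p1 + p2).

    Each argument is a (p1, p2) pair for one signature sequence.
    Returns (first_parameter, chosen_pair), where the first parameter is
    "T" for the 2-stepper and "F" for the 4-stepper.
    """
    # Assumption: a tie is resolved in favor of the 2-stepper.
    if sum(two_stepper) >= sum(four_stepper):
        return "T", two_stepper
    return "F", four_stepper
```

For instance, `choose_stepper((10e-6, 40e-6), (5e-6, 20e-6))` returns `"T"` with the 2-stepper pair, because the 2-stepper total is the higher of the two.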
  • a second parameter is assigned to each signature sequence from the population of signature sequences, where the second parameter is 1, 2, 4, 14, or 24.
  • a 1 is assigned when a downward change is found in the comparison test.
  • a 2 is assigned when an upward change is found in the comparison test.
  • a 4 is assigned when an inconsistent change is found in the comparison test.
  • a 14 is assigned when a downward change is found in the comparison test and only one stepper dataset is used, e.g., either a 2-stepper procedure or a 4-stepper procedure, and a 24 is assigned when an upward change is found in the comparison test and only one of the stepper datasets is used.
  • a 1 is assigned when the downward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the downward change.
  • a 2 is assigned when the upward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the upward change.
  • a 4 is assigned when the direction of change is inconsistent.
  • a 14 is assigned when the downward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used.
  • a 24 is assigned when the upward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper can be used.
  • a 1 is assigned when a downward change is found in the comparison test, and more than one independent sampling method, e.g., 2-stepper and 4-stepper, exhibit a downward change in the same direction.
  • a 2 is assigned when an upward change is found in the comparison test, and more than one independent sampling method, e.g. 2-stepper and 4-stepper, exhibit an upward change in the same direction.
  • a 4 is assigned when a change is found in the comparison test, and when the change with an independent sampling method, e.g. 2-stepper and 4-stepper, is inconsistent.
  • a 14 is assigned when a downward change is found in the comparison test and when data from only one independent sampling procedure, e.g. only one stepper dataset, can be used, and a 24 is assigned when an upward change is found in the comparison test and when data from only one independent sampling procedure, e.g. only one of the stepper datasets, can be used.
  • This second parameter provides a direction value and, typically, can provide a way of identifying the gene expression differences that are consistently observed with independent sampling methods.
  • the second parameter is optionally stored in a database.
  • Figure 5 illustrates a method of assigning parameter 2, the direction value.
  • In Step 500, the p1 value from a first sample is compared to the p2 value in the second sample. If there is an inconsistent change between the values, e.g., a downward change when examining the 2-stepper data and an upward change when examining the 4-stepper data, the signature sequence is assigned a 4 as a direction value. If there is not an inconsistent change, the direction of the change is determined as in Step 502. If p1 > p2 is true, there is a downward change and Step 504 is followed. In Step 504, data are separated on the basis of whether one or more stepper procedure can be used.
  • If only one stepper procedure, e.g., either a 2-stepper or a 4-stepper, is used, e.g., due to a zero value in both the p1 and p2 of the one stepper, as in Step 502, the statement in Step 504 is true and a number 14 is assigned to the parameter 2 as in Step 506. If the statement in Step 504 is false, more than one stepper procedure can be used and a number 1 is assigned to the parameter, as in Step 507. If p1 > p2 is false in Step 502, there is an upward change and Step 508 is followed. In Step 508, data are separated on the basis of whether one or more stepper procedure can be used.
  • If only one stepper procedure, e.g., either a 2-stepper or a 4-stepper, is used in Step 502, e.g., due to a zero value in both the p1 and p2 of one of the steppers, the statement in Step 508 is true and a number 24 is assigned to the parameter 2 as in Step 510. If the statement in Step 508 is false, more than one stepper procedure can be used and a number 2 is assigned to the parameter, as in Step 512.
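The decision flow for the direction value can be sketched in Python as follows. This is an illustrative reading of the steps above, not a definitive implementation: each stepper dataset is a (p1, p2) pair, a dataset is treated as unusable when both values are zero, and the direction is read from the usable dataset with the higher total. The function names are hypothetical.

```python
def direction(p1, p2):
    """'down' when p1 > p2, otherwise 'up' (the Step 502 test)."""
    return "down" if p1 > p2 else "up"

def usable(pair):
    """Assumption: a stepper dataset cannot be used when both p1 and p2 are zero."""
    return pair[0] != 0 or pair[1] != 0

def direction_value(two_stepper, four_stepper):
    """Assign the second parameter (1, 2, 4, 14 or 24) to one signature sequence."""
    usable_pairs = [pair for pair in (two_stepper, four_stepper) if usable(pair)]
    if not usable_pairs:
        raise ValueError("no usable stepper data for this signature sequence")
    # Inconsistent change: both datasets usable but pointing in opposite directions.
    if len(usable_pairs) == 2 and direction(*two_stepper) != direction(*four_stepper):
        return 4
    only_one = len(usable_pairs) == 1
    chosen = max(usable_pairs, key=sum)  # dataset with the higher p1 + p2
    if direction(*chosen) == "down":
        return 14 if only_one else 1
    return 24 if only_one else 2
```

With this sketch, `direction_value((40, 10), (30, 5))` yields 1 (a downward change seen with both steppers), while `direction_value((40, 10), (0, 0))` yields 14 because only the 2-stepper data can be used.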
  • the methods of the present invention optionally include filtering the data.
  • the filtering step occurs before the data analysis step.
  • the methods can further include filtering the population of signature sequences by removing sequences from the population of signature sequences that contain one or more ambiguous nucleotides.
  • the data can be filtered by removing signature sequences that fail to meet a minimum abundance test.
  • the minimum abundance test includes removing signature sequences that, when analyzing the samples with a normal approximation test, cannot be distinguished from zero.
  • an individual signature sequence with the highest abundance across all MPSS runs that is less than about 4 per million is not significantly different from zero when a p < 0.05 significance level is chosen, and e.g., an abundance of 6 is not significantly different from zero when a p < 0.01 significance level is chosen, and e.g., an abundance of 11 is not significantly different from zero when a p < 0.001 significance level is chosen.
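A minimum abundance test of this kind can be sketched in Python as below. The statistic used here is an assumption: a pooled two-proportion z-test comparing an observed count per million against a count of zero in a run of the same size. Under that assumption z is close to the square root of the count, which gives cutoffs near those quoted above, though not necessarily identical at every significance level. The function names are hypothetical.

```python
from math import erfc, sqrt

def two_tailed_p(count1, n1, count2, n2):
    """Two-tailed p-value from a pooled two-proportion normal approximation."""
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both counts are zero: no evidence of a difference
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))

def passes_minimum_abundance(count_per_million, alpha=0.05):
    """True when the abundance can be distinguished from zero at the chosen level."""
    return two_tailed_p(count_per_million, 1_000_000, 0, 1_000_000) < alpha
```

Under these assumptions an abundance of 3 per million fails the test at the 0.05 level, and an abundance of 6 per million fails it at the 0.01 level.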
  • the filter removes sequences that do not match a known genome.
  • signature sequences can be matched to known genes by comparison with data available in genomic and/or EST sequence databases, e.g., the National Center for Biotechnology Information (NCBI), as described below. If no match is found, the signature sequence is removed from further analysis. Alternatively, if multiple matches are found, e.g. to a repeated sequence, the signature is removed from further analysis. Typically, signatures that match a genome sequence at about 16 or more nucleotides out of about 17 and have about 1-3 matches in the genome are retained for further analysis.
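The ambiguous-nucleotide and genome-match filters can be sketched together in Python. The sketch assumes that the number of matched bases and the number of genome hits for each signature have already been obtained from a sequence search; the function and parameter names are hypothetical, and the default thresholds follow the typical values given above (16 or more matched bases, 1 to 3 genome matches).

```python
def has_ambiguous_base(signature):
    """True when the signature contains any character other than A, C, G or T."""
    return any(base not in "ACGT" for base in signature.upper())

def keep_signature(signature, matched_bases, genome_hits, min_matched=16, max_hits=3):
    """Retain a signature only if it is unambiguous and matches the genome acceptably."""
    if has_ambiguous_base(signature):
        return False
    # No match (0 hits) and repeated sequences (too many hits) are both removed.
    return matched_bases >= min_matched and 1 <= genome_hits <= max_hits
```

For example, a 17-base signature matching at 17 bases with a single genome hit is retained, while the same signature with 12 hits, or one containing an N, is removed from further analysis.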
  • Methods of the present invention include generating tables for the data.
  • a table can include information regarding one sample, e.g., a list of unique signature sequences from the population of signature sequences correlated with an abundance estimate or level for each of the unique signature sequences found in each sample, and/or the probability of differential expression value calculated when a comparison is made with a different sample.
  • the table is generated from information regarding a plurality of samples.
  • the methods of the present invention also include methods of gathering information on a signature sequence.
  • the method includes collecting data from a plurality of samples, each sample comprising a population of signature sequences that is generated from greater than about 150,000 mRNA molecules or cloned cDNA copies; importing the data into a database; gathering information on at least one unique signature sequence from the population of signature sequences from at least one other database or at least one web page from the Internet or a combination of both, thereby generating unique information for the at least one unique signature sequence; and providing the unique information for the at least one unique signature sequence to a user or an automated system.
  • the other database can be a variety of databases, e.g., a sequence database, a gene annotation database, a publication database and/or the like.
  • the World Wide Web and/or other computer networks or internetwork are also included in the present invention as a source to gather information on a signature sequence.
  • the methods of the present invention optionally include identifying at least one gene from the population of signature sequences and collecting information about at least one gene in a single computer file from distributed sites.
  • the population of signature sequences can be batch-blasted against a variety of databases, e.g., the various NCBI databases, e.g., NR, UniGene, EST and the like, using a network client, e.g., NCBI's network BLAST client 'blastcl3,' or using a local client, e.g., local client 'blastall,' to generate individual BLAST files of hits against each of the variety of databases.
  • the different BLAST outputs can be read and further analyzed. For example, hits can be ordered, flagged based on desired criteria, sorted by the flag(s) and associated with gene identifications from the database(s).
  • Figure 6 illustrates a method of parsing multi-database hits, e.g., BLAST hits, in a desired hit order, flagging hits based on desired criteria, sorting the hits by flag and appending gene identifications from a database, e.g., UCSC GoldenPath, in parallel.
  • a population of signature sequences is provided, e.g., MPSS signature list.
  • signature sequences are batch-blasted against various NCBI databases, e.g., NR Blast (BLAST).
  • the desired number of matched bases includes a value that can identify possible gene matches, e.g., about 9 matched bases to about the entire gene.
  • Upon extracting matches for an individual signature sequence based on a desired number of matched bases, e.g., NR Blast output with all human exact (1-17 bases) matches, this output is further divided as in Steps 604 and 606.
  • the matches from the NR Blast output are separated into a reference sequence category, e.g., RefSeq (Category 1), or separated into a chromosomal category, e.g., Chromosomal hits (Category 2), as in Step 606.
  • In Step 608, signature sequences are batch-blasted against a different NCBI database(s), e.g., UniGene Blast (BLAST).
  • a third Category of hits is provided, e.g., UniGene (Category 3) hits, as in Step 610.
  • UniGene (Category 3) hits are provided in Step 612.
  • all qualifying hits from each of the three categories, e.g., Category 1, Category 2 and Category 3, are combined in a desired order, e.g., RefSeq, UniGene and Chromosomal Hits.
  • Signature sequences are flagged based on desired criteria, e.g., number of different chromosomes with matches, total number of hits obtained, etc.
  • Step 613 illustrates an example of a flagging scheme where signature sequences are flagged with an "A” if the signature sequence has a unique chromosome hit or the total qualifying hits is less than a desired cutoff, or signature sequences are flagged with a "B” if the signature sequence does not fall into one of the descriptions in "A.”
  • the parsed hits can be sorted by flag, as in Step 614.
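The ordering, flagging and sorting of hits can be sketched in Python as follows. This is an illustrative sketch only: hits are represented as (category, chromosome) tuples, the category names and the cutoff of 5 total hits are placeholder assumptions, and the flag rule follows the example scheme described for Step 613.

```python
# Desired hit order: RefSeq first, then UniGene, then chromosomal hits.
CATEGORY_ORDER = {"RefSeq": 0, "UniGene": 1, "Chromosomal": 2}

def flag_signature(hits, cutoff=5):
    """Flag 'A' for a unique chromosome hit or fewer total hits than the cutoff, else 'B'."""
    chromosomes = {chrom for _, chrom in hits if chrom is not None}
    if len(chromosomes) == 1 or len(hits) < cutoff:
        return "A"
    return "B"

def parse_hits(hits_by_signature, cutoff=5):
    """Order each signature's hits by category, flag the signature, then sort by flag."""
    parsed = []
    for signature, hits in hits_by_signature.items():
        ordered = sorted(hits, key=lambda hit: CATEGORY_ORDER[hit[0]])
        parsed.append((flag_signature(hits, cutoff), signature, ordered))
    # Python's sort is stable, so signatures keep their input order within a flag.
    parsed.sort(key=lambda record: record[0])
    return parsed
```

A signature whose hits all fall on one chromosome is flagged "A" and sorted ahead of a signature with six hits spread over several chromosomes, which is flagged "B".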
  • In Step 616, known or predicted identified genes are verified and/or added via a database, e.g., UCSC GoldenPath, lookup for each parsed signature sequence.
  • parsed hits can be sorted by flags with simultaneous appending of annotation from UCSC GoldenPath for known genes from RefSeq, Acembly gene predictions with alternative-splicing, Ensembl gene predictions, Fgenesh++ gene predictions, and Genscan gene predictions, e.g., based on BLAST chromosomal hits, BAC clone hits or other hits.
  • the signature sequence coordinates and orientation can be determined by accessing the corresponding UCSC Genome Browser page, and the most likely candidate gene(s) can be determined by matching orientations and coordinates of the signature sequence and the various gene candidates. In one embodiment, more weight is given to the gene(s) with the signature sequence located near their 3' ends in the same orientation.
  • In Step 617, signature sequences that were excluded from the analysis, e.g., due to the absence of 1-17 base matches, can be further analyzed by repeating Steps 600-616 once or multiple times based on a different, e.g., reduced, number of matched bases, e.g., 1-16 base matches in the BLAST output files, 1-15 base matches in the BLAST output files and/or the like.
  • Figure 7 illustrates a sample of gene identifications for 10 signature sequences.
  • a variety of methods for determining relationships between two or more sequences are available, and well known in the art, including manual alignment and computer assisted sequence alignment and analysis.
  • a number of algorithms for performing sequence alignment are widely available, or can be produced by one of skill, including: the local homology algorithm of Smith and Waterman (1981) Adv. Appl. Math. 2:482; the homology alignment algorithm of Needleman and Wunsch (1970) J. Mol. Biol. 48:443; the search for similarity method of Pearson and Lipman (1988) Proc. Natl. Acad. Sci.
  • HSPs high scoring sequence pairs
  • initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them.
  • the word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always > 0) and N (penalty score for mismatching residues; always < 0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached.
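The three halting conditions can be illustrated with a simplified Python sketch of an ungapped extension in one direction (to the right of the seed). This is not the BLAST implementation; the function name and the values chosen for M, N and X are illustrative assumptions, and the seed word is assumed to be an exact match.

```python
def extend_right(query, subject, q_start, s_start, word_len, match=1, mismatch=-3, x_drop=5):
    """Extend a seed word hit to the right without gaps.

    Returns (best cumulative score, number of bases added to the seed at that score).
    """
    score = best = word_len * match  # the seed itself is assumed to match exactly
    best_len = 0
    q, s = q_start + word_len, s_start + word_len
    i = 0
    # Halting condition 3: the end of either sequence is reached.
    while q + i < len(query) and s + i < len(subject):
        score += match if query[q + i] == subject[s + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        # Halting condition 1: score fell off by X from its maximum.
        # Halting condition 2: cumulative score went to zero or below.
        if best - score >= x_drop or score <= 0:
            break
    return best, best_len
```

For the query `ACGTACGTTTTT` and subject `ACGTACGAAAAA` with a 4-base seed at position 0, the extension gains three matching bases (score 7) and then stops after two mismatches, once the score has dropped by at least X from that maximum.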
  • the BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment.
  • the BLASTP program uses as defaults a wordlength (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see Henikoff & Henikoff (1992) Proc. Natl. Acad. Sci. USA 89:10915).
  • the BLAST algorithm performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin & Altschul (1993) Proc. Nat'l. Acad.
  • One measure of similarity provided by the BLAST algorithm is the smallest sum probability (p(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance.
  • a nucleic acid is considered similar to a reference sequence (and, therefore, in this context, homologous) if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.1, or less than about 0.01, or even less than about 0.001.
  • Another example of a useful sequence alignment algorithm is PILEUP.
  • PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.
  • PILEUP uses a simplification of the progressive alignment method of Feng & Doolittle (1987) J. Mol. Evol. 25:351-360. The method used is similar to the method described by Higgins & Sharp (1989) CABIOS 5:151-153.
  • the program can align, e.g., up to 300 sequences of a maximum length of 5,000 letters.
  • the multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences.
  • Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences.
  • the final alignment is achieved by a series of progressive, pairwise alignments.
  • the program can also be used to plot a dendrogram or tree representation of clustering relationships.
  • the program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison.
  • information gathered on a specific signature sequence can be obtained from the Internet, which comprises computers and computer networks that are interconnected through communication links.
  • the interconnected computers exchange information using various services, such as electronic mail, ftp, the World Wide Web (“WWW”), and other services including secure and/or non-public services.
  • the WWW service can be understood as allowing a server computer system (e.g., a Web server or a Web site) to send Web pages of information to a remote client computer system.
  • the remote client computer system can then display the Web pages.
  • each resource e.g., computer or Web page
  • URL Uniform Resource Locator
  • IP Internet Protocol
  • a client computer system specifies the URL for that Web page in a request.
  • the request is forwarded to the Web server that supports that Web page.
  • When that Web server receives the request, it sends that Web page to the client computer system.
  • When the client computer system receives that Web page, it typically displays the Web page using a browser.
  • a browser is a special-purpose application program that effects the requesting of Web pages and the displaying of Web pages.
  • Web pages can be defined using HyperText Markup Language (HTML) or similar constructs that allow a server computer to indicate how a Web page looks and/or behaves. HTML provides a standard set of tags that define how a Web page is to be displayed.
  • When a user indicates to the browser to display a Web page, the browser sends a request to the server computer system to transfer to the client computer system data and/or one or more electronic documents that define the Web page. When the requested data is received by the client computer system, the browser displays the Web page as defined by the document and performs indicated behaviors.
  • the document contains various tags that control the displaying of text, graphics, controls, and other features.
  • the HTML document can contain URLs of other Web pages available on that server computer system or other server computer systems.
  • a client system is provided with a set of interfaces that allow a user to select the samples that they want to compare.
  • the client system displays information regarding the expression data and displays an indication of an action that a user is to perform to request a comparison.
  • the client system sends to a server system the necessary information to access signature sequence information.
  • the server system uses the request data, and optionally one or more sets of server data, to process the request.
  • a client system is, or has previously been, provided with an executable code file that allows the client system to receive the data and analyze the data.
  • Computer program products are also aspects of the present invention. Any methods of the present invention can be performed in a computer.
  • the present invention includes a computer program product for analysis of differential gene expression.
  • the program includes code that receives as input a plurality of signature sequences from a plurality of samples; code that selects samples from the plurality of samples to be analyzed; and code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples using a statistical test of significance, where the codes are stored on a tangible medium.
  • the statistical test of significance is a two-tailed normal approximation test.
  • the program includes a code that removes signature sequences that fail to meet a minimum abundance test. Code is optionally provided that removes signature sequences that do not match a known genome, and/or that contain more than one ambiguous nucleotide.
  • the program includes code that receives as input a plurality of signature sequences from a plurality of samples, code that gathers available information on each signature sequence from other databases, code that matches the specific signature sequence with the information gathered, and code that stores the matched information in a tangible medium.
  • the program includes code that compares p1 and p2 to assign a first parameter and/or a second parameter as described above.
  • code is provided that generates tables of the signature sequences that contain the frequency of the signature sequence in a sample, and/or the probability of differential expression value determined for the signature sequence.
  • One embodiment of the invention can be implemented on a general purpose computer using a suitable programming language such as PERL, Java, C++, Cobol, C, Pascal, Fortran, PL1, LISP, assembly, etc.
  • Systems for analysis of differential gene expression are also a part of the present invention. These systems include a processor and a computer readable medium coupled to the processor, said computer readable medium storing a computer program.
  • the computer program includes code for any of the analysis used in the present invention.
  • the code can include instructions that receive as an input a plurality of signature sequences from a plurality of samples; instructions that select samples from the plurality of samples to be analyzed; and instructions that determine the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples.
  • Other relevant codes are described under the computer program products section herein.
  • Logic systems and methods such as described herein can include a variety of different components and different functions assembled in a modular fashion. Different embodiments of the invention can include different mixtures of elements and functions and can group various functions as parts of various elements. The invention is described in terms of systems that include many different innovative components and innovative combinations of innovative components and known components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification.
  • the functional aspects of the invention that are implemented on a computer can be implemented or accomplished using any appropriate implementation environment or programming language, such as PERL, C, C++, Cobol, Pascal, Java, Java-script, HTML, XML, dHTML, assembly or machine code programming, etc.
  • the present invention encompasses a variety of specific embodiments for performing these steps.
  • the request for importing the data collected can be received in a variety of ways, including through one or more graphical user interfaces provided by the collection system to a database or by the collection system receiving an email or other digital message or communication from the client system connected to the database.
  • data and/or indications can be transmitted to the database using any method for transmitting digital data, including HTML communications, FTP communications, email communications, wireless communications, etc.
  • indications of desired data can be received from a human user selecting from a graphical interface at a computing device.
  • the collection system accesses the requested data.
  • a collection system can hold data files prior to receiving a request for particular data or the collection system can create requested data while responding to a request from a user to receive the data.
  • the collection system transmits or imports the data to a client system.
  • a logic routine can be used to access the file that is transmitted.
  • Accessing the data can be done with or without the active participation of a human user.
  • a voice command can be spoken by a user
  • a key may be depressed by a user
  • a button on a scientific device can be depressed by a user, or a selection can be made by a user using any pointing device.
  • requested data can be submitted by automated equipment at a client site.
  • a scientific device can be programmed to automatically request needed datasets from a collection system according to specific embodiments of the present invention.
  • Example 1- Differential Gene expression analysis of a first sample and a second sample [0106]
  • signature sequence data e.g., generated from MPSS experiment
  • the data are imported into a database where the user can analyze the data.
  • This process is illustrated in Figure 8 Panels A, B, C, and D.
  • In Panel A Step 800, signature sequences in the database from the treated and untreated sample are grouped based on whether or not the signature sequence is a non-ambiguous signature, e.g., a signature sequence that contains one or fewer ambiguous nucleotides. If the statement is true, as in Step 801, the signature sequence is selected.
  • If the statement in Step 800 is false, the signature sequence is not selected for further analysis.
  • In Panel B Step 803, signature sequences are selected that match a known sequence in a known genome. If the signature sequence matches a known sequence in a known genome, the statement is true, as in Step 804, and the signature sequence is selected. If the statement is false, as in Step 805, the signature sequence is not selected for further analysis.
  • For each pair of samples, as in Panel C Step 806, the samples are compared using a pairwise comparison with a statistical test of significance, as in Step 807, to identify signature sequences showing a statistically significant differential expression, thereby assigning a probability of differential expression value.
  • An example of a procedure for this test is illustrated in Figure 2. An additional filtering step can also be applied to the data.
  • For a desired significance level, e.g., 0.05, 0.01, or 0.001, as in Panel D Step 808, signature sequences that meet or exceed a minimum abundance test are selected as the signature sequences of interest for the differential expression, as in Steps 810-811. If a signature sequence fails the minimum abundance test, the signature sequence is not selected for further examination or analysis, as in Step 812.
  • the samples are optionally compared to assign parameters as described in the present invention and create tables of the expression data.
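The filtering and testing steps of Example 1 can be sketched as a short pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the names `is_unambiguous`, `two_tailed_p`, `select_signatures`, `known_signatures`, `alpha` and `min_count` are hypothetical, the genome match is reduced to a set lookup, the minimum abundance test is assumed to require a raw count of at least `min_count` in at least one sample, and the statistical test is assumed to be the standard pooled two-proportion normal approximation.

```python
import math


def is_unambiguous(signature):
    """Panel A: keep signatures with at most one ambiguous (non-ACGT) base."""
    return sum(base not in "ACGT" for base in signature.upper()) <= 1


def two_tailed_p(x1, n1, x2, n2):
    """Panel C: two-tailed normal approximation (assumed pooled z-test)."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 1.0  # no information to separate the samples
    z = abs(x1 / n1 - x2 / n2) / se
    return math.erfc(z / math.sqrt(2.0))


def select_signatures(counts, n1, n2, known_signatures, alpha=0.05, min_count=10):
    """Apply Panels A-D in order; returns {signature: probability value}."""
    selected = {}
    for signature, (x1, x2) in counts.items():
        if not is_unambiguous(signature):       # Panel A
            continue
        if signature not in known_signatures:   # Panel B (hypothetical lookup)
            continue
        p_value = two_tailed_p(x1, n1, x2, n2)  # Panel C
        if p_value > alpha:                     # Panel D: significance level
            continue
        if max(x1, x2) < min_count:             # Panel D: assumed abundance rule
            continue
        selected[signature] = p_value
    return selected
```

With one million signatures sequenced per sample, a signature counted 300 times in one sample and 150 times in the other passes all four panels, while a signature counted 6 and 0 times is statistically significant at 0.05 but is removed by the assumed minimum abundance rule.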

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

This invention relates to statistical analysis of differential gene expression data using a pairwise comparison of two different samples. Methods, computer programs and systems are provided for the analysis and comparison of gene frequency distributions generated by one or more replicate samples or by independent sampling procedures.

Description

METHOD FOR THE ANALYSIS OF DIFFERENTIAL GENE EXPRESSION
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Application No. 60/341,030 filed December 11, 2001, entitled "Methods for the Analysis of Differential Gene Expression" and naming Jing-Zhong Lin et al. as the inventors. This prior application is hereby incorporated by reference in its entirety.
STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT
[0002] Not applicable.
FIELD OF THE INVENTION
[0003] The present invention relates to methods, computer programs and systems for the analysis of differential gene expression.
BACKGROUND OF THE INVENTION
[0004] Recent developments in elucidation of the sequences of many vertebrate, invertebrate, plant and microbial genomes have prompted an increasing interest in using genomic reagents to study in-depth patterns of gene expression. Difference in gene expression profiles can lead to the molecular basis of, e.g., disease, injury, development, differences in species, differences in organs, differences in cells and the like.
[0005] Many technologies are available for analyzing the expression of hundreds to thousands of genes simultaneously, e.g., DNA microarray platforms, serial analysis of gene expression (SAGE), massively parallel signature sequencing (MPSS), etc. Gene expression is even being monitored by sequencing individual clones in a non-normalized cDNA library (EST).
[0006] Some techniques use short sequences to distinguish an individual mRNA. These short sequences are sometimes called signature sequences, e.g., as in MPSS, or tags, e.g., as in SAGE. Determination of this short sequence allows identification of an individual mRNA and, possibly, the identification of the corresponding gene. The frequency of the specific short sequence in one sample can be compared with the frequency of the specific short sequence in another sample. A comparison of the two different samples can yield information regarding the expression pattern. For example, previous comparisons have examined only up to about 50,000 mRNA molecules in a cell (sample) using SAGE data. See, e.g., Man et al., (2000) POWER_SAGE: comparing statistical tests for SAGE experiments, Bioinformatics, 16(11): 953-959 and van Kampen et al., (2000) USAGE: a web-based approach towards the analysis of SAGE data, Bioinformatics, 16(10): 899-905.
[0007] However, some techniques, e.g., MPSS, can generate expression data on greater than about 150,000 mRNA molecules in a sample (cell). This type of data allows for detection of genes that are expressed at low levels within the cell. Analysis procedures are lacking for the quantity of data generated by such procedures from a single sample. The present invention uses methods, computer program products and systems to statistically analyze differential gene expression from data generated from greater than 150,000 mRNA molecules in a single sample.
SUMMARY OF THE INVENTION
[0008] The present invention statistically analyzes differential gene expression from varying samples, wherein each sample generates greater than, e.g., about 150,000 mRNA molecules. Methods, computer program products and systems are provided.
[0009] The methods include collecting data from a plurality of samples, each sample comprising a population of signature sequences that represents greater than about 150,000 mRNA molecules; importing the data into a database, e.g., a relational database; analyzing the data for changes in gene expression using a pairwise comparison of a first sample and a second sample selected from the plurality of samples using a statistical test of significance; thereby generating a probability value of a differential expression for each signature sequence; and, providing the probability of the differential expression value to a user or an automated system.
[0010] A variety of statistical tests of significance can be used. For example, the statistical tests include a two-tailed normal approximation test, a chi-square test, a Fisher Exact test, a generalized linear model, Bayesian methods and/or the like. In one embodiment, the statistical test of significance is a normal approximation test, which includes a two-tailed test using a first equation:
Figure imgf000004_0001
wherein the n1 is a number of mRNA molecules or cloned cDNA copies sequenced that represents the population of signature sequences for the first sample, and the n2 is a number of mRNA molecules or cloned cDNA copies sequenced that represents the population of signature sequences for the second sample, wherein abundance of an individual signature sequence in the first sample is represented by the x1 and the abundance of the individual signature sequence in the second sample is the x2, and wherein the p1 and the p2 are represented by a second equation and a third equation:
$$p_1 = \frac{x_1}{n_1}, \quad \text{and} \quad p_2 = \frac{x_2}{n_2};$$
wherein the p is represented by a fourth equation:
$$p = \frac{x_1 + x_2}{n_1 + n_2};$$
wherein the q is represented by a fifth equation:
$$q = 1 - p;$$
and, wherein the n1 and n2 are large, e.g., greater than about 150,000 mRNA molecules or cloned cDNA copies, e.g., greater than about 200,000 mRNA molecules or cloned cDNA copies, e.g., greater than about 300,000 mRNA molecules or cloned cDNA copies, and typically, about 300,000 to about 10,000,000 mRNA molecules or cloned cDNA copies. The difference of the p1 and the p2 is distributed according to a first expression:
Figure imgf000005_0001
[0011] The level of significance of the test is determined by calculating the probability of observing a larger λ by chance. This probability is calculated by using a sixth equation:
Figure imgf000005_0002
for λ<0; and a seventh equation:
Figure imgf000005_0003
wherein the x is the value of the concerned variable under a standard normal distribution. When Pr(x>λ) is equal to or smaller than a predetermined level of significance, e.g., 0.05, or e.g., 0.01, or e.g., 0.001, then a significant differential expression between a first sample and a second sample is declared for that particular significance level.
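The test described in paragraphs [0010] and [0011] can be sketched in a few lines. The equation images are not reproduced in this text, so the sketch assumes the standard pooled two-proportion form, λ = (p1 − p2) / sqrt(p·q·(1/n1 + 1/n2)), and assumes that the two-tailed probability is the standard normal mass beyond |λ| in both tails; the function name `normal_approximation_test` is hypothetical.

```python
import math


def normal_approximation_test(x1, n1, x2, n2):
    """Return (lam, probability) for one signature sequence.

    x1, x2: counts of the signature in the first and second sample.
    n1, n2: total molecules sequenced in each sample.
    Assumes the pooled two-proportion z statistic (an assumption; the
    source equation images are not available in this text).
    """
    p1 = x1 / n1                       # second equation
    p2 = x2 / n2                       # third equation
    p = (x1 + x2) / (n1 + n2)          # fourth equation
    q = 1.0 - p                        # fifth equation
    se = math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 0.0, 1.0                # degenerate case: no evidence of change
    lam = (p1 - p2) / se
    # Two-tailed probability under a standard normal distribution.
    probability = math.erfc(abs(lam) / math.sqrt(2.0))
    return lam, probability


lam, prob = normal_approximation_test(300, 1_000_000, 150, 1_000_000)
significant = prob <= 0.001            # compare with the chosen significance level
```

Using `math.erfc` avoids the loss of precision that `1 - cdf` would incur for the very small probabilities produced when n1 and n2 are in the hundreds of thousands or more.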
[0012] The methods of analysis include analyzing data from a sample, where the sample includes a large number of sequenced nucleic acids, e.g., mRNAs, cloned cDNA copies and/or the like. The sequenced nucleic acid can be a partially or fully sequenced mRNA, cloned cDNA copy and/or the like. The methods include where the population of signature sequences in a sample represents, e.g., greater than about 150,000 mRNA molecules, e.g., greater than about 200,000 mRNA molecules, e.g., greater than about 300,000 mRNA molecules, e.g., greater than about 500,000 mRNA molecules, e.g., greater than about 750,000 mRNA molecules, or e.g., greater than about 1,000,000 mRNA molecules.
[0013] The population of signature sequences can be generated from a variety of techniques, e.g., massively parallel signature sequencing (MPSS), e.g., microarrays, e.g., SAGE data and/or the like.
[0014] In a further embodiment, one or more abundance estimates for an individual signature sequence from the population of signature sequences are generated from one or more independent sampling procedures, e.g., MPSS separate "stepper" datasets. In one embodiment, one stepper dataset is generated by a 2-stepper procedure ("2-stepper") and another stepper dataset is generated by a 4-stepper procedure ("4-stepper"). In another aspect, the methods include choosing the independent sampling procedure, e.g., the stepper dataset, which produces the highest value of the individual signature sequence. In another embodiment, the abundance of an individual signature sequence from the population of signature sequences is an average from more than one 2-stepper or from more than one 4-stepper. In further embodiments, a standard deviation for the average can be calculated.
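The averaging and selection described above can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, the sample standard deviation is used (the text does not say which estimator is intended), and a tie between procedures is resolved in favor of the 2-stepper purely for illustration.

```python
from statistics import mean, stdev


def summarize_replicates(values):
    """Average and sample standard deviation of replicate abundance values."""
    values = list(values)
    avg = mean(values)
    sd = stdev(values) if len(values) > 1 else 0.0  # one run: no spread to report
    return avg, sd


def choose_stepper(two_stepper_runs, four_stepper_runs):
    """Pick the stepper procedure giving the higher average abundance.

    Each argument is a list of abundance values for one signature; an
    empty list means that procedure produced no estimate.
    Returns (procedure_name, average, standard_deviation).
    """
    candidates = []
    if two_stepper_runs:
        candidates.append(("2-stepper",) + summarize_replicates(two_stepper_runs))
    if four_stepper_runs:
        candidates.append(("4-stepper",) + summarize_replicates(four_stepper_runs))
    if not candidates:
        raise ValueError("no abundance estimate available")
    # max() keeps the first candidate on ties, i.e. the 2-stepper (assumption).
    return max(candidates, key=lambda c: c[1])
```

For example, `choose_stepper([10, 14], [20, 22, 24])` selects the 4-stepper, whose replicate values average 22 with a standard deviation of 2.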
[0015] Methods of the present invention include generating tables for the data. For example, a table can include information regarding one sample, e.g., a list of unique signature sequences from the population of signature sequences correlated with an abundance estimate or level for each of the unique signature sequences found in the one sample, and/or a list of the probability value of differential expression for the signature sequence. Optionally, the table is generated from information for a plurality of samples.
[0016] The methods of the present invention can optionally include filtering the data. In one embodiment, the filtering step occurs before the analyzing the data step. For example, the data can be filtered by removing sequences from the population of signature sequences that contain more than one ambiguous nucleotide, and/or that do not match to a known sequence in a known genome. Data can also be filtered by removing values that fail a minimum abundance test.
[0017] The methods use a pairwise comparison of two samples, e.g., a first sample and a second sample. One or more datasets for each sample can be used, e.g., that are replicates, or e.g., that are obtained via independent sampling procedures. The methods can be performed with anything that generates at least two different samples. Examples include, but are not limited to, a first cell type and a second cell type, a first normal sample and a second other than normal sample, a first species sample and a second species sample, a first recombinant inbred line and a second recombinant inbred line, a first state sample and a second state sample, a first treated sample and a second untreated sample, a first infected sample and a second uninfected sample, a first disease-state sample and a second non-disease state sample, a first injured state sample and a second non-injured state sample, a first developmental stage sample and a second different developmental stage sample, at least two disparate species, at least two disparate organs and the like.
[0018] A variety of parameters can be associated with a signature sequence. For example, the methods of the present invention can further include a comparison test of the p1 and p2 to assign a set of parameters to the samples being analyzed. For example, the methods can be used to compare datasets that have been generated from, e.g., replicate samples, or e.g., independent sampling procedures. Typically, with independent sampling procedures for estimating gene expression abundance, each procedure can have its own systematic biases that can influence the determination of the abundance values in a procedure-specific manner. For example, with MPSS, 2-stepper and 4-stepper data are generated via independent sampling procedures. Typically, one or more abundance estimates for an individual signature sequence for the population of signature sequences are generated from an independent sampling procedure. For the majority of signatures, an estimate of abundance is determined via both procedures, e.g., which can be similar or dissimilar, but for some signatures, e.g., those with palindromes, only one estimate is determined with one procedure or the other. In order to treat all the signatures sampled by both procedures in equivalent fashion, a comparison test can be done. The comparison test includes the following steps: (1) choosing either the data from a first independent sampling procedure or the data from a second independent sampling procedure based on the procedure that gave a higher total value, where the higher total value is the p1 plus the p2, thereby providing the chosen data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both the first and second procedures. For example, the steps can include: (1) choosing either the data from the first and the second sample generated by a 2-stepper or the data from the first and the second sample generated by a 4-stepper based on the procedure that gave a higher total stepper value, where the higher total stepper value is the p1 plus the p2, thereby providing the chosen stepper data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen stepper data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both stepper procedures.
[0019] A first parameter is assigned to the chosen stepper procedure, wherein the first parameter is a T when the 2-stepper is chosen in the comparison test or is an F when the 4-stepper is chosen in the comparison test. This first parameter provides a stepper value. The first parameter is optionally stored in a database.
[0020] A second parameter is assigned to each signature sequence from the population of signature sequences, where the second parameter is 1, 2, 4, 14, or 24. For example, a 1 is assigned when a downward change is found in the comparison test. A 2 is assigned when an upward change is found in the comparison test. A 4 is assigned when an inconsistent change is found in the comparison test. A 14 is assigned when a downward change is found in the comparison test and only one dataset is used, e.g., either from a 2-stepper procedure or a 4-stepper procedure, and a 24 is assigned when an upward change is found in the comparison test and only one dataset is used, e.g., either from a 2-stepper procedure or a 4-stepper procedure.
[0021] In one embodiment, a 1 is assigned when a downward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the downward change. A 2 is assigned when an upward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the upward change. A 4 is assigned when the direction of change is inconsistent. A 14 is assigned when the downward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used. A 24 is assigned when the upward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper can be used.
[0022] In another embodiment, a 1 is assigned when a downward change is found in the comparison test, and more than one independent sampling method, e.g. 2-stepper and 4-stepper, exhibit a downward change in the same direction. A 2 is assigned when an upward change is found in the comparison test, and more than one independent sampling method, e.g. 2-stepper and 4-stepper, exhibit an upward change in the same direction. A 4 is assigned when a change is found in the comparison test, and when the change with an independent sampling method, e.g. 2-stepper and 4-stepper, is inconsistent. A 14 is assigned when a downward change is found in the comparison test and when data from only one independent sampling procedure, e.g. a 2-stepper or a 4-stepper procedure, can be used, and a 24 is assigned when an upward change is found in the comparison test and when data from only one independent sampling procedure, e.g. a 2-stepper or a 4-stepper, can be used.
[0023] This second parameter provides a direction value and, typically, can provide a way of identifying the gene expression differences that are consistently observed with independent sampling methods. The second parameter is optionally stored in a database.
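The assignment of the first parameter (stepper value) and the second parameter (direction value) can be sketched as a single function. Several details are assumptions not stated in the text: an upward change is taken to mean p2 > p1 (from the first sample to the second), a tie in the total p1 + p2 selects the 2-stepper, and a signature with no change in direction receives `None` as its second parameter. The function names are hypothetical.

```python
def direction(p1, p2):
    """+1 for an upward change (p2 > p1), -1 for downward, 0 for no change."""
    return (p2 > p1) - (p2 < p1)


def assign_parameters(two_stepper, four_stepper):
    """Return (first_parameter, second_parameter) for one signature.

    Each argument is a (p1, p2) tuple, or None when that procedure gave
    no abundance estimate for the signature.
    """
    if two_stepper is None and four_stepper is None:
        raise ValueError("no stepper data for this signature")

    # Only one procedure usable: codes 14 (down) or 24 (up).
    if two_stepper is None or four_stepper is None:
        first = "T" if four_stepper is None else "F"
        p1, p2 = two_stepper if four_stepper is None else four_stepper
        second = {-1: 14, 1: 24, 0: None}[direction(p1, p2)]
        return first, second

    # Both procedures usable: choose the one with the higher p1 + p2.
    first = "T" if sum(two_stepper) >= sum(four_stepper) else "F"
    d2, d4 = direction(*two_stepper), direction(*four_stepper)
    if d2 == d4:
        second = {-1: 1, 1: 2, 0: None}[d2]   # consistent direction
    else:
        second = 4                             # inconsistent direction
    return first, second
```

For instance, `assign_parameters((1e-4, 3e-4), (2e-4, 5e-4))` returns `("F", 2)`: the 4-stepper gave the higher total and both procedures show an upward change.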
[0024] The methods of the present invention also include methods of gathering information on a signature sequence. The method includes collecting data from a plurality of samples, each sample comprising a population of signature sequences that is generated from greater than about 150,000 mRNA molecules; importing the data into a database; gathering information on at least one unique signature sequence from the population of signature sequences from at least one other database or at least one web page or a combination of both. This generates unique information for the at least one unique signature sequence; and provides the unique information for the at least one unique signature sequence to a user or an automated system. The other database can be a variety of databases, e.g., a sequence database, a publication database, a gene annotation database and the like.
[0025] For example, the population of signature sequences can be batch-blasted against a variety of databases, e.g., the various NCBI databases, e.g., NR, UniGene, EST and the like, using a network client, e.g., NCBI's network BLAST client 'blastcl3,' or using a local client, e.g., local client 'blastall,' to generate individual BLAST files of hits against each of the variety of databases. Typically, a hit is a match of the signature sequence to a sequence in a database based on desired criteria, e.g., number of nucleotides matching, score based on homology, the location of the match in the sequence in the database and the like. The hits are then optionally ordered, flagged based on desired criteria, sorted by the flag(s) and associated with gene identifications from the database(s). Other information of relevance associated with genes, e.g. functional information, cellular localization or references to associated literature can be collected from one or more databases, e.g. those available on the Internet, and compiled in a local database associated with any selected population of signature sequences.
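The ordering, flagging and sorting of hits can be illustrated with already-parsed hit records. This sketch does not call BLAST and does not parse BLAST output; the `Hit` record, the database priority order, and the flagging criterion (a full-length match is flagged "exact", anything shorter "partial") are all hypothetical choices standing in for the "desired criteria" mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    signature: str      # query signature sequence
    database: str       # database the hit came from, e.g. "NR"
    subject_id: str     # identifier of the matched database sequence
    matched_bases: int  # number of nucleotides matching


def order_and_flag(hits, db_order=("UniGene", "NR", "EST"), signature_length=17):
    """Flag each hit, then sort by signature, flag, and database priority."""
    rank = {db: i for i, db in enumerate(db_order)}
    flagged = []
    for hit in hits:
        flag = "exact" if hit.matched_bases == signature_length else "partial"
        flagged.append((hit, flag))
    flagged.sort(
        key=lambda pair: (
            pair[0].signature,
            pair[1] != "exact",                       # exact hits first
            rank.get(pair[0].database, len(rank)),    # unknown databases last
        )
    )
    return flagged
```

The sorted list can then be joined to gene identifications by `subject_id` in whatever annotation source is in use.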
[0026] In further embodiments, the present invention can be understood in the context of importing data over a communication media. An important application for the present invention, and an independent embodiment, is in the field of gathering information on a signature sequence over the Internet, optionally using Internet media protocols and formats, such as HTTP, RTTP, XML, HTML, dHTML, VRML, as well as image, audio, or video formats etc. Methods and apparatus of the present invention can also be used in other related situations where users access content over a communication channel, such as modem access systems, institution network systems, wireless systems, etc.
[0027] As discussed in the embodiments above and below, the methods of the present invention are optionally performed in a computer. In one embodiment, the data are collected from a central database and the data are stored on a local system. In one aspect, values calculated, e.g., the probability of differential expression value, parameters assigned, e.g., the first and/or second parameter, information gathered, and tables generated are stored in a storage medium.
[0028] Computer program products are also aspects of the present invention. For example, the present invention includes a computer program product for analysis of differential gene expression. The program includes code that receives as input a plurality of signature sequences from a plurality of samples; code that selects samples from the plurality of samples to be analyzed; and code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples using a statistical test of significance, where the codes are stored on or in a tangible medium.
[0029] In one embodiment, the statistical test of significance is a two-tailed normal approximation test. In further embodiments, the program includes a code that removes signature sequences that fail to meet a minimum abundance test and, optionally, code that removes signature sequences with ambiguous nucleotide assignments. Code is optionally provided that removes signature sequences that do not match a known genome.
[0030] In another embodiment, the program includes code that receives as input a plurality of signature sequences from a plurality of samples, code that gathers available information on at least one signature sequence from at least one different database, code that matches the specific signature sequence with the information gathered, and code that stores the matched information in a tangible medium. In another aspect, the program includes code that compares p1 and p2 to generate a first parameter and/or a second parameter as described above. In another aspect, code is provided that generates tables of the signature sequences and the frequency of the signature sequence in a sample, and/or the probability of differential expression value for the signature sequence.
[0031] Systems for analysis of differential gene expression are also a part of the present invention. These systems include a processor; and a computer readable medium coupled to the processor, said computer readable medium storing a computer program. The computer program includes code for any of the methods used in the present invention. For example, the code can include a code that receives as input a plurality of signature sequences from a plurality of samples; a code that selects samples from the plurality of samples to be analyzed; and a code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples. Other codes are described above.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] Figure 1 is a flow-chart illustrating a method of differential gene expression analysis.
[0033] Figure 2 illustrates an inverse relationship between the level of gene expression and fold differences that can be detected by a normal approximation test at three levels of significance.
[0034] Figure 3 Panels A and B is a flow-chart illustrating a method of differential gene expression analysis using a normal approximation test for the statistical test of significance.
[0035] Figure 4 is a flow-chart illustrating a method of assigning parameter 1, the stepper value.
[0036] Figure 5 is a flow-chart illustrating a method of assigning parameter 2, the direction value.
[0037] Figure 6 illustrates a method of parsing multi-database hits, e.g., BLAST hits, in a desired hit order, flagging hits based on desired criteria, sorting the hits by flag and appending gene identifications from a database, e.g., UCSC GoldenPath, in parallel.
[0038] Figure 7 illustrates a sample of gene identifications for 10 signature sequences.
[0039] Figure 8 Panels A, B, C and D is a flow-chart illustrating a differential gene expression system according to the specific embodiments of the present invention. Panel 8A illustrates steps to select non-ambiguous signatures. Panel 8B illustrates steps to remove a signature that does not match a known sequence in a known genome. Panel 8C illustrates steps to analyze the samples by using a normal approximation test. Panel 8D illustrates steps to remove signatures that fail to meet a minimum abundance test.
[0040] Figure 9 Panels A and B illustrates the correlation between replicate sampling methods of 4-stepper data versus 4-stepper data using MPSS, as in Panel A, and illustrates the correlation between independent sampling methods of 4-stepper data versus 2-stepper data using MPSS as in Panel B.
DEFINITIONS
[0041] Before describing the present invention in detail, it is to be understood that this invention is not limited to particular compositions or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and appended claims, the singular forms "a", "an", and "the" include plural referents unless the content and context clearly dictates otherwise. Thus, for example, reference to "a processor" includes a combination of two or more such processors, and the like.
[0042] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the invention, the preferred materials and methods are described herein. In describing and claiming the present invention, the following terminology will be used in accordance with the definitions set out below.
[0043] The term "signature sequence" refers to any identification method that uniquely identifies a specific nucleic acid, e.g., mRNA molecule, cDNA, and the like. Signature sequences are also called "signatures." Examples of signature sequences include nucleotide sequences of 17 or more bases generated, e.g., from MPSS. Signature sequences also include nucleotide sequence tags generated from SAGE techniques.
[0044] The term "abundance estimate" refers to the frequency of a single type of mRNA molecule relative to all the mRNA molecules in a sample. For example, the abundance estimate of any single gene in a dataset is calculated by dividing the number of signature sequences from that gene by the total number of signature sequences for all mRNAs present in the dataset.
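The abundance estimate defined in paragraph [0044] is a simple relative frequency. A minimal sketch, assuming the dataset is available as a flat list with one entry per sequenced signature (the function name `abundance_estimates` is hypothetical):

```python
from collections import Counter


def abundance_estimates(signatures):
    """Map each unique signature to its count divided by the dataset total."""
    counts = Counter(signatures)
    total = sum(counts.values())
    return {signature: n / total for signature, n in counts.items()}


# Four sequenced signatures, two of them identical.
estimates = abundance_estimates(["GATCAAAA", "GATCAAAA", "GATCCCCC", "GATCGGGG"])
```

The values returned for one sample correspond to the p1 (or p2) of each signature in the pairwise comparison, with the dataset total playing the role of n1 (or n2).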
[0045] The term "dataset" refers to data that is generated by a method for gene expression analysis using a sample from an organism. As described in further detail below, any of a variety of gene expression analyses can be used to generate the dataset. A dataset includes a set of sequences, or a "population of sequences" representing the genes that are expressed in the sample. A dataset also includes "associated values," e.g., representing the level of expression of each gene/sequence.
[0046] The term "independent sampling procedure(s)" refers to one or more procedures for estimating the abundance of expressed gene products, e.g. cDNA or polypeptides. For example, with MPSS, 2-stepper and 4-stepper datasets are generated via independent sampling procedures. Similarly, MPSS and SAGE are independent procedures for sampling cDNA abundance. Gene expression data produced with different kinds of microarrays, e.g. those from Affymetrix vs. spotted cDNA microarrays, are also generated via independent sampling procedures.
DETAILED DESCRIPTION [0047] The present invention provides methods, computer program products and systems to analyze expression of large sets of genes in parallel. Specifically, a pairwise comparison is performed with data generated from two samples using a statistical test of significance.
[0048] To determine differential gene expression in a sample compared to another sample(s), the level of expression of a given mRNA in a given sample is determined. For example, the level of expression of any single gene in a dataset is calculated by dividing the number of signatures from that gene by the total number of signatures for all mRNAs present in the dataset. The level of expression of a particular mRNA in one sample is statistically compared to the level of expression of the same particular mRNA in another sample. [0049] Figure 1 is a flow-chart illustrating a method of differential gene expression analysis. This figure illustrates a general embodiment that analyzes the data for changes in gene expression using a pairwise comparison of a first sample and a second sample using a statistical test of significance. The steps include collecting data, e.g., via one or more independent sampling procedures with or without replicates, from a plurality of samples (Step 100), e.g., collecting data can include accessing data that is currently being generated, or e.g., accessing data that is stored, e.g., in a database; importing the collected data (Step 102), and analyzing the data for differential gene expression (Step 104) using a statistical test of significance, e.g., a normal approximation, a chi-square test, a Fisher Exact test, a generalized linear model, Bayesian methods and/or the like.
[0050] As mentioned above, a variety of statistical tests of significance can be used, for example, a two-tailed normal approximation test, a chi-square test, a Fisher Exact test, a generalized linear model, Bayesian methods and/or the like. In one embodiment, the statistical test of significance is a normal approximation test, which includes a two-tailed test using a first equation:
$$\lambda = \frac{p_1 - p_2}{\sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
wherein the n1 is a number of mRNA molecules or cloned cDNA copies sequenced that represents (generates) the population of signature sequences for the first sample, and the n2 is a number of mRNA molecules or cloned cDNA copies sequenced that represents (generates) the population of signature sequences for the second sample, wherein abundance of an individual signature sequence in the first sample is represented by the x1 and the abundance of the individual signature sequence in the second sample is the x2, and wherein the p1 and the p2 are represented by a second equation and a third equation:
$$p_1 = \frac{x_1}{n_1}, \quad \text{and} \quad p_2 = \frac{x_2}{n_2};$$
wherein the p is represented by a fourth equation:
$$p = \frac{x_1 + x_2}{n_1 + n_2};$$
wherein the q is represented by a fifth equation:
$$q = 1 - p; \quad \text{and,}$$
wherein the n1 and n2 are large, e.g., greater than about 150,000, e.g., greater than about 200,000, e.g., greater than about 300,000 mRNA molecules or cloned cDNA copies, and typically, about 300,000 to about 10,000,000 mRNA molecules or cloned cDNA copies. [0051] Typically, x1 and x2 represent the abundance of a specific signature in samples 1 and 2, respectively, and n1 and n2 represent the total number of signatures generated for all mRNAs in samples 1 and 2. As a result, the proportions p1 = x1/n1 and p2 = x2/n2 follow a binomial distribution. Because n1 and n2 are large, e.g., in MPSS, the difference (p1 - p2) follows an approximate normal distribution defined by a first expression:
$$N\left((p_1 - p_2),\ \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\right)$$
where the unknown parameters p and q can be estimated as p = (x1 + x2)/(n1 + n2) and q = 1 - p, respectively.
[0052] The level of significance of the test is determined by calculating the probability of observing a larger λ by chance. This probability is calculated by using the following equation:
$$\Pr(x > \lambda) = \frac{1}{\sqrt{2\pi}} \int_{\lambda}^{\infty} e^{-x^{2}/2}\, dx$$
where x is the value of the concerned variable under a standard normal distribution. When Pr(x>λ) is equal to or smaller than a predetermined level of significance, e.g., 0.05, e.g., 0.01, or e.g., 0.001, then a significant differential expression between a first sample and a second sample is declared for that particular significance level. This test is also known as a Z-test. See, e.g., Kal et al., (1999), Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast growth on two different carbon sources, Mol. Biol. Cell, 10:1859-1872. [0053] An inverse relationship exists between the level of expression and the size of the difference that can be evaluated between samples. For example, at p<0.001, a 2-fold change for a gene that is expressed at a level of about 30-40 copies per million can be detected. For genes that are expressed at a higher abundance, a substantially smaller difference can be detected. A difference of about 40% can be ascertained for genes that are expressed at about 200 copies per million. See, e.g., Figure 2, which illustrates the inverse relationship between the level of gene expression and fold differences that can be detected by the normal approximation test at three levels of significance.
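The normal approximation test described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name and the two_tailed flag are hypothetical, the absolute value of the difference is taken so that the probability does not depend on which sample is labeled first, and the standard normal tail is computed with the complementary error function.

```python
import math

def normal_approximation_test(x1, n1, x2, n2, two_tailed=False):
    """Return (lam, probability) for one signature compared across two samples.

    x1, x2: counts of the signature in samples 1 and 2.
    n1, n2: total signatures generated for samples 1 and 2.
    """
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)          # pooled proportion
    q = 1.0 - p
    se = math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 0.0, 1.0                # no variation: nothing to test
    lam = abs(p1 - p2) / se            # assumption: absolute difference
    upper_tail = 0.5 * math.erfc(lam / math.sqrt(2.0))   # Pr(x > lam), x ~ N(0, 1)
    prob = min(1.0, 2.0 * upper_tail) if two_tailed else upper_tail
    return lam, prob

# Hypothetical example: 30 vs. 60 copies among 1,000,000 signatures per sample.
lam, prob = normal_approximation_test(30, 1_000_000, 60, 1_000_000)
```

With 30 versus 60 observed copies of a signature among 1,000,000 signatures per sample (a 2-fold change), this sketch gives a λ of about 3.16 and an upper-tail probability below 0.001, in line with the sensitivity described in paragraph [0053].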
[0054] Figure 3, Panels A and B, is a flow chart illustrating a method of differential gene expression analysis of MPSS data using a normal approximation test as the statistical test of significance, as described in Steps 300-312. In Panel A, Step 300, a significance level is chosen, e.g., 0.05, 0.01, or 0.001. In Step 302, the proportions p1 and p2 are calculated from x1 and x2 for each stepper dataset, e.g., a 2-stepper and a 4-stepper. The sum of the proportions p1 and p2 for the 2-stepper and the sum of the proportions p1 and p2 for the 4-stepper are compared in Step 304. If the sum of proportions of the 2-stepper is greater than that of the 4-stepper, then the statement is true and the 2-stepper dataset is used, as in Step 305. If the sum of proportions of the 2-stepper is less than that of the 4-stepper, then the statement is false and the 4-stepper dataset is used, as in Step 306. In Panel B, Step 308, the probability differential value is calculated. If this value is greater than the significance level, the statement is true, as in Step 311, which indicates no differential expression. If this value is less than the significance level, the statement is false, as in Step 312, which indicates differential expression.
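The flow of Figure 3 can be sketched as below. The function names and the tuple layout (x1, n1, x2, n2) are hypothetical. Two points are assumptions, since the flow chart text does not settle them: a tie between the two sums of proportions falls through to the 4-stepper, and a probability exactly equal to the significance level is treated as significant, following paragraph [0052].

```python
import math

def upper_tail_probability(x1, n1, x2, n2):
    """Pr(x > lam) under a standard normal, using the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 1.0
    lam = abs(x1 / n1 - x2 / n2) / se
    return 0.5 * math.erfc(lam / math.sqrt(2.0))

def choose_stepper(two_stepper, four_stepper):
    """Steps 302-306: pick the dataset with the larger p1 + p2."""
    def proportion_sum(data):
        x1, n1, x2, n2 = data
        return x1 / n1 + x2 / n2
    if proportion_sum(two_stepper) > proportion_sum(four_stepper):
        return "2-stepper", two_stepper
    return "4-stepper", four_stepper   # assumption: ties use the 4-stepper

def is_differentially_expressed(two_stepper, four_stepper, significance=0.001):
    """Steps 300-312: returns (chosen stepper, probability, decision)."""
    name, data = choose_stepper(two_stepper, four_stepper)
    prob = upper_tail_probability(*data)
    return name, prob, prob <= significance

# Hypothetical counts for one signature in two samples, per stepper dataset.
result = is_differentially_expressed((30, 1_000_000, 60, 1_000_000),
                                     (10, 1_000_000, 12, 1_000_000))
```

Here the 2-stepper dataset has the larger sum of proportions, so it is the one tested.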
[0055] Other statistical tests can also be used, e.g., a chi-squared test, e.g., a Fisher Exact test, e.g., a generalized linear model, e.g., Audic and Claverie's Bayesian method and the like. A chi-square test calculates a probability value for the relationship between two dichotomous variables. It is an estimated value of the true probability value. Typically, a chi-squared test is used to statistically analyze the data when the sample size (n) is greater than 5. Chi-square tests can also be extended to cover situations where there are more than two categories of possible outcome. Information on statistical tests can be found in a variety of places, such as textbooks, papers and the World Wide Web. For example, see Fisher and van Belle, (1993) Biostatistics: a Methodology for the Health Science. John Wiley & Sons, New York; Man et al., (2000) POWER SAGE: comparing statistical tests for SAGE experiments, Bioinformatics, 16(11): 953-959; and, Audic and Claverie, (1997) The significance of digital gene expression profiles, Genome Research, 7:986-995.
[0056] As mentioned above, a Fisher Exact test can also be used. The Fisher Exact test calculates an exact probability value for the relationship between two dichotomous variables. It works in a similar manner as a chi-squared test, except that a Fisher Exact test calculates an exact probability rather than an estimate of the true probability value generated from a chi-squared test. Typically, the Fisher Exact test is used when the sample size is small.
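The two alternative tests can be illustrated on a 2x2 table of counts (signature present or absent, sample 1 or sample 2). The following is a sketch only: the function names are hypothetical, the chi-square statistic is Pearson's form without continuity correction, and the Fisher Exact probability uses the common two-sided convention of summing all tables no more probable than the observed one. Neither choice is specified in the text above.

```python
import math

def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic (1 d.f.) and its estimated probability."""
    present, absent = x1 + x2, (n1 + n2) - (x1 + x2)
    if present == 0 or absent == 0:
        return 0.0, 1.0
    total = n1 + n2
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    rows, cols = [n1, n2], [present, absent]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, Pr(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

def fisher_exact_2x2(x1, n1, x2, n2):
    """Two-sided Fisher Exact probability; practical for small counts only."""
    k = x1 + x2                                   # total copies of the signature
    denom = math.comb(n1 + n2, k)
    def table_probability(a):                     # hypergeometric probability
        return math.comb(n1, a) * math.comb(n2, k - a) / denom
    observed = table_probability(x1)
    lo, hi = max(0, k - n2), min(k, n1)
    return sum(table_probability(a) for a in range(lo, hi + 1)
               if table_probability(a) <= observed * (1 + 1e-9))

# Hypothetical small-sample example: 1 of 10 versus 6 of 10.
chi_stat, chi_prob = chi_square_2x2(1, 10, 6, 10)
fisher_prob = fisher_exact_2x2(1, 10, 6, 10)
```

For this small table the exact probability (about 0.057) is noticeably larger than the chi-square estimate (about 0.019), which illustrates why the exact test is preferred when the sample size is small.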
[0057] Various types of data can be generated from differential gene expression experiments. For example, the data can be categorical data, e.g., data generated from MPSS. Data are categorical data when a signature sequence in a particular dataset is either present at a certain proportion or absent. Other types of data optionally used in the present invention include, e.g., continuous data, normalized data, analog data, e.g., data generated from microarrays (where the results are represented by a ratio of the fluorescent levels of two probes) and the like. [0058] Data can be collected from several different samples and a plurality of sets of pairwise comparisons can be imported into a database, e.g., a relational database, and the data can then be mined for biologically meaningful changes in gene expression. In one embodiment, the data are collected from a central database and the data are stored on a local system. In another embodiment, values generated, e.g., the probability of differential expression value, the first parameter, the second parameter, and/or the like, are stored in a storage medium.
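One way the pairwise comparison results of paragraph [0058] might be held in a relational database is sketched below using an in-memory SQLite database. The table name, column names, sample labels and stored values are all hypothetical. The stepper and direction columns stand for the first and second parameters, whose values are defined in the Parameters section of this description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pairwise_comparison (
        signature   TEXT NOT NULL,
        sample_1    TEXT NOT NULL,
        sample_2    TEXT NOT NULL,
        probability REAL NOT NULL,   -- probability of differential expression value
        stepper     TEXT,            -- first parameter: 'T' or 'F'
        direction   INTEGER,         -- second parameter: 1, 2, 4, 14 or 24
        PRIMARY KEY (signature, sample_1, sample_2)
    )
""")

# Hypothetical results for three signatures in one pairwise comparison.
rows = [
    ("GATC" + "A" * 13, "untreated", "treated", 0.0008, "T", 2),
    ("GATC" + "C" * 13, "untreated", "treated", 0.4000, "F", 1),
    ("GATC" + "G" * 13, "untreated", "treated", 0.0002, "T", 4),
]
conn.executemany("INSERT INTO pairwise_comparison VALUES (?, ?, ?, ?, ?, ?)", rows)

def significant_signatures(connection, level=0.001):
    """Mine for signatures that are significant with a consistent direction."""
    cursor = connection.execute(
        "SELECT signature, direction FROM pairwise_comparison "
        "WHERE probability <= ? AND direction IN (1, 2) ORDER BY signature",
        (level,),
    )
    return cursor.fetchall()

hits = significant_signatures(conn)
```

The query keeps only signatures whose probability meets the chosen level and whose direction value indicates a consistent change, which is one plausible way to mine the stored comparisons.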
[0059] A wide variety of samples can be compared in the present invention. Comparison of the two different samples can yield information regarding the expression pattern, e.g., attributable to change in environment, attributable to differences in a parent and a child, attributable to a state of differentiation, attributable to a treatment with an agent(s), attributable to a disease state, attributable to an injured state, attributable to differences in species, attributable to different cell types, attributable to different times, attributable to disparate species, attributable to disparate organs and the like. Basically, any two samples that differ from one another can be compared. Examples include but are not limited to a first cell type and a second cell type, a first normal sample and a second other than normal sample, a first species sample and a second species sample, a first recombinant inbred line and a second recombinant inbred line, a first state sample and a second state sample, a first treated sample and a second untreated sample, a first infected sample and a second uninfected sample, a first disease-state sample and a second non-disease state sample, a first injured state sample and a second non-injured state sample, a first developmental stage sample and a second different developmental stage sample, and the like.
Technology to Generate Data
[0060] A variety of techniques can be used to generate the data in the present invention. Typically, massively parallel signature sequencing is used to generate the data used in differential gene expression. Other examples include SAGE data, microarrays and cDNA fragment profiling methods. See, e.g., Brenner et al., (2000), Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nature Biotech., 18:630-634; Tyagi, (2000), Taking a census of mRNA populations with microbeads, Nature Biotech., 18:597-598; Brenner et al., (2000) In vitro cloning of complex mixtures of DNA on microbeads: Physical separation of differentially expressed cDNAs, PNAS USA 97: 1665-1670; Okubo et al., (1992), Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression, Nature Genetics, 2:173-179; Bachem et al., (1996) Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development, Plant J., 9:745-753; Nelson M, et al., (1993) Sequencing two DNA templates in five channels by digital compression, PNAS (US), 90(5): 1647-51; and Shimkets et al., (1999) Gene expression analysis by transcript profiling coupled to database query, Nature Biotechnology, 17:798-803.
[0061] Massively parallel signature sequencing (MPSS) is one technology available for in-depth quantitative expression profiling. It is designed for large-scale counting of individual mRNA molecules in a sample. MPSS provides data for all genes in a tissue or cell sample, not just those that have been previously identified and characterized. No prior knowledge of a gene's sequence is required for MPSS; thus, gene expression datasets can be generated from any organism. In addition, MPSS has a high sensitivity level. Anywhere from about 100,000 to about ten million molecules are typically counted in any given sample, so that even genes that are expressed at low levels can be quantified with high accuracy. Typically, an MPSS dataset involves from greater than, e.g., about 100,000 signature sequences to about 750,000 signature sequences. Two flow cells with microbeads initiated with either of two different initiating adaptors can be used for each experiment, e.g., a 2-stepper and 4-stepper as described above. Therefore, datasets containing from about 200,000 to about 1,400,000 signature sequences can be generated for any given sample. The data from multiple MPSS experiments can optionally be combined. [0062] MPSS is a "digital" gene expression tool that counts all mRNA molecules simultaneously. Counting mRNAs with MPSS is based on the ability to uniquely identify every mRNA in a sample. This is done by generating a sequence of 17 or more bases for each mRNA at a specific site upstream from its poly(A) tail (e.g., the last DpnII site in double stranded cDNA). The sequence of 17 or more bases is then used as an mRNA identification "signature." To measure the level of expression of any given gene in a sample analyzed by MPSS, the total number of signatures for that gene's mRNA is counted.
[0063] MPSS signatures for mRNAs in a sample are generated by sequencing double stranded cDNA fragments cloned onto microbeads using the Lynx Megaclone technology. A clone refers to a single microbead from which 17 or more bases have been sequenced to create a signature sequence tag from an individual cDNA molecule that has been cloned into the Megaclone library. Fragments from 100,000-10,000,000 individual cDNA molecules from a sample are cloned onto 100,000-10,000,000 separate microbeads using, e.g., the procedure described in Brenner et al., supra, PNAS, thereby making a Megaclone library of cloned cDNA fragments.
[0064] MPSS and microbead technology is further described in the following patents and references cited within: U.S. Patent No. 6,306,597 to Macevicz entitled "DNA sequencing by parallel oligonucleotide extensions" issued October 23, 2001; U.S. Patent No. 6,280,935 to Macevicz entitled "Method of detecting the presence or absence of a plurality of target sequences using oligonucleotide tags" issued August 28, 2001; U.S. Patent No. 6,265,163 to Albrecht et al., entitled "Solid phase selection of differentially expressed genes" issued July 24, 2001; U.S. Patent No. 6,235,475 to Brenner et al., entitled "Oligonucleotide tags for sorting and identification" issued May 22, 2001; U.S. Patent No. 6,228,589 to Brenner entitled "Measurement of gene expression profiles in toxicity determination" issued May 8, 2001; U.S. Patent No. 6,175,002 to DuBridge et al., entitled "Adaptor-based sequence analysis" issued January 16, 2001; U.S. Patent No. 6,172,218 to Brenner entitled "Oligonucleotide tags for sorting and identification" issued January 9, 2001; U.S. Patent No. 6,172,214 to Brenner entitled "Oligonucleotide tags for sorting and identification" issued January 9, 2001; U.S. Patent No. 6,150,516 to Brenner et al., entitled "Kits for sorting and identifying polynucleotides" issued November 21, 2000; U.S. Patent No. 6,140,489 to Brenner entitled "Compositions for sorting polynucleotides" issued October 31, 2000; U.S. Patent No. 6,138,077 to Brenner entitled "Method, apparatus and computer program product for determining a set of non-hybridizing oligonucleotides" issued on October 24, 2000; U.S. Patent No. 6,013,445 to Albrecht et al., entitled "Massively parallel signature sequencing by ligation of encoded adaptors" issued January 11, 2000; U.S. Patent No. 5,962,228 to Brenner entitled "DNA extension and analysis with rolling primers" issued October 5, 1999; U.S. Patent No. 5,888,737 to DuBridge et al., entitled "Adaptor-based sequence analysis" issued March 30, 1999; U.S. Patent No. 5,780,231 to Brenner entitled "DNA extension and analysis with rolling primers" issued July 14, 1998; U.S. Patent No. 5,750,341 to Macevicz entitled "DNA sequencing by parallel oligonucleotide extensions" issued May 12, 1998; U.S. Patent No. 5,747,255 to Brenner entitled "Polynucleotide detection by isothermal amplification using cleavable oligonucleotides" issued May 5, 1998; U.S. Patent No. 5,969,119 to Macevicz entitled "DNA sequencing by parallel oligonucleotide extensions" issued October 19, 1999; U.S. Patent No. 5,863,722 to Brenner entitled "Method of sorting polynucleotides" issued January 26, 1999; U.S. Patent No. 5,846,719 to Brenner et al. entitled "Oligonucleotide tags for sorting and identification" issued December 8, 1998; U.S. Patent No. 5,763,175 to Brenner entitled "Simultaneous sequencing of tagged polynucleotides" issued June 9, 1998; U.S. Patent No. 5,695,934 to Brenner entitled "Massively Parallel sequencing of sorted polynucleotides" issued December 9, 1997; U.S. Patent No. 5,635,400 to Brenner entitled "Minimally cross-hybridizing sets of oligonucleotide tags" issued June 3, 1997; and, U.S. Patent No. 5,604,097 to Brenner entitled "Methods for sorting polynucleotides using oligonucleotide tags" issued February 19, 1997.
[0065] The DNA is sequenced through an automated series of adaptor ligations and enzymatic steps. Two procedures typically used, e.g., as independent sampling procedures, involve either a 4-stepper or a 2-stepper, which differ by using two alternative reading-frame adaptors. For example, in a 4-stepper, the process is initiated by ligating an adaptor molecule to the GATC (DpnII) single-stranded overhangs, and then digesting the samples with BbvI, which is a type IIS restriction enzyme that cuts the DNA at a position 9-13 nucleotides away from the recognition sequence. This produces molecules with a 4-base single-stranded overhang immediately adjacent to the DpnII recognition sequence. Another set of adaptors, called encoded adaptors, is hybridized and ligated to the 4-base overhangs on each molecule. The encoded adaptors contain a 4-base single-stranded overhang with all possible nucleotide combinations at one end, and a single-stranded coded sequence at the other end. One member of the encoded adaptor set will find a partner on the DNA molecules attached to the beads in the flow cell. The exact sequence of each encoded adaptor that hybridizes to the DNA on a microbead is decoded through 16 different sequential hybridization reactions with a set of fluorescent decoder probes. This process yields the first 4 nucleotides at the end of each molecule. To collect additional sequence, the encoded adaptor from the first round is removed by digestion with BbvI, and the process is repeated several times. In the end, a signature sequence of 17 or more bases is generated for each bead in the flow cell. In a 2-stepper, the process is the same as described above for a 4-stepper, but sequence is obtained in a different reading frame that is staggered by two bases compared to the 4-stepper.
[0066] Specifically, in a 2-stepper protocol, the recognition site for the type IIS restriction enzyme, e.g., BbvI, used to expose the first four nucleotides to identify the signature sequence, is located 11 nucleotides from the GATC site at the end of the adaptor. In the 4-stepper protocol, the recognition site for the type IIS restriction enzyme, e.g., BbvI, used to expose the first four nucleotides to identify the signature sequence, is located 9 nucleotides from the GATC site at the end of the adaptor. The difference between the 2-stepper protocol and the 4-stepper protocol allows the choice of what overhang will be produced after the first restriction enzyme, e.g., BbvI, digestion. The datasets generated with the two different adaptors are different, because a different set of four-base overhangs will be generated for each signature sequence depending on whether a 2-stepper or 4-stepper protocol is used. Each exposed four-base overhang can potentially contain a palindromic structure, e.g., 16 of the 256 different possible four-base overhangs are palindromic. There can also be additional biases due to the relative efficiency of individual overhangs in the ligation processes involved during the sequencing cycles. The dataset generated and the biases make the 2-stepper and 4-stepper protocols independent sampling methods. See Figure 9, Panels A and B. Similar caveats can be encountered when, e.g., comparing SAGE to MPSS or Affymetrix chips to conventionally spotted cDNA arrays. In one embodiment, the data analysis of the present invention takes into consideration the inconsistencies that can exist in large datasets obtained with independent sampling procedures.
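The count of palindromic overhangs and the effect of a staggered reading frame can be checked with a short sketch. The helper names are hypothetical. A four-base overhang is treated as palindromic when it equals its own reverse complement, and the frame offsets 0 and 2 are purely illustrative of a two-base stagger; they are not the actual cut positions of either protocol.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

# All 4**4 = 256 possible four-base overhangs.
all_overhangs = ["".join(bases) for bases in product("ACGT", repeat=4)]

# Overhangs that can pair with themselves (self-complementary).
palindromic = [s for s in all_overhangs if s == reverse_complement(s)]

def overhang_frames(sequence, offset):
    """Split a sequence into successive four-base words starting at offset."""
    return [sequence[i:i + 4] for i in range(offset, len(sequence) - 3, 4)]

# Hypothetical 16-base stretch read in two frames staggered by two bases.
example = "ACGTACGTACGTACGT"
frame_a = overhang_frames(example, 0)
frame_b = overhang_frames(example, 2)
```

The enumeration gives 16 palindromic overhangs out of 256, matching the figure in paragraph [0066]. In the toy sequence, one frame exposes only ACGT (palindromic) while the staggered frame exposes only GTAC (also palindromic here, but a different word), showing how the same molecule presents different overhangs to the two protocols.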
[0067] Ligation-based sequencing is further described in the following patents and references cited within: U.S. Patent No. 5,714,330 to Brenner et al., entitled "DNA sequencing by stepwise ligation and cleavage" issued February 3, 1998; U.S. Patent No. 5,599,675 to Brenner entitled "DNA sequencing by stepwise ligation and cleavage" issued February 4, 1997; U.S. Patent No. 5,831,065 to Brenner entitled "Kits for DNA sequencing by stepwise ligation and cleavage" issued November 3, 1998; U.S. Patent No. 5,856,093 to Brenner entitled "Method of determining zygosity by ligation and cleavage" issued January 5, 1999; and, U.S. Patent No. 5,552,278 to Brenner entitled "DNA sequencing by stepwise ligation and cleavage" issued September 3, 1996.
[0068] When MPSS is used to generate the population of signature sequences, an individual signature sequence from the population of signature sequences is generated from one or more stepper procedures, e.g., a 2-stepper or a 4-stepper, as described above. In one aspect, the methods include choosing the stepper, e.g., 2-stepper or 4-stepper, dataset that produces a highest value of the individual signature sequence. In another embodiment, an individual signature sequence from the population of signature sequences is an average from more than one 2-steppers or from more than one 4-steppers. In further embodiments, a standard deviation for the average can be calculated.
[0069] Another technology that can be used is SAGE technology. SAGE is another transcript counting technique that generates a tag sequence for each mRNA. It also generates a digital gene expression profile. SAGE is based on the principles that a short sequence tag derived from a defined position in an mRNA can uniquely identify the transcript, and that concatenation of the tags allows for high-throughput sequencing. The length of the SAGE tag is about 10 to about 14 nucleotides. The tag sequence is determined using conventional sequencing technologies. See the following publications and references cited within: Velculescu et al., (1995), Serial analysis of gene expression, Science, 270:484-487; and Zhang et al., (1997), Gene expression profiles in normal and cancer cells, Science, 276:1268-1272. To determine the expression level of a gene with the SAGE technique, the frequency of a sequence tag derived from the corresponding mRNA transcript is measured. As with microarray data described below, adjustments to consider bias and normalization are optionally included in the present invention. See, e.g., Margulies et al., (2001) Identification and prevention of a GC content bias in SAGE libraries, Nucleic Acids Res., 29(12):E60-0.
[0070] Microarrays are also technologies that can be used in the present invention. Typically, a microarray is a solid support that contains a variety of genes. The mRNAs from the sample are then allowed to hybridize to the microarray. Microarrays have the advantage of high throughput analysis of multiple samples. Typically with microarray techniques, some or all of a variety of variables should be considered. First, the desired genes are represented on a given array. Second, a microarray exists for the organism of interest. Third, the detection sensitivity is optimized to achieve detection of low expressed genes. Fourth, a sample is compared with a control sample to compensate for several sources of bias and noise in the intensity results. Typically, the experiment is replicated several times to provide a more reliable dataset. Fifth, compensation is made for multiple values for a single gene, because multiple values can arise from, e.g., distinct probe sets within different sections within the gene. See Kerr and Churchill, (2001), Statistical design and the analysis of gene expression microarray data, Biostatistics, 2:183-201; Wodicka et al., (1997), Genome wide expression monitoring in Saccharomyces cerevisiae, Nature Biotech., 15:1359-1367; Lockhart et al., (1996), Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotech., 14: 1675-1680; Aach et al., Systematic management and analysis of yeast gene expression data, Genome Res., 10:431-445 and Wittes and Friedman, (1999) Searching for evidence of altered gene expression: a comment on statistical analysis of microarray data, J. Natl. Cancer Inst., 91:400-401.
[0071] More information can be found in the following publications and references cited within: Duggan et al., (1999), Expression profiling using cDNA microarrays, Nature Genetics, 21:10-14; Lipshutz et al., High density synthetic oligonucleotide arrays, Nature Genetics Suppl. 21:20-24; Evertsz et al., (2000), Technology and applications of gene expression microarrays, in Microarray Biochip technology, Schena, M., Ed., BioTechniques Books, Natick, MA, pp. 149-166; Lockhart and Winzeler, (2000), Genomics, gene expression and DNA arrays, Nature, 405:827-836; Zhou et al., (2000), Information processing issues and solutions associated with microarray technology, in Microarray Biochip technology, Schena, M., Ed., BioTechniques Books, Natick, MA, pp. 167-200; and Hughes et al., (2001), Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer, Nature Biotech., 19:342-347.
[0072] The techniques described above are examples of techniques that can be used to generate the data used in the present invention, where the methods include analyzing greater than about 150,000 mRNA molecules in a sample. The sequenced mRNA can be a partially or fully sequenced mRNA. The methods include those where the population of signature sequences represents, e.g., greater than about 150,000 mRNA molecules, e.g., greater than about 200,000 mRNA molecules, e.g., greater than about 300,000 mRNA molecules, e.g., greater than about 500,000 mRNA molecules, e.g., greater than about 750,000 mRNA molecules, or e.g., greater than about 1,000,000 mRNA molecules, in a sample.
[0073] In a further embodiment, one or more abundance estimates for an individual signature sequence from the population of signature sequences are generated from one or more independent sampling procedures, e.g., separate MPSS "stepper" datasets. In one embodiment, one stepper dataset is generated by a 2-stepper procedure ("2-stepper") and another stepper dataset is generated by a 4-stepper procedure ("4-stepper"). In another aspect, the methods include choosing the independent sampling procedure, e.g., the stepper dataset, which produces the highest value of the individual signature sequence. In another embodiment, an individual signature sequence from the population of signature sequences is an average from more than one 2-stepper or from more than one 4-stepper. In further embodiments, a standard deviation for the average can be calculated.
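Averaging replicate stepper runs and choosing the stepper dataset with the highest value can be sketched as follows. The function names and replicate values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form is intended.

```python
from statistics import mean, stdev

def replicate_summary(values):
    """Average and sample standard deviation of replicate abundance values."""
    average = mean(values)
    spread = stdev(values) if len(values) > 1 else 0.0
    return average, spread

def best_stepper(two_stepper_runs, four_stepper_runs):
    """Pick the stepper whose replicate average is highest."""
    summaries = {
        "2-stepper": replicate_summary(two_stepper_runs),
        "4-stepper": replicate_summary(four_stepper_runs),
    }
    name = max(summaries, key=lambda key: summaries[key][0])
    return name, summaries[name]

# Hypothetical abundance values (copies per million) for one signature.
chosen = best_stepper([40.0, 44.0, 42.0], [30.0, 34.0])
```

In this example the three 2-stepper runs average 42.0 with a standard deviation of 2.0, so the 2-stepper value is the one carried forward.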
Parameters [0074] Assigning parameters to a signature sequence is optionally included in the present invention. For example, the methods of the present invention can further include a comparison test of the p1 and p2 to assign a set of parameters to the samples being analyzed. The comparison test includes the following steps: (1) choosing either the data from a first independent sampling procedure or the data from a second independent sampling procedure based on the procedure that gave a higher total value, where the higher total value is the p1 plus the p2, thereby providing the chosen data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both the first and second procedures. For example, the steps can include: (1) choosing either the data from the first and the second sample generated by a 2-stepper or the data from the first and the second sample generated by a 4-stepper based on the procedure that gave a higher total stepper value, where the higher total stepper value is the p1 plus the p2, thereby providing the chosen stepper data; and (2) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen stepper data for a direction of change, where the direction of change is an upward change, a downward change or an inconsistent change when comparing data generated by both stepper procedures.
[0075] A first parameter is assigned to the chosen stepper data, where the first parameter is a T when the 2-stepper is used for the comparison test or is an F when the 4-stepper is used for the comparison test. This first parameter provides a stepper value. The first parameter is optionally stored in a database. This is illustrated in the flow chart in Figure 4.
[0076] A second parameter is assigned to each signature sequence from the population of signature sequences, where the second parameter is 1, 2, 4, 14, or 24. For example, a 1 is assigned when a downward change is found in the comparison test. A 2 is assigned when an upward change is found in the comparison test. A 4 is assigned when an inconsistent change is found in the comparison test. A 14 is assigned when a downward change is found in the comparison test and only one stepper dataset is used, e.g., either a 2-stepper procedure or a 4-stepper procedure, and a 24 is assigned when an upward change is found in the comparison test and only one of the stepper datasets is used. [0077] In one embodiment, a 1 is assigned when the downward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the downward change. A 2 is assigned when the upward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the upward change. A 4 is assigned when the direction of change is inconsistent. A 14 is assigned when the downward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used. A 24 is assigned when the upward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used.
[0078] In another embodiment, a 1 is assigned when a downward change is found in the comparison test, and more than one independent sampling method, e.g., 2-stepper and 4-stepper, exhibits the downward change. A 2 is assigned when an upward change is found in the comparison test, and more than one independent sampling method, e.g., 2-stepper and 4-stepper, exhibits the upward change. A 4 is assigned when a change is found in the comparison test, and when the direction of change among the independent sampling methods, e.g., 2-stepper and 4-stepper, is inconsistent. A 14 is assigned when a downward change is found in the comparison test and when data from only one independent sampling procedure, e.g., only one stepper dataset, can be used, and a 24 is assigned when an upward change is found in the comparison test and when data from only one independent sampling procedure, e.g., only one of the stepper datasets, can be used.
[0079] This second parameter provides a direction value and, typically, can provide a way of identifying the gene expression differences that are consistently observed with independent sampling methods. The second parameter is optionally stored in a database.
[0080] Figure 5 illustrates a method of assigning parameter 2, the direction value. The steps include Steps 500-512. In Step 500, the p1 value from a first sample is compared to the p2 value in the second sample. If there is an inconsistent change between the values, e.g., a downward change when examining the 2-stepper data and an upward change when examining the 4-stepper data, the signature sequence is assigned a 4 as a direction value. If there is not an inconsistent change, the direction of the change is determined as in Step 502. If p1 > p2 is true, there is a downward change and Step 504 is followed. In Step 504, data are separated on the basis of whether one or more stepper procedures can be used. If only one stepper procedure, e.g., either a 2-stepper or a 4-stepper, is used in Step 502, e.g., due to a zero value in both the p1 and p2 of one of the steppers, the statement in Step 504 is true and a number 14 is assigned to the parameter 2 as in Step 506. If the statement in Step 504 is false, more than one stepper procedure can be used and a number 1 is assigned to the parameter, as in Step 507. If p1 > p2 is false in Step 502, there is an upward change and Step 508 is followed. In Step 508, data are separated on the basis of whether one or more stepper procedures can be used. If only one stepper procedure, e.g., either a 2-stepper or a 4-stepper, is used in Step 502, e.g., due to a zero value in both the p1 and p2 of one of the steppers, the statement in Step 508 is true and a number 24 is assigned to the parameter 2 as in Step 510. If the statement in Step 508 is false, more than one stepper procedure can be used and a number 2 is assigned to the parameter, as in Step 512.
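The parameter assignment described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of Figures 4 and 5: the function names, the tuple layout of the inputs, the tie-breaking rule when both steppers give the same total, and the error raised when neither stepper has data are all hypothetical.

```python
from typing import Optional, Tuple

Pair = Tuple[float, float]  # (p1, p2) for one stepper dataset


def direction(p1: float, p2: float) -> Optional[str]:
    """Return 'down' when p1 > p2, otherwise 'up'; None when the dataset is unusable."""
    if p1 == 0 and p2 == 0:
        return None  # zero in both samples: this stepper cannot be used
    return "down" if p1 > p2 else "up"


def assign_parameters(two_stepper: Pair, four_stepper: Pair) -> Tuple[str, int]:
    """Return (stepper value, direction value) for one signature sequence."""
    d2, d4 = direction(*two_stepper), direction(*four_stepper)
    if d2 is None and d4 is None:
        raise ValueError("no usable stepper data")  # assumption: not covered in the text

    # Parameter 1: 'T' (2-stepper) or 'F' (4-stepper), whichever has the higher p1 + p2.
    # Assumption: a tie is resolved in favor of the 2-stepper.
    stepper = "T" if sum(two_stepper) >= sum(four_stepper) else "F"
    chosen = two_stepper if stepper == "T" else four_stepper

    # Parameter 2: 4 for an inconsistent change between the two procedures.
    if d2 is not None and d4 is not None and d2 != d4:
        return stepper, 4

    only_one = d2 is None or d4 is None
    if direction(*chosen) == "down":
        return stepper, 14 if only_one else 1
    return stepper, 24 if only_one else 2
```

For instance, `assign_parameters((0.001, 0.004), (0.0, 0.0))` yields `("T", 24)` in this sketch, because only the 2-stepper dataset is usable and its change is upward.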
Filtering Data
[0081] The methods of the present invention optionally include filtering the data. In one embodiment, the filtering step occurs before the data analysis step. For example, the methods can further include filtering the population of signature sequences by removing sequences from the population of signature sequences that contain one or more ambiguous nucleotides.
[0082] In one embodiment, the data can be filtered by removing signature sequences that fail to meet a minimum abundance test. The minimum abundance test includes removing signature sequences that, when analyzing the samples with a normal approximation test, cannot be distinguished from zero. For example, an individual signature sequence with the highest abundance across all MPSS runs that is less than about 4 per million is not significantly different from zero when a p<0.05 significance level is chosen, and e.g., an abundance of 6 is not significantly different from zero when a p<0.01 significance level is chosen, and e.g., an abundance of 11 is not significantly different from zero when a p<0.001 significance level is chosen.
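The two filters described above can be sketched as follows. This is an illustrative sketch only: the names are hypothetical, any character other than A, C, G or T is assumed to be an ambiguous nucleotide, abundances are assumed to be expressed per million, and a signature is assumed to pass only when its highest abundance strictly exceeds the quoted cutoff for the chosen significance level.

```python
from typing import Dict, List

# Cutoffs per million quoted in the text for each significance level.
CUTOFF_PER_MILLION = {0.05: 4, 0.01: 6, 0.001: 11}


def has_ambiguous_nucleotide(signature: str) -> bool:
    """Assumption: anything other than A, C, G or T counts as ambiguous."""
    return any(base not in "ACGT" for base in signature.upper())


def passes_minimum_abundance(abundances: List[float], alpha: float = 0.05) -> bool:
    """True when the highest abundance across runs exceeds the cutoff for alpha."""
    return max(abundances, default=0) > CUTOFF_PER_MILLION[alpha]


def filter_signatures(table: Dict[str, List[float]], alpha: float = 0.05) -> Dict[str, List[float]]:
    """Keep signatures with no ambiguous bases that pass the minimum abundance test."""
    return {
        signature: runs
        for signature, runs in table.items()
        if not has_ambiguous_nucleotide(signature)
        and passes_minimum_abundance(runs, alpha)
    }
```

Stricter significance levels raise the cutoff, so a signature retained at p<0.05 may be removed at p<0.001.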
[0083] In another embodiment, the filter removes sequences that do not match a known genome. For example, signature sequences can be matched to known genes by comparison with data available in genomic and/or EST sequence databases, e.g., the National Center for Biotechnology Information (NCBI), as described below. If no match is found, the signature sequence is removed from further analysis. Alternatively, if multiple matches are found, e.g. to a repeated sequence, the signature is removed from further analysis. Typically, signatures that match a genome sequence at about 16 or more nucleotides out of about 17 and have about 1-3 matches in the genome are retained for further analysis.
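The retention rule in the last sentence above can be expressed as a small predicate. The sketch below assumes that the genome hits for one signature have already been reduced to a list of matched-base counts; the function name and the default values (16 matched bases, at most 3 hits) simply restate the approximate figures given in the text.

```python
from typing import List


def retain_signature(matched_bases_per_hit: List[int],
                     min_matched: int = 16,
                     max_hits: int = 3) -> bool:
    """Retain a signature with 1 to max_hits genome hits of at least min_matched bases."""
    qualifying = [n for n in matched_bases_per_hit if n >= min_matched]
    return 1 <= len(qualifying) <= max_hits
```

A signature with no qualifying hit, or with more qualifying hits than the limit (for example a repeated sequence), is removed from further analysis.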
Gathering Data and Creating Tables
[0084] Methods of the present invention include generating tables for the data. For example, a table can include information regarding one sample, e.g., a list of unique signature sequences from the population of signature sequences correlated with an abundance estimate or level for each of the unique signature sequences found in each sample, and/or the probability of differential expression value calculated when a comparison is made with a different sample. Optionally, the table is generated from information regarding a plurality of samples.
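A table of this kind can be sketched with the Python standard library. The layout below (one row per unique signature, one abundance per sample, expressed per million) and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def abundance_table(samples: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Map each unique signature to its abundance per million in every sample."""
    counts = {name: Counter(reads) for name, reads in samples.items()}
    totals = {name: sum(c.values()) for name, c in counts.items()}
    signatures = sorted(set().union(*counts.values()))
    return {
        signature: {
            # Counter returns 0 for a signature that is absent from a sample.
            name: counts[name][signature] * 1_000_000 / totals[name] if totals[name] else 0.0
            for name in samples
        }
        for signature in signatures
    }
```

A probability of differential expression value for each pair of samples could be stored in the same row once the pairwise comparison has been run.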
[0085] The methods of the present invention also include methods of gathering information on a signature sequence. The method includes collecting data from a plurality of samples, each sample comprising a population of signature sequences that is generated from greater than about 150,000 mRNA molecules or cloned cDNA copies; importing the data into a database; gathering information on at least one unique signature sequence from the population of signature sequences from at least one other database or at least one web page from the Internet or a combination of both, thereby generating unique information for the at least one unique signature sequence; and providing the unique information for the at least one unique signature sequence to a user or an automated system. The other database can be a variety of databases, e.g., a sequence database, a gene annotation database, a publication database and/or the like. The World Wide Web and/or other computer networks or internetworks are also included in the present invention as a source to gather information on a signature sequence.
[0086] The methods of the present invention optionally include identifying at least one gene from the population of signature sequences and collecting information about at least one gene in a single computer file from distributed sites. For example, the population of signature sequences can be batch-blasted against a variety of databases, e.g., the various NCBI databases, e.g., NR, UniGene, EST and the like, using a network client, e.g., NCBI's network BLAST client 'blastcl3,' or using a local client, e.g., local client 'blastall,' to generate individual BLAST files of hits against each of the variety of databases. The different BLAST outputs can be read and further analyzed. For example, hits can be ordered, flagged based on desired criteria, sorted by the flag(s) and associated with gene identifications from the database(s).
[0087] Figure 6 illustrates a method of parsing multi-database hits, e.g., BLAST hits, in a desired hit order, flagging hits based on desired criteria, sorting the hits by flag and appending gene identifications from a database, e.g., UCSC GoldenPath, in parallel. In Step 600, a population of signature sequences is provided, e.g., an MPSS signature list. In Step 602, signature sequences are batch-blasted against various NCBI databases, e.g., NR Blast (BLAST). The desired number of matched bases includes a value that can identify possible gene matches, e.g., from about 9 matched bases to about the entire gene. Upon extracting matches for an individual signature sequence based on a desired number of matched bases, e.g., NR Blast output with all human exact (1-17 bases) matches, this output is further divided as in Steps 604 and 606. In Step 604, the matches from the NR Blast output are separated into a reference sequence category, e.g., RefSeq (Category 1), or separated into a chromosomal category, e.g., Chromosomal hits (Category 2), as in Step 606. In Step 608, signature sequences are batch-blasted against a different NCBI database(s), e.g., UniGene Blast (BLAST). Upon extracting matches for an individual signature sequence based on a desired number of matched bases, e.g., UniGene Blast output with all human exact (1-17 bases) matches, a third Category of hits is provided, e.g., UniGene (Category 3) hits, as in Step 610. In Step 612, for an individual signature sequence, all qualifying hits from each of the three categories, e.g., Category 1, Category 2 and Category 3, are combined in a desired order, e.g., RefSeq, UniGene and Chromosomal hits. Signature sequences are flagged based on desired criteria, e.g., number of different chromosomes with matches, total number of hits obtained, etc.
Step 613 illustrates an example of a flagging scheme where signature sequences are flagged with an "A" if the signature sequence has a unique chromosome hit or the total qualifying hits is less than a desired cutoff, or signature sequences are flagged with a "B" if the signature sequence does not fall into one of the descriptions in "A." The parsed hits can be sorted by flag, as in Step 614. In Step 616, known or predicted identified genes are verified and/or added via a database, e.g., UCSC GoldenPath, lookup for each parsed signature sequence. For example, parsed hits can be sorted by flags with simultaneous appending of annotation from UCSC GoldenPath for known genes from RefSeq, Acembly gene predictions with alternative-splicing, Ensembl gene predictions, Fgenesh++ gene predictions, and Genscan gene predictions, e.g., based on BLAST chromosomal hits, BAC clone hits or other hits. The signature sequence coordinates and orientation can be determined by accessing the corresponding UCSC Genome Browser page, and the most likely candidate gene(s) can be determined by matching orientations and coordinates of the signature sequence and the various gene candidates. In one embodiment, more weight is given to the gene(s) with the signature sequence located near their 3' ends in the same orientation. For RefSeq hits, additional annotation features are simultaneously appended from NCBI LocusLink from UniGene, OMIM, Gene Ontology, GeneCard, and the like, links. In one embodiment, for each RefSeq hit, a unique set of gene aliases are compiled from alternative titles and/or symbols entries in, e.g., OMIM and LocusLink, which aliases are then queried against, e.g., Entrez PubMed/MEDLINE for related publications from which are extracted, e.g., titles, abstracts, CAS registry numbers, MeSH terms, and the like, to create a knowledge base. 
When appropriate, for each MPSS signature sequence, various annotation features extracted as above are written to a database, e.g., relational, for querying and data mining. In Step 617, signature sequences that were excluded from the analysis, e.g., absence of 1-17 base matches, can be further analyzed by repeating Steps 600-616 once or multiple times based on a different, e.g., reduced, number of matched bases, e.g., 1-16 base matches in the BLAST output files, 1-15 base matches in the BLAST output files and/or the like. Figure 7 illustrates a sample of gene identifications for 10 signature sequences.
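The combining, flagging and sorting of Steps 612-614 can be sketched as follows. This is a minimal sketch under assumptions: hits are taken to be already parsed from the BLAST output into (accession, chromosome) pairs, "a unique chromosome hit" is read as exactly one distinct chromosome among the hits, and the cutoff of 5 total hits is a hypothetical value. The database lookups of Step 616 are not modeled.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, Optional[str]]  # (accession, chromosome or None); placeholder layout
CATEGORY_ORDER = ("RefSeq", "UniGene", "Chromosomal")  # desired order from Step 612


def combine_hits(by_category: Dict[str, List[Hit]]) -> List[Tuple[str, Hit]]:
    """Step 612: merge the qualifying hits of one signature in the desired order."""
    return [(cat, hit) for cat in CATEGORY_ORDER for hit in by_category.get(cat, [])]


def flag_signature(combined: List[Tuple[str, Hit]], max_hits: int = 5) -> str:
    """Step 613: 'A' for a unique chromosome hit or few total hits, otherwise 'B'."""
    chromosomes = {hit[1] for _, hit in combined if hit[1] is not None}
    return "A" if len(chromosomes) == 1 or len(combined) < max_hits else "B"


def parse_hits(all_hits: Dict[str, Dict[str, List[Hit]]], max_hits: int = 5):
    """Steps 612-614: return (flag, signature, combined hits) rows sorted by flag."""
    rows = []
    for signature, by_category in all_hits.items():
        combined = combine_hits(by_category)
        if combined:  # signatures without qualifying hits are left for Step 617
            rows.append((flag_signature(combined, max_hits), signature, combined))
    return sorted(rows, key=lambda row: row[0])  # stable: input order kept within a flag
```

Because the sort is stable, signatures that share a flag stay in their original order, which keeps repeated runs with a reduced number of matched bases comparable.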
[0088] A variety of methods for determining relationships between two or more sequences (e.g., identity, similarity) are available, and well known in the art, including manual alignment and computer assisted sequence alignment and analysis. A number of algorithms for performing sequence alignment are widely available, or can be produced by one of skill, including: the local homology algorithm of Smith and Waterman (1981) Adv. Appl. Math. 2:482; the homology alignment algorithm of Needleman and Wunsch (1970) J. Mol. Biol. 48:443; the search for similarity method of Pearson and Lipman (1988) Proc. Natl. Acad. Sci. (USA) 85:2444; and/or by computerized implementations of these algorithms (e.g., GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, WI).
[0089] For example, software for performing sequence identity (and sequence similarity) analysis using the BLAST algorithm, described in Altschul et al. (1990) J. Mol. Biol. 215:403-410, is publicly available through the National Center for Biotechnology Information (available on the World Wide Web at ncbi.nlm.nih.gov). This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always > 0) and N (penalty score for mismatching residues; always < 0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, a cutoff of 100, M=5, N=-4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see, Henikoff & Henikoff (1992) Proc. Natl. Acad. Sci. USA 89:10915).
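The extension step described above can be illustrated with a simplified, ungapped sketch for nucleotide sequences. It is not the NCBI implementation: the function names are hypothetical, the seed is assumed to be an exact word match of length W, and the default drop-off X of 20 is an arbitrary illustrative value. The defaults W=11, M=5 and N=-4 are the BLASTN values quoted in the text.

```python
from typing import Tuple


def extend_one_way(query: str, subject: str, qi: int, si: int, step: int,
                   start_score: int, M: int, N: int, X: int) -> Tuple[int, int]:
    """Extend in one direction; return (score gained, positions kept)."""
    score = best = start_score
    kept = walked = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):  # stop at either sequence end
        score += M if query[qi] == subject[si] else N
        walked += 1
        if score > best:
            best, kept = score, walked
        elif best - score >= X or score <= 0:
            break  # fell off by X from the maximum, or cumulative score reached zero
        qi += step
        si += step
    return best - start_score, kept


def extend_seed(query: str, subject: str, q_pos: int, s_pos: int,
                W: int = 11, M: int = 5, N: int = -4, X: int = 20) -> Tuple[int, int, int]:
    """Extend an exact word hit both ways; return (score, query start, query end)."""
    seed_score = W * M  # assumption: the seed word matches exactly
    right_gain, right_len = extend_one_way(
        query, subject, q_pos + W, s_pos + W, +1, seed_score, M, N, X)
    left_gain, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, seed_score, M, N, X)
    return seed_score + left_gain + right_gain, q_pos - left_len, q_pos + W + right_len
```

A smaller X stops the extension sooner after a mismatch, which trades sensitivity for speed, consistent with the role the text assigns to the parameters W, T and X.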
[0090] Additionally, the BLAST algorithm performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin & Altschul (1993) Proc. Nat'l. Acad. Sci. USA 90:5873-5877). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (p(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence (and, therefore, in this context, homologous) if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.1, or less than about 0.01, or even less than about 0.001.
[0091] Another example of a useful sequence alignment algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng & Doolittle (1987) J. Mol. Evol. 25:351-360. The method used is similar to the method described by Higgins & Sharp (1989) CABIOS 5:151-153. The program can align, e.g., up to 300 sequences of a maximum length of 5,000 letters. The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments. The program can also be used to plot a dendrogram or tree representation of clustering relationships. The program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison.
[0092] An additional example of an algorithm that is suitable for multiple DNA (or amino acid) sequence alignments is the CLUSTALW program (Thompson, J. D. et al. (1994) Nucl. Acids. Res. 22: 4673-4680). ClustalW performs multiple pairwise comparisons between groups of sequences and assembles them into a multiple alignment based on homology. Gap open and Gap extension penalties were 10 and 0.05 respectively. For amino acid alignments, the BLOSUM algorithm can be used as a protein weight matrix (Henikoff and Henikoff (1992) Proc. Natl. Acad. Sci. USA 89: 10915-10919).
[0093] As mentioned above, information gathered on a specific signature sequence can be obtained from the Internet, which comprises computers and computer networks that are interconnected through communication links. The interconnected computers exchange information using various services, such as electronic mail, ftp, the World Wide Web ("WWW"), and other services including secure and/or non-public services. The WWW service can be understood as allowing a server computer system (e.g., a Web server or a Web site) to send Web pages of information to a remote client computer system. The remote client computer system can then display the Web pages. Generally, each resource (e.g., computer or Web page) of the WWW is uniquely identifiable by a Uniform Resource Locator ("URL") which is generally associated with an address, such as an IP (Internet Protocol) address. To view a specific Web page, a client computer system specifies the URL for that Web page in a request. The request is forwarded to the Web server that supports that Web page. When that Web server receives the request, it sends that Web page to the client computer system. When the client computer system receives that Web page, it typically displays the Web page using a browser. A browser is a special-purpose application program that effects the requesting of Web pages and the displaying of Web pages. [0094] Web pages can be defined using a hypertext Markup Language (HTML) or similar constructs that allow a server computer to indicate how a web page looks and/or behaves. HTML provides a standard set of tags that define how a Web page is to be displayed. When a user indicates to the browser to display a Web page, the browser sends a request to the server computer system to transfer to the client computer system data and/or one or more electronic documents that defines the Web page. 
When the requested data is received by the client computer system, the browser displays the Web page as defined by the document and performs indicated behaviors. The document contains various tags that control the displaying of text, graphics, controls, and other features. The HTML document can contain URLs of other Web pages available on that server computer system or other server computer systems.
[0095] As mentioned above, various embodiments of the present invention provide methods and/or systems for gathering signature sequence information over a communications network. According to specific embodiments of the invention, a client system is provided with a set of interfaces that allow a user to select the samples that they want to compare. The client system displays information regarding the expression data and displays an indication of an action that a user is to perform to request a comparison. In response to user input, the client system sends to a server system the necessary information to access signature sequence information. The server system uses the request data, and optionally one or more sets of server data, to process the request. According to specific embodiments of the present invention, a client system is, or has previously been, provided with an executable code file that allows the client system to receive the data and analyze the data.
Computer Program Product
[0096] Computer program products are also aspects of the present invention. Any methods of the present invention can be performed in a computer. For example, the present invention includes a computer program product for analysis of differential gene expression. The program includes code that receives as input a plurality of signature sequences from a plurality of samples; code that selects samples from the plurality of samples to be analyzed; and code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples using a statistical test of significance, where the codes are stored on a tangible medium.
[0097] In one embodiment, the statistical test of significance is a two-tailed normal approximation test. In a further embodiment, the program includes code that removes signature sequences that fail to meet a minimum abundance test. Code is optionally provided that removes signature sequences that do not match a known genome, and/or that contain more than one ambiguous nucleotide.
[0098] In another embodiment, the program includes code that receives as input a plurality of signature sequences from a plurality of samples, code that gathers available information on each signature sequence from other databases, code that matches the specific signature sequence with the information gathered, and code that stores the matched information in a tangible medium. In another aspect, the program includes code that compares p1 and p2 to assign a first parameter and/or a second parameter as described above. In another aspect, code is provided that generates tables of the signature sequences that contain the frequency of the signature sequence in a sample, and/or the probability of differential expression value determined for the signature sequence.
[0099] One embodiment of the invention can be implemented on a general purpose computer using a suitable programming language such as PERL, Java, C++, Cobol, C, Pascal, Fortran, PL1, LISP, assembly, etc. In the interest of clarity, not all features of an actual implementation are described in this specification. It will be understood that in the development of any such actual implementation (as in any software development project), numerous implementation-specific decisions must be made to achieve the developers' specific goals and subgoals, such as compliance with system- and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of software engineering for those of ordinary skill having the benefit of this disclosure.
Systems
[0100] Systems for analysis of differential gene expression are also a part of the present invention. These systems include a processor and a computer readable medium coupled to the processor, said computer readable medium storing a computer program. The computer program includes code for any of the analysis used in the present invention. For example, the code can include instructions that receive as an input a plurality of signature sequences from a plurality of samples; instructions that select samples from the plurality of samples to be analyzed; and instructions that determine the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples. Other relevant codes are described under the computer program products section herein.
[0101] Logic systems and methods such as described herein can include a variety of different components and different functions assembled in a modular fashion. Different embodiments of the invention can include different mixtures of elements and functions and can group various functions as parts of various elements. The invention is described in terms of systems that include many different innovative components and innovative combinations of innovative components and known components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification. The functional aspects of the invention that are implemented on a computer can be implemented or accomplished using any appropriate implementation environment or programming language, such as PERL, C, C++, Cobol, Pascal, Java, Java-script, HTML, XML, dHTML, assembly or machine code programming, etc.
[0102] The present invention encompasses a variety of specific embodiments for performing these steps. As further described below, the request for importing the data collected can be received in a variety of ways, including through one or more graphical user interfaces provided by the collection system to a database or by the collection system receiving an email or other digital message or communication from the client system connected to the database. Thus, according to specific embodiments of the present invention, data and/or indications can be transmitted to the database using any method for transmitting digital data, including HTML communications, FTP communications, email communications, wireless communications, etc. In various embodiments, indications of desired data can be received from a human user selecting from a graphical interface at a computing device.
[0103] After the request is received, the collection system according to specific embodiments of the present invention accesses the requested data. As discussed further below, a collection system can hold data files prior to receiving a request for particular data or the collection system can create requested data while responding to a request from a user to receive the data. When the data is available at the collection system, the collection system transmits or imports the data to a client system. At the client system, a logic routine can be used to access the file that is transmitted.
[0104] Accessing the data can be done with or without the active participation of a human user. For example, a voice command can be spoken by a user, a key may be depressed by a user, a button on a scientific device can be depressed by a user or selecting using any pointing device can be effected by a user. Thus, in different embodiments, requested data can be submitted by automated equipment at a client site. For example, a scientific device can be programmed to automatically request needed datasets from a collection system according to specific embodiments of the present invention.
EXAMPLES
[0105] The following example is offered to illustrate, but not to limit the claimed invention.
Example 1 - Differential gene expression analysis of a first sample and a second sample
[0106] In this example, signature sequence data, e.g., generated from an MPSS experiment, is collected from a first sample and a second sample, e.g., a treated and untreated sample. The data are imported into a database where the user can analyze the data. This process is illustrated in Figure 8 Panels A, B, C, and D. In Panel A Step 800, signature sequences in the database from the treated and untreated sample are grouped based on whether or not the signature sequence is a non-ambiguous signature, e.g., a signature sequence that contains one or fewer ambiguous nucleotides. If the statement is true, as in Step 801, the signature sequence is selected. If the statement in Step 800 is false, the signature sequence is not selected for further analysis. In Panel B Step 803, signature sequences are selected that match to a known sequence in a known genome. If the signature sequence matches a known sequence in a known genome, the statement is true as in Step 804 and the signature sequence is selected. If the statement is false, as in Step 805, the signature sequence is not selected for further analysis. For each pair of samples, Panel C Step 806, the samples are compared in a pairwise comparison using a statistical test of significance, as in Step 807, to identify signature sequences showing a statistically significant differential expression, thereby assigning a probability of differential expression value. An example of a procedure for this test is illustrated in Figure 2. An additional filtering step can also be applied to the data. For example, signature sequences that, for a desired significance level, e.g., 0.05, 0.01, or 0.001, as in Panel D Step 808, meet or exceed a minimum abundance test are selected as the signature sequences of interest for the differential expression, as in Steps 810-811. If a signature sequence fails the minimum abundance test, the signature sequence is not selected for further examination or analysis, as in Step 812. The samples are optionally compared to assign parameters as described in the present invention and create tables of the expression data.
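The pairwise comparison of Step 807 can be sketched with a two-tailed normal approximation test. The test statistic itself appears only as an image in this text, so the sketch assumes the standard pooled two-proportion form z = (p1 - p2) / sqrt(p q (1/n1 + 1/n2)), with p1 = x1/n1, p2 = x2/n2, p = (x1 + x2)/(n1 + n2) and q = 1 - p. The function name and the handling of a signature that is absent from both samples are also assumptions.

```python
import math
from typing import Tuple


def differential_expression(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-tailed p-value) for one signature in two samples.

    x1, x2: abundance (count) of the signature in the first and second sample.
    n1, n2: total number of signatures sequenced in each sample (assumed large).
    """
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion
    q = 1 - p
    standard_error = math.sqrt(p * q * (1 / n1 + 1 / n2))
    if standard_error == 0:
        return 0.0, 1.0  # assumption: no evidence of change when the pooled variance is zero
    z = (p1 - p2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed area under the normal curve
    return z, p_value
```

A negative z in this sketch corresponds to a higher proportion in the second sample, i.e., an upward change from the first sample to the second.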
[0107] It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims.
[0108] While the foregoing invention has been described in some detail for purposes of clarity and understanding, it will be clear to one skilled in the art from a reading of this disclosure that various changes in form and detail can be made without departing from the true scope of the invention. For example, all the techniques and apparatus described above can be used in various combinations. All publications, patents, patent applications, and/or other documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication, patent, patent application, and/or other document were individually indicated to be incorporated by reference for all purposes.

Claims

What is claimed is:
1. A method of differential gene expression analysis, the method comprising:
collecting data from a plurality of samples, each sample comprising a population of signature sequences that represents greater than about 150,000 mRNA molecules;
importing the data into a database;
analyzing the data for changes in gene expression using a pairwise comparison of a first sample and a second sample selected from the plurality of samples using a statistical test of significance; thereby generating a probability value of a differential expression for each signature sequence; and,
providing the probability of the differential expression value to a user or an automated system.
2. The method of claim 1, wherein the statistical test of significance is a normal approximation test, which comprises a two-tailed test using a first equation:
Figure imgf000038_0001
wherein the n1 is the total number of signature sequences generated from mRNA molecules or cloned cDNA copies sequenced in the first sample, and the n2 is the total number of signature sequences generated from mRNA molecules or cloned cDNA copies sequenced in the second sample, wherein the abundance of an individual signature sequence in the first sample is represented by the x1 and the abundance of the individual signature sequence in the second sample is the x2, and wherein the p1 and the p2 are represented by a second equation and a third equation:
$p_1 = \frac{x_1}{n_1}$, and $p_2 = \frac{x_2}{n_2}$;
wherein the p is represented by a fourth equation:
$p = \frac{x_1 + x_2}{n_1 + n_2}$;
wherein the q is represented by a fifth equation:
$q = 1 - p$; and,
wherein the
Figure imgf000039_0001
and n2 are large.
3. The method of claim 2, wherein the difference of the p1 and the p2 is distributed according to a first expression:
[Expression image imgf000039_0002; not reproduced in this text]
4. The method of claim 2, wherein the population of signature sequences represents greater than about 200,000 mRNA molecules.
5. The method of claim 2, wherein the population of signature sequences represents greater than about 300,000 mRNA molecules.
6. The method of claim 1, wherein the population of signature sequences is generated by massively parallel signature sequencing (MPSS).
7. The method of claim 1, wherein one or more abundance estimates for an individual signature sequence from the population of signature sequences are generated from independent sampling procedures.
8. The method of claim 7, wherein the independent sampling procedures comprise 2-stepper or 4-stepper procedures.
9. The method of claim 7, further comprising choosing the independent sampling procedure which produces a highest value of the individual signature sequence.
10. The method of claim 1, wherein an individual signature sequence from the population of signature sequences is an average from more than one 2-stepper dataset or from more than one 4-stepper dataset.
11. The method of claim 1, further comprising generating a table for each sample, wherein the table comprises a list of unique signature sequences from the population of signature sequences and one or more abundance estimates produced for each of the unique signature sequences found in the each sample.
12. The method of claim 11, wherein the table is generated from the plurality of samples.
13. The method of claim 1, further comprising filtering the population of signature sequences, wherein the filtering comprises removing sequences from the population of signature sequences that do not match to a known sequence in a known genome.
14. The method of claim 13, wherein the filtering step occurs before the analyzing the data step.
15. The method of claim 1, further comprising removing the data that fail a minimum abundance test.
16. The method of claim 1, wherein the data are categorical data.
17. The method of claim 1, wherein the database is a relational database.
18. The method of claim 1, wherein the steps are performed in a computer.
19. The method of claim 1, wherein the probability of differential expression value is stored in a storage medium.
20. The method of claim 1, wherein the first sample and the second sample comprise a first recombinant inbred line and a second recombinant inbred line, a first treated sample and a second untreated sample, a first infected sample and a second uninfected sample, a first disease-state sample and a second non-disease state sample or a first developmental stage sample and a second different developmental stage sample.
21. The method of claim 2, further comprising comparing the p1 and the p2 using a comparison test, wherein the comparison test comprises the steps of:
i) choosing either the data from the first and the second sample generated by a 2-stepper procedure or the data from the first and the second sample generated by a 4-stepper procedure based on the procedure that gives a higher total stepper value, wherein the higher total stepper value is the p1 plus the p2, thereby providing a chosen stepper data; and
ii) comparing the p1 value from the first sample and the p2 value from the second sample in the chosen stepper data for a direction of change, wherein the direction of change is an upward change, a downward change or an inconsistent change.
22. The method of claim 21, further comprising assigning a first parameter to the chosen stepper data, wherein the first parameter is a T when the 2-stepper procedure is used for the comparison test or is a F when the 4-stepper procedure is used for the comparison test, thereby providing a stepper value.
23. The method of claim 22, wherein the first parameter is stored in a database.
24. The method of claim 21, further comprising assigning a second parameter to the each signature sequence from the population of signature sequences, wherein the second parameter is 1, 2, 4, 14, or 24, wherein the 1 is assigned when the downward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the downward change, wherein the 2 is assigned when the upward change is found in the comparison test and the 2-stepper procedure and the 4-stepper procedure both exhibit the upward change, wherein the 4 is assigned when the direction of change is inconsistent, wherein the 14 is assigned when the downward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used, and wherein the 24 is assigned when the upward change is found in the comparison test and when the data from only the 2-stepper procedure or the 4-stepper procedure can be used.
25. The method of claim 24, wherein the second parameter is stored in a database.
26. The method of claim 1, wherein the data are collected from a central database and the data are stored on a local system.
27. A method of gathering information on a signature sequence, the method comprising:
collecting data from a plurality of samples, each sample comprising a population of signature sequences that represents greater than about 150,000 mRNA molecules;
importing the data into a database;
gathering information on at least one unique signature sequence from the population of signature sequences from at least one other database or at least one web page or a combination of both, thereby generating unique information for the each unique signature sequence; and
providing the unique information for the each unique signature sequence to a user or an automated system.
28. The method of claim 27, wherein the at least one other database is a sequence database.
29. The method of claim 27, wherein the at least one other database is a publication database.
30. The method of claim 27, wherein the at least one other database is a gene annotation database.
31. A computer program product for analysis of differential gene expression, the program comprising:
code that receives as input a plurality of signature sequences from a plurality of samples;
code that selects samples from the plurality of samples to be analyzed; and
code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples,
wherein the codes are stored on a tangible medium.
32. A system for analysis of differential gene expression, the system comprising:
a processor; and
a computer readable medium coupled to the processor, said computer readable medium storing a computer program comprising:
code that receives as input a plurality of signature sequences from a plurality of samples;
code that selects samples from the plurality of samples to be analyzed; and
code that determines the probability of differential expression for each signature sequence from the selected samples using a pairwise comparison of a first sample and a second sample from the plurality of samples.
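
The pairwise comparison recited in claims 2, 21, 22 and 24 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions and is not the applicants' implementation. The first equation of claim 2 is an image that this text does not reproduce, so the sketch assumes the textbook pooled two-proportion statistic, z = (p1 - p2) / sqrt(p q (1/n1 + 1/n2)), with p = (x1 + x2) / (n1 + n2) and q = 1 - p. The function names pooled_z_test, direction and compare_steppers are hypothetical. The sketch further assumes that an "upward change" means p2 is greater than p1, that an "inconsistent change" means the 2-stepper and 4-stepper data disagree on direction, and that a tie in total stepper value goes to the 2-stepper data; the claims do not settle any of these points.

```python
import math


def pooled_z_test(x1, n1, x2, n2):
    """Two-tailed normal-approximation test for one signature sequence.

    x1, x2: abundance (count) of the signature in the first and second sample.
    n1, n2: total number of signatures sequenced in each sample.
    Returns (z, p_value). The z formula is the textbook pooled statistic,
    ASSUMED here because the claim's own first equation is not reproduced.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    p = (x1 + x2) / (n1 + n2)          # pooled proportion (assumed form)
    q = 1.0 - p
    se = math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Signature absent from (or saturating) both samples: no evidence.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-tailed tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def direction(p1, p2):
    """Direction of change from sample 1 to sample 2 (ASSUMED convention)."""
    if p2 > p1:
        return "up"
    if p2 < p1:
        return "down"
    return "none"


def compare_steppers(two=None, four=None):
    """Sketch of the comparison test of claims 21, 22 and 24.

    two, four: (p1, p2) pairs from the 2-stepper and 4-stepper data,
    or None when that procedure's data cannot be used.
    Returns (flag, code): flag is 'T' (2-stepper) or 'F' (4-stepper);
    code is 1, 2, 4, 14 or 24, or None when p1 equals p2 (a case the
    claims do not cover).
    """
    usable = {k: v for k, v in (("T", two), ("F", four)) if v is not None}
    if not usable:
        raise ValueError("no usable stepper data")
    # Claim 21(i): choose the procedure with the higher p1 + p2.
    # On a tie, max() keeps the first key, so 'T' wins (ASSUMED tie rule).
    flag = max(usable, key=lambda k: sum(usable[k]))
    chosen = direction(*usable[flag])
    if len(usable) == 1:
        # Only one procedure usable: codes 14 (down) and 24 (up).
        return flag, {"down": 14, "up": 24}.get(chosen)
    other = "F" if flag == "T" else "T"
    if direction(*usable[other]) != chosen:
        return flag, 4                  # procedures disagree: inconsistent
    return flag, {"down": 1, "up": 2}.get(chosen)
```

For example, calling pooled_z_test(300, 200000, 150, 200000) compares a signature counted 300 times in one sample with the same signature counted 150 times in another, each sample containing 200,000 signatures in total, and compare_steppers(two=(0.002, 0.001), four=(0.0015, 0.0005)) returns the stepper flag together with the direction code for that signature.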

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002360560A AU2002360560A1 (en) 2001-12-11 2002-12-10 Method for the analysis of differential gene expression

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US34103001P 2001-12-11 2001-12-11
US60/341,030 2001-12-11

Publications (2)

Publication Number Publication Date
WO2003050264A2 (en) 2003-06-19
WO2003050264A3 (en) 2003-12-04

Family

ID=23335969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/039650 WO2003050264A2 (en) 2001-12-11 2002-12-10 Method for the analysis of differential gene expression

Country Status (2)

Country Link
AU (1) AU2002360560A1 (en)
WO (1) WO2003050264A2 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAN ET AL.: 'Power-Sage: comparing statistical tests for sage experiments' BIOINFORMATICS vol. 16, no. 11, 2000, pages 953 - 959, XP002965979 *
VAN KAMPEN ET AL.: 'Usage: a web-based approach towards the analysis of sage data' BIOINFORMATICS vol. 16, no. 10, 2000, pages 899 - 905, XP001024467 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017222596A1 (en) 2015-06-23 2017-12-28 Tupac Bio, Inc. Computer-implemented method for designing synthetic dna, and terminal, system and computer-readable medium for the same
JP2019527443A (en) * 2015-06-23 2019-09-26 タパック バイオ, インコーポレイテッドTupac Bio, Inc. Computer mounting method for designing synthetic DNA, terminal, system and computer readable medium for designing synthetic DNA
EP3475422A4 (en) * 2015-06-23 2021-01-06 Tupac Bio, Inc. Computer-implemented method for designing synthetic dna, and terminal, system and computer-readable medium for the same

Also Published As

Publication number Publication date
AU2002360560A1 (en) 2003-06-23
WO2003050264A3 (en) 2003-12-04
AU2002360560A8 (en) 2003-06-23

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP