GENETIC ANALYSIS OF GENE EXPRESSION IN HETEROSIS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application No.
60/341,031 filed December 11, 2001, entitled "Genetic Analysis of Gene Expression in Heterosis" and naming Jing-Zhong Lin et al. as the inventors. This prior application is hereby incorporated by reference in its entirety.
STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT
[0002] Not applicable.
FIELD OF THE INVENTION
[0003] The invention relates to novel methods of analyzing gene expression in heterotic organisms, and identifying genes that play roles in heterosis.
BACKGROUND OF THE INVENTION
[0004] The phenomenon of heterosis, or hybrid vigor, refers to the increased fitness of a hybrid offspring in comparison to its two inbred parents. Increased fitness is displayed in any of a variety of ways, including increased size, yield, disease resistance, and the like.
[0005] Heterosis is of particular interest and importance to the agricultural community, which has long taken advantage of the increased fitness of hybrid animals and plants. The high yield of many domesticated crops, including rice, sugar beet, and corn, is the result of heterosis. Breeders of domesticated animals, e.g., poultry and beef cattle, also take advantage of the increased value of hybrids.
[0006] Despite the long-standing recognition and use of heterosis, its genetic basis is not clear. Two rival theories are the dominance hypothesis and the overdominance hypothesis. The factors that lead to hybrid vigor are unknown at the genetic level.
[0007] The result of this lack of knowledge about the genetic basis of heterosis is that development of many domesticated plants and animals is a long and expensive process. Essentially a process of trial and error, it involves the crossing of inbred strains and
identified, it must be reproduced by continual crossing of the appropriate inbred parents. It is not possible to predict whether or not an organism will display a trait attributed to hybrid vigor, nor is it generally possible to engineer an organism with this trait.
[0008] Therefore, a need exists for a method to identify genes responsible for heterosis. The present invention provides this and other advantages.
SUMMARY OF THE INVENTION
[0009] The present invention statistically analyzes differential gene expression from hybrid offspring and their inbred parents, and identifies genes that play a role in heterosis. Methods, computer program products, and systems are provided.
[0010] The invention provides a method for identifying a heterosis set of sequences in a heterotic organism that entails collecting a dataset from each of a first inbred parent sample, a second inbred parent sample, a first hybrid sample, and a second hybrid sample, each dataset including a population of sequences and associated values; using a comparison test with the dataset of the first hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating first comparison test results; analyzing the first comparison test results by a heterosis analysis thereby placing sequences in a first set of non-additively expressed sequences and assigning one of a group of heterosis parameters to each sequence in the first set; using the comparison test with the dataset of the second hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating second comparison test results; analyzing the second comparison test results by the heterosis analysis thereby placing sequences in a second set of non-additively expressed sequences and assigning one of the group of heterosis parameters to each sequence in the second set; identifying a heterosis set of sequences, wherein each sequence in the heterosis set is present in and is assigned the same heterosis parameter in both the first and second sets of non-additively expressed sequences; and providing the heterosis set of sequences to e.g., a user or an automated system.
[0011] In one embodiment, the heterotic organism is a domesticated animal or a domesticated plant. In one example embodiment described herein, the heterotic organism is an oyster, but the approaches described can be extended to any other heterotic organism of
embodiment, the samples include RNA.
[0012] In one embodiment, the heterosis analysis entails calculating a midparent value (MP) for each sequence, wherein the MP value is the average of the first and second inbred parent values associated with the sequence; placing the sequence in the first set of non-additively expressed genes if the comparison test between the first hybrid value and the MP value is significant; and placing the sequence in the second set of non-additively expressed genes if the comparison test between the second hybrid value and the MP value is significant.
[0013] In a preferred embodiment, the heterosis parameters are D+, PD+, OD, D-, PD-
, and UD, and the heterosis analysis entails for each sequence in the datasets, identifying the first and second inbred parents as either a high inbred parent (HP) or a low inbred parent (LP), wherein the HP value is greater than the LP value; assigning the sequence the parameter of dominant and resembling the HP (D+) if the comparison test between the hybrid value and the MP value is significant and greater than zero and the comparison test between the hybrid value and the HP value is non-significant; assigning the sequence the parameter of partially dominant and resembling the HP (PD+) if the comparison test between the hybrid value and the MP value is significant and greater than zero and the comparison test between the hybrid value and the HP value is significant and less than zero; assigning the sequence the parameter of over dominant (OD) if the comparison test between the hybrid value and the MP value is significant and greater than zero and the comparison test between the hybrid value and the HP value is significant and greater than zero; assigning the sequence the parameter of dominant and resembling the LP (D-) if the comparison test between the hybrid value and the MP value is significant and less than zero and the comparison test between the hybrid value and the LP value is non-significant; assigning the sequence the parameter of partially dominant and resembling the LP (PD-) if the comparison test between the hybrid value and the MP value is significant and less than zero and the comparison test between the hybrid value and the LP value is significant and greater than zero; and assigning the sequence the parameter of under dominant (UD) if the comparison test between the hybrid value and the MP value is significant and less than zero and the comparison test between the hybrid value and the LP value is significant and less than zero.
expression levels. In one embodiment, the dataset includes categorical data. For example, in a preferred embodiment, the dataset is generated by massively parallel signature sequencing (MPSS).
[0015] A variety of comparison tests can be used in the method of the invention. In a preferred embodiment, the comparison test includes a two-tailed normal approximation test.
[0016] The methods of the present invention can optionally include filtering the datasets. For example, the methods can further include removing ambiguous sequences, e.g., sequences where one or more bases or nucleotides is unknown. In one embodiment, the filtering step occurs before the analyzing the data step. In another embodiment, the method further includes removing the sequences that fail a minimum abundance test.
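The optional filtering step can be illustrated with a minimal sketch. It assumes a dataset is held as a mapping from signature sequence to raw count; the function name `filter_dataset`, the `min_count` threshold, and the example signatures are hypothetical and are not taken from the specification, which does not state a particular abundance cutoff.

```python
def filter_dataset(dataset, min_count=2, alphabet="ACGT"):
    """Return a copy of `dataset` (sequence -> count) without ambiguous or rare sequences.

    A sequence is treated as ambiguous if it contains any character outside
    `alphabet` (for example "N" for an unknown base). `min_count` is an
    illustrative minimum abundance threshold, not a value from the source.
    """
    allowed = set(alphabet)
    kept = {}
    for sequence, count in dataset.items():
        if not set(sequence.upper()) <= allowed:
            continue  # one or more bases unknown
        if count < min_count:
            continue  # fails the (assumed) minimum abundance test
        kept[sequence] = count
    return kept


raw = {
    "GATCAAGTTCCGATTAC": 40,  # kept
    "GATCNNGTTCCGATTAC": 12,  # dropped: contains unknown bases
    "GATCTTTGACCGATAAC": 1,   # dropped: below the assumed abundance threshold
}
filtered = filter_dataset(raw, min_count=2)
```

Applying the filter before the heterosis analysis, as in one embodiment above, reduces the number of comparison tests performed on uninformative sequences.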
[0017] The invention also includes methods that entail further analysis of the heterosis set of sequences. In one embodiment, the method entails performing a sequence comparison between the heterosis set of sequences and a sequence database. In another embodiment, the method further entails analyzing the heterosis set of sequences using RT-PCR, northern blotting, and/or cDNA sequencing.
[0018] As discussed in the embodiments, at least one of the steps of the methods of the invention is performed in a computer. In one embodiment, one or more of the datasets, the comparison test results, the sets of non-additively expressed sequences, and the set of heterosis sequences are stored in a storage medium.
[0019] Similarly, the invention also includes computer program products. For example, the invention includes computer program products for identifying a heterosis set of sequences that play a role in heterosis. In one embodiment, the invention includes a program that includes code that receives as input a dataset from each of a first inbred parent sample, a second inbred parent sample, a first hybrid sample, and a second hybrid sample, each dataset including a population of sequences and associated values; code that uses a comparison test with the dataset of the first hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating first comparison test results; code that analyzes the first comparison test results by a heterosis analysis thereby placing sequences in a first set of non-additively expressed sequences and assigning one of a group
of heterosis parameters to each sequence in the first set; code that uses the comparison test
with the dataset of the second hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating second comparison test results; code that analyzes the second comparison test results by the heterosis analysis thereby placing sequences in a second set of non-additively expressed sequences and assigning one of the group of heterosis parameters to each sequence in the second set; code that identifies a heterosis set of sequences, wherein each sequence in the heterosis set is present in and is assigned the same heterosis parameter in both the first and second sets of non-additively expressed sequences; and code that provides the heterosis set of sequences, wherein the codes are stored on a tangible medium.
[0020] Systems for identifying a heterosis set of sequences that play a role in heterosis are also a part of the invention. These systems include a processor and a computer readable medium coupled to the processor; this computer readable medium stores a computer program. This computer readable medium includes code that receives as input a dataset from each of a first inbred parent sample, a second inbred parent sample, a first hybrid sample, and a second hybrid sample, each dataset including a population of sequences and associated values; code that uses a comparison test with the dataset of the first hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating first comparison test results; code that analyzes the first comparison test results by a heterosis analysis thereby placing sequences in a first set of non-additively expressed sequences and assigning one of a group of heterosis parameters to each sequence in the first set; code that uses the comparison test with the dataset of the second hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating second comparison test results; code that analyzes the second comparison test results by the heterosis analysis thereby placing sequences in a second set of non-additively expressed sequences and assigning one of the group of heterosis parameters to each sequence in the second set; code that identifies a heterosis set of sequences, wherein each sequence in the heterosis set is present in and is assigned the same heterosis parameter in both the first and second sets of non-additively expressed sequences; and code that provides the heterosis set of sequences.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] Figure 1 is a flowchart illustrating one embodiment of a method of identifying a heterosis set of sequences. The method includes a method for determining the high parent and the low parent, and a method for calculating hybrid significance.
[0022] Figure 2 is a flowchart illustrating one embodiment of a method for determining the high parent (HP) and low parent (LP).
[0023] Figures 3A-3G are a flowchart illustrating one embodiment of a method for calculating hybrid significance, e.g., performing a heterosis analysis and assigning heterosis parameters to sequences.
[0024] Figure 4 illustrates the results of a heterosis analysis performed using MPSS gene expression data from a species of Pacific oyster. The two bar graphs illustrate the percentage of signatures showing each mode of significant (p<0.001, 0.01, 0.05) non-additive gene expression. Hybrid 35 and Hybrid 53 are reciprocal hybrids.
[0025] Figure 5 illustrates the number and proportion of signatures showing each mode of significant (p<0.001) non-additive gene expression. 53 and 35 are reciprocal hybrids.
DEFINITIONS
[0026] Before describing the present invention in detail, it is to be understood that this invention is not limited to particular devices or biological systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms "a", "an" and "the" include plural referents unless the content clearly dictates otherwise. Thus, for example, reference to "a surface" includes a combination of two or more surfaces; reference to "bacteria" includes mixtures of bacteria, and the like.
[0027] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred materials and methods are described herein. In describing and claiming the present invention, the following terminology will be used in accordance with the definitions set out below.
[0028] Categorical data: The term "categorical data" refers to gene expression data that is not measured on a continuous scale.
[0029] Comparison test: A "comparison test" is a statistical test used to compare two or more sets of data. Comparison tests are well-known to one of skill in the art. One example of a comparison test is a two-tailed normal approximation test for binomial proportions that is described in U.S. Patent Application No. 60/341,030, concurrently filed on December 11, 2001, LOJAQ Docket No. 37-000700US, the contents of which are incorporated by reference.
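For illustration only, a two-tailed normal approximation test for two binomial proportions can be sketched as follows. This is the textbook pooled two-proportion z-test and is an assumption; the exact test of the referenced application is not reproduced here. The function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(count_a, total_a, count_b, total_b):
    """Two-tailed normal approximation test for a difference in binomial proportions.

    count_x is the number of signatures observed for one sequence in sample x,
    total_x is the total number of signatures in that sample.
    Returns (z, p_value); z > 0 means sample A has the higher proportion.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        return 0.0, 1.0  # no variation at all: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-tailed p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 150 vs. 90 signatures out of one million in each sample.
z, p = two_proportion_z_test(150, 1_000_000, 90, 1_000_000)
significant = p < 0.001
```

The sign of `z` gives the direction of the difference and the p-value gives its significance, which are the two pieces of information the heterosis analysis uses.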
[0030] Dataset: A "dataset" is generated by a method for gene expression analysis using a sample from an organism. As described in further detail below, any of a variety of gene expression analyses can be used to generate a dataset. A dataset includes a set of sequences, or a "population of sequences" representing the genes that are expressed in the sample. A dataset also includes "associated values" representing the level of expression of each gene/sequence.
[0031] Domesticated animal: A "domesticated animal" is an animal that is adapted to life in intimate association with and to the advantage of humans. Examples include but are not limited to species of poultry, cows, pigs, and sheep.
[0032] Domesticated plant: A "domesticated plant" is a plant that is adapted to life in intimate association with and to the advantage of humans. Examples include but are not limited to species of rice, corn, wheat, sorghum, sunflower, and many vegetable varieties.
[0033] Heterosis parameter: As used herein, a "heterosis parameter" is a factor assigned to a sequence depending on the relationship between the expression level of the sequence in the hybrid offspring and the expression level in both inbred parents.
[0034] Heterosis: "Heterosis" or "hybrid vigor" is the increased fitness of a hybrid offspring as compared to the midpoint of its two inbred parents. The increased fitness can be measured in a variety of ways including but not limited to growth rate, size, yield, weight, survival, fertility, etc.
plant or animal that displays heterosis.
[0036] Hybrid: A "hybrid" is an individual or a population that is the result of outbreeding, where genetically dissimilar parents are crossed.
[0037] Inbred: An "inbred" is an individual or a population that is the result of inbreeding, where genetically similar or identical parents are crossed.
[0038] MPSS: Massively parallel signature sequencing, or "MPSS", refers to a method for gene expression analysis that is described in Brenner et al., (2000), Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nature Biotech., 18:630-634 and in U.S. Patent Application No. 60/341,030, concurrently filed on December 11, 2001, LOJAQ Docket No. 37-000700US, the contents of which are incorporated by reference. Briefly, MPSS is a method for large scale counting of individual mRNAs in a sample using Lynx Megaclone technology.
[0039] Non-additively expressed sequences: The term "non-additively expressed sequences" refers to one or more sequences that have been determined to be expressed in a hybrid offspring at a level that is significantly different from the expression level that is the midpoint between the two inbred parents.
[0040] Nucleic acid: A "nucleic acid" is a polymer of nucleotides, e.g., adenine, guanine, cytosine, thymine, etc. and modified nucleotides thereof. Examples of nucleic acids include RNA and DNA. A "ribonucleic acid" is a polymer of ribonucleotides, and includes mRNA.
[0041] Oyster: An "oyster" refers to one or more individuals from the family
Ostreidae.
[0042] Polypeptide: A "polypeptide" or "protein" is a polymer of amino acids, e.g., alanine, serine, etc. and modified amino acids thereof.
[0043] Sample: A "sample" is a portion of material isolated from an organism, either the hybrid offspring or the inbred parents. A sample can be an organism, isolated tissue, or cell, or a portion of material made from the organism, isolated tissue, or cell.
[0044] Sequence: A "sequence" refers to a nucleic acid or polypeptide sequence.
For example, the sequence of a nucleic acid includes the description of the primary
sequences.
DETAILED DESCRIPTION
[0045] The method described herein is a novel approach for analyzing gene expression in a heterotic organism. Briefly, gene expression data is collected for two inbred parents and both reciprocal hybrid offspring. Gene expression levels of both parents are compared with gene expression levels of each hybrid offspring, resulting in the classification of each gene. Genes with the same mode of non-additive expression in both hybrids are identified as those that might play an important role in heterosis.
[0046] The methods are useful for identifying genes that play a role in heterosis.
These heterotic genes can be used to study heterosis. For example, the study of heterotic genes can be used to formulate a theory of the genetic basis of heterosis. Knowledge of heterotic genes can allow the prediction of heterotic organisms without empirical observation. Finally, the heterotic genes can be used to create non-hybrid organisms with similar hybrid vigor related properties.
[0047] Among the advantages of the methods of the invention is the ability to analyze organisms with poorly characterized genomes. In addition, the method can be used with data generated using a variety of gene expression techniques.
[0048] The present invention provides methods, computer program products, and systems for identifying genes that play a role in heterosis in a heterotic organism.
[0049] In one embodiment of the invention, the method entails determining gene expression levels for each of two inbred parents and two reciprocal hybrid offspring. The datasets generated by the gene expression analysis include a set of sequences and associated values representing gene expression levels. The gene expression levels of each hybrid are compared to the gene expression levels of both parents using a comparison test. The results of these comparisons are then analyzed by a heterosis analysis. The heterosis analysis produces for each hybrid a set of non-additively expressed sequences with associated heterosis parameters. Sequences that are present in both sets of non-additively expressed sequences and which have the same associated heterosis parameter are identified as sequences that play a role in heterosis.
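The final identification step described above can be sketched as follows, assuming each hybrid's set of non-additively expressed sequences is stored as a mapping from sequence to heterosis parameter. The function name, the sequence labels, and the parameter assignments in the example are hypothetical.

```python
def heterosis_set(first_set, second_set):
    """Return the sequences present in both non-additive sets with the same parameter.

    Each argument maps a sequence to the heterosis parameter assigned to it
    in the analysis of one reciprocal hybrid.
    """
    return {
        sequence: parameter
        for sequence, parameter in first_set.items()
        if second_set.get(sequence) == parameter
    }


# Illustrative results for two reciprocal hybrids.
first_hybrid = {"sigA": "OD", "sigB": "D+", "sigC": "UD"}
second_hybrid = {"sigA": "OD", "sigB": "PD+", "sigD": "UD"}
shared = heterosis_set(first_hybrid, second_hybrid)
```

In this example only `sigA` is retained: `sigB` is non-additive in both hybrids but with different parameters, and `sigC` and `sigD` each appear in only one set.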
[0050] The heterosis analysis of the invention uses results from a comparison test comparing the gene expression levels of a hybrid offspring and both inbred parents, and generates a set of non-additively expressed sequences. The heterosis analysis further includes assigning a heterosis parameter to each sequence in the set of non-additively expressed sequences. A novel feature of the invention is that the heterosis analysis is performed for each reciprocal hybrid offspring.
[0051] In one embodiment, the heterosis analysis entails calculating a midparent
(MP) value for each sequence. The MP value is the average of the first and second inbred parent values associated with the sequence. A comparison test is performed using the MP value and the hybrid offspring value for each sequence. If the comparison test between the hybrid value and the MP value is significant then the sequence is placed in a set of non-additively expressed genes.
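A minimal sketch of the midparent calculation and the selection of non-additively expressed sequences is given below. The comparison test is passed in as a callable so that any suitable test can be used; the threshold rule in the example is only a stand-in for a real statistical test, and all names and values are hypothetical.

```python
def midparent(parent1_value, parent2_value):
    """MP value: the average of the two inbred parent values for one sequence."""
    return (parent1_value + parent2_value) / 2.0


def non_additive_sequences(hybrid, parent1, parent2, is_significant):
    """Return {sequence: hybrid value minus MP value} for significant sequences.

    hybrid, parent1 and parent2 map sequence -> expression value. A sequence
    missing from a parent dataset is treated as having value 0 (an assumption).
    is_significant(hybrid_value, mp_value) -> bool is the comparison test.
    """
    selected = {}
    for sequence, hybrid_value in hybrid.items():
        mp = midparent(parent1.get(sequence, 0.0), parent2.get(sequence, 0.0))
        if is_significant(hybrid_value, mp):
            selected[sequence] = hybrid_value - mp
    return selected


# Placeholder comparison: "significant" if the values differ by 20 or more.
# A real analysis would use a statistical comparison test instead.
def toy_test(hybrid_value, mp_value):
    return abs(hybrid_value - mp_value) >= 20


hybrid = {"sigA": 120, "sigB": 55}
parent1 = {"sigA": 40, "sigB": 50}
parent2 = {"sigA": 60, "sigB": 60}
result = non_additive_sequences(hybrid, parent1, parent2, toy_test)
```

Here `sigA` has an MP value of 50 and a hybrid value of 120, so it is placed in the non-additive set, whereas `sigB` equals its MP value of 55 and is treated as additive.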
[0052] The heterosis parameters are determined by the relationship between the gene expression levels of the hybrid offspring and both parents. In one embodiment, the heterosis parameters include dominant, partially dominant, over dominant, and under dominant.
[0053] In a preferred embodiment of the method of the invention, the heterosis analysis entails identifying the first and second inbred parents as either a high inbred parent (HP) or a low inbred parent (LP) for each sequence in the datasets, wherein the HP value is greater than the LP value. This preferred embodiment further entails assigning to the sequence the heterosis parameter of dominant and resembling the HP (D+), if the comparison test between the hybrid value and the MP value is significant and greater than zero and the comparison test between the hybrid value and the HP value is non-significant; assigning the sequence the parameter of partially dominant and resembling the HP (PD+), if the comparison test between the hybrid value and the MP value is significant and greater than zero and the comparison test between the hybrid value and the HP value is significant and less than zero; assigning the sequence the parameter of over dominant (OD), if the comparison test between the hybrid value and the MP value is significant and greater than zero and the comparison test between the hybrid value and the HP value is significant and greater than zero; assigning the sequence the parameter of dominant and resembling the LP (D-), if the comparison test between the hybrid value and the MP value is significant and
less than zero and the comparison test between the hybrid value and the LP value is non-
significant; assigning the sequence the parameter of partially dominant and resembling the LP (PD-), if the comparison test between the hybrid value and the MP value is significant and less than zero and the comparison test between the hybrid value and the LP value is significant and greater than zero; and assigning the sequence the parameter of under dominant (UD), if the comparison test between the hybrid value and the MP value is significant and less than zero and the comparison test between the hybrid value and the LP value is significant and less than zero.
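The assignment rules of this preferred embodiment can be summarized in a short sketch. It assumes the comparison test is supplied as a callable returning both significance and the sign of the difference; the function names and the fixed-threshold `toy_compare` are hypothetical and stand in for a real statistical test.

```python
def assign_parameter(hybrid, parent1, parent2, compare):
    """Assign D+, PD+, OD, D-, PD- or UD to one sequence, or None if additive.

    compare(a, b) -> (significant, sign), where sign is +1 if a > b,
    -1 if a < b, and 0 if equal.
    """
    hp, lp = max(parent1, parent2), min(parent1, parent2)  # high and low parent
    mp = (hp + lp) / 2.0                                    # midparent value
    sig_mp, sign_mp = compare(hybrid, mp)
    if not sig_mp:
        return None  # not significantly different from MP: additive
    if sign_mp > 0:
        sig_hp, sign_hp = compare(hybrid, hp)
        if not sig_hp:
            return "D+"                       # resembles the high parent
        return "OD" if sign_hp > 0 else "PD+"
    sig_lp, sign_lp = compare(hybrid, lp)
    if not sig_lp:
        return "D-"                           # resembles the low parent
    return "UD" if sign_lp < 0 else "PD-"


def toy_compare(a, b, threshold=10.0):
    """Placeholder test: a difference of `threshold` or more counts as significant."""
    diff = a - b
    return abs(diff) >= threshold, (diff > 0) - (diff < 0)


# Parents at 100 (HP) and 40 (LP), so MP = 70.
labels = {h: assign_parameter(h, 100, 40, toy_compare)
          for h in (130, 98, 85, 72, 55, 42, 20)}
```

With these illustrative values, hybrid levels of 130, 98, 85, 72, 55, 42 and 20 are labeled OD, D+, PD+, additive (None), PD-, D- and UD, respectively.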
Samples
[0054] The samples used in the methods of the invention are obtained from two inbred parents and two reciprocal hybrid offspring from a heterotic organism species. In one embodiment, the heterotic organisms include domesticated plants and animals. In another embodiment, the heterotic organisms include agriculturally important plants and animals. In a further embodiment, the heterotic organism is an oyster. Other examples of heterotic organisms are well known to one of skill in the art and include rice, maize, sugar beets, sunflowers, poultry, beef cattle, etc.
[0055] Any material that can be used to analyze gene expression levels can be used as a sample in the method of the invention. The sample should include material that represents the expressed genes of the sample, e.g., the sample should include mRNA, cloned cDNA copies, protein, and/or the like. For example, the sample can include crude or partially purified cell extract. Alternatively, the sample can include purified nucleic acid or polypeptide. In a preferred embodiment, the sample includes nucleic acids. In another embodiment of the invention, the sample includes mRNA.
Analyzing Gene Expression Levels
[0056] As described in detail above and below, the method of the invention includes comparing datasets generated by methods of analyzing gene expression levels. The datasets represent gene expression levels and include a set of sequences and associated values. The associated value of a sequence represents the expression level of the sequence in the sample.
[0057] A variety of techniques can be used to generate the dataset in the present invention. Typically, massively parallel signature sequencing (MPSS) is used to generate the data. Other examples also include techniques for profiling mRNA, such as SAGE,
expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nature Biotech., 18:630-634; Tyagi, (2000), Taking a census of mRNA populations with microbeads, Nature Biotech., 18:597-598; Okubo et al., (1992), Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression, Nature Genetics, 2:173-179; Bachem et al., (1996) Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development, Plant J., 9:745-753 and Shimkets et al., (1999) Gene expression analysis by transcript profiling coupled to database query, Nature Biotechnology, 17:798-803.
MPSS
[0058] Massively parallel signature sequencing (MPSS) is one useful current technology available for in-depth quantitative expression profiling. MPSS is a "digital" gene expression tool designed for large-scale, simultaneous counting of individual mRNA molecules in a sample. MPSS provides data for all genes in a tissue or cell sample, not just those that have been previously identified and characterized. No prior knowledge of a gene's sequence is required for MPSS, thus gene expression data sets can be generated from any organism. In addition, MPSS has a high sensitivity level. Typically, an MPSS data set involves greater than, e.g., about 100,000 signature sequences, e.g., about 250,000 signature sequences, e.g., about 500,000 signature sequences, e.g., about 750,000 signature sequences, or e.g., about 1,000,000 signature sequences. It has the capacity to routinely detect genes that are expressed at low levels within the cell.
[0059] Counting mRNAs with MPSS is based on the ability to uniquely identify every mRNA in a sample. This is done by generating a signature sequence of 17 or more bases for each mRNA at a specific site upstream from its poly(A) tail (e.g., the last DpnII site in double-stranded cDNA). To measure the level of expression of any given gene, the total number of signatures for that gene's mRNA is counted.
[0060] MPSS signatures for mRNAs in a sample are generated by sequencing double-stranded cDNA fragments cloned onto microbeads using the Lynx Megaclone technology. The Megaclone technology is described in Brenner et al., (2000) In vitro
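As a simple illustration of signature counting, the sketch below tallies identical signatures and expresses each count per million signatures. The per-million scaling is an assumption added for comparability between samples and is not stated in the text; the function name and the example signatures are hypothetical.

```python
from collections import Counter


def count_signatures(signatures):
    """Count identical signatures and scale each count to signatures per million.

    `signatures` is an iterable of signature strings, one per sequenced
    microbead. Returns (raw_counts, per_million).
    """
    raw_counts = Counter(signatures)
    total = sum(raw_counts.values())
    per_million = {
        signature: count * 1_000_000 / total
        for signature, count in raw_counts.items()
    }
    return raw_counts, per_million


# Four hypothetical 17-base signatures drawn from two distinct mRNAs.
reads = ["GATCAAGTTCCGATTAC"] * 3 + ["GATCTTTGACCGATAAC"]
raw_counts, per_million = count_signatures(reads)
```

The resulting mapping from signature to count corresponds to the "population of sequences" and "associated values" that make up a dataset.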
expressed cDNAs, PNAS USA 97:1665-1670.
[0061] MPSS and microbead technology are further described in the following patents and references cited within: U.S. Patent No. 6,306,597 to Macevicz entitled "DNA sequencing by parallel oligonucleotide extensions" issued October 23, 2001; U.S. Patent No. 6,280,935 to Macevicz entitled "Method of detecting the presence or absence of a plurality of target sequences using oligonucleotide tags" issued August 28, 2001; U.S. Patent No. 6,265,163 to Albrecht et al., entitled "Solid phase selection of differentially expressed genes" issued July 24, 2001; U.S. Patent No. 6,235,475 to Brenner et al., entitled "Oligonucleotide tags for sorting and identification" issued May 22, 2001; U. S. Patent No. 6,228,589 to Brenner entitled "Measurement of gene expression profiles in toxicity determination" issued May 8, 2001; U.S. Patent No. 6,175,002 to DuBridge et al., entitled "Adaptor-based sequence analysis" issued January 16, 2001; U.S. Patent No. 6,172,218 to Brenner entitled "Oligonucleotide tags for sorting and identification" issued January 9, 2001; U.S. Patent No. 6,172,214 to Brenner entitled "Oligonucleotide tags for sorting and identification" issued January 9, 2001; U. S. Patent No. 6,150,516 to Brenner et al., entitled "Kits for sorting and identifying polynucleotides" issued November 21, 2000; U.S. Patent No. 6,140,489 to Brenner entitled "Compositions for sorting polynucleotides" issued October 31, 2000; U.S. Patent No. 6,138,077 to Brenner entitled "Method, apparatus and computer program product for determining a set of non-hybridizing oligonucleotides" issued on October 24, 2000; U. S. Patent No. 6,013,445 to Albrecht et al., entitled "Massively parallel signature sequencing by ligation of encoded adaptors" issued January 11, 2000; U.S. Patent No. 5,962,228 to Brenner entitled "DNA extension and analysis with rolling primers" issued October 5, 1999; U.S. Patent No. 5,888,737 to DuBridge et al., entitled "Adaptor-based sequence analysis" issued March 30, 1999; U.S. 
Patent No. 5,780,231 to Brenner entitled "DNA extension and analysis with rolling primers" issued July 14, 1998; U. S. Patent No. 5,750,341 to Macevicz entitled "DNA sequencing by parallel oligonucleotide extensions" issued May 12, 1998; U.S. Patent No. 5,747,255 to Brenner entitled "Polynucleotide detection by isothermal amplification using cleavable oligonucleotides" issued May 5, 1998; U.S. Patent No. 5,969,119 to Macevicz entitled "DNA sequencing by parallel oligonucleotide extensions" issued October 19, 1999; U. S. Patent No. 5,863,722 to Brenner entitled "Method of sorting polynucleotides" issued
for sorting and identification" issued December 8, 1998; U.S. Patent No. 5,763,175 to Brenner entitled "Simultaneous sequencing of tagged polynucleotides" issued June 9, 1998; U.S. Patent No. 5,695,934 to Brenner entitled "Massively Parallel sequencing of sorted polynucleotides" issued December 9, 1997; U.S. Patent No. 5,635,400 to Brenner entitled "Minimally cross-hybridizing sets of oligonucleotide tags" issued June 3, 1997; and, U.S. Patent No. 5,604,097 to Brenner entitled "Methods for sorting polynucleotides using oligonucleotide tags" issued February 19, 1997.
SAGE
[0062] SAGE is another transcript counting technique that generates a tag sequence for each mRNA and a "digital" gene expression profile. SAGE is based on two principles: a short sequence tag derived from a defined position in an mRNA can uniquely identify the transcript, and concatenation of the tags allows for high-throughput sequencing. The length of the SAGE tag is about 10 to about 14 nucleotides. The tag sequence is determined using conventional sequencing technologies. See the following publications and references cited within regarding SAGE: Velculescu et al., (1995), Serial analysis of gene expression, Science, 270:484-487; and Zhang et al., (1997), Gene expression profiles in normal and cancer cells, Science, 276:1268-1272.
[0063] To determine the expression level of a gene using the SAGE technique, the frequency of a sequence tag derived from the corresponding mRNA transcript is measured. As with microarray data described below, adjustments to take into consideration bias and normalization are optionally included in the present invention. See, e.g., Margulies et al., (2001) Identification and prevention of a GC content bias in SAGE libraries, Nucleic Acids Res., 29(12):E60-0.
Microarrays
[0064] Microarrays in the context of the present invention include microarrays that contain a variety of genes, e.g., the Affymetrix human U95 set contains elements to study the expression of more than 60,000 genes and ESTs (Affymetrix, California). The mRNAs from the sample are then allowed to hybridize to the microarray. Microarrays have the advantage of high throughput analysis of multiple samples. However, there are variables that must be considered when analyzing microarray data. First, the desired genes must be
Second, a microarray must exist for the organism of interest, which is not always the case. Third, the detection sensitivity must be optimized to achieve detection of genes expressed at low levels. Fourth, a sample must be compared with a control sample to compensate for several sources of bias and noise in the intensity results. Fifth, compensation must be made for multiple values for a single gene, because multiple values can arise from distinct probe sets within different sections of the gene. Typically, a significance test is possible only if the experiment is replicated several times. See Kerr and Churchill, (2001), Statistical design and the analysis of gene expression microarray data, Biostatistics, 2:183-201; Wodicka et al., (1997), Genome wide expression monitoring in Saccharomyces cerevisiae, Nature Biotech., 15:1359-1367; Lockhart et al., (1996), Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotech., 14:1675-1680; Aach et al., (2000), Systematic management and analysis of yeast gene expression data, Genome Res., 10:431-445; and Wittes and Friedman, (1999), Searching for evidence of altered gene expression: a comment on statistical analysis of microarray data, J. Natl. Cancer Inst., 91:400-401.
[0065] More information on microarrays can be found in the following publications and references cited within: Duggan et al., (1999), Expression profiling using cDNA microarrays, Nature Genetics, 21:10-14; Lipshutz et al., (1999), High density synthetic oligonucleotide arrays, Nature Genetics Suppl., 21:20-24; Evertsz et al., (2000), Technology and applications of gene expression microarrays, in Microarray Biochip technology, Schena, M., Ed., BioTechniques Books, Natick, MA, pp. 149-166; Lockhart and Winzeler, (2000), Genomics, gene expression and DNA arrays, Nature, 405:827-836; Zhou et al., (2000), Information processing issues and solutions associated with microarray technology, in Microarray Biochip technology, Schena, M., Ed., BioTechniques Books, Natick, MA, pp. 167-200; and Hughes et al., (2001), Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer, Nature Biotech., 19:342-347.
Datasets
[0066] Various types of datasets can be generated from gene expression analysis experiments. For example, the datasets can include categorical data, e.g., data generated from MPSS. Data are categorical, e.g., when a sequence in a particular dataset is either present at a certain proportion or absent. Other types of data optionally used in the present
generated from microarrays (where the results are represented by a ratio of the fluorescent levels of two probes) and the like.
Comparison tests
[0067] The methods of the invention include a comparison test to compare the datasets. A variety of comparison tests can be used, for example, a two-tailed normal approximation test, a Chi-Square test, a Fisher exact test, a generalized linear model, Audic and Claverie's Bayesian method and the like. Comparison tests are well known to one of skill in the art; information on statistical tests can be found in a variety of places, for example, textbooks, papers and the world wide web. For example, see Fisher and van Belle, (1993) Biostatistics: a Methodology for the Health Sciences, John Wiley & Sons, New York; Man et al., (2000) POWER_SAGE: comparing statistical tests for SAGE experiments, Bioinformatics, 16(11):953-959; and Audic and Claverie, (1997) The significance of digital gene expression profiles, Genome Research, 7:986-995.
[0068] In one embodiment, the comparison test is a normal approximation test for binomial proportions, which includes a two-tailed test using a first equation:
$$Z = \frac{p_1 - p_2}{\sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

wherein the $n_1$ is a number of, e.g., mRNA molecules or cloned cDNA copies sequenced that represents the population of sequences for the first sample, and the $n_2$ is a number of, e.g., mRNA molecules or cloned cDNA copies sequenced that represents the population of sequences for the second sample, wherein an associated value, e.g., abundance, of an individual sequence in the first sample is represented by the $x_1$ and the associated value, e.g., abundance, of the individual sequence in the second sample is the $x_2$, and wherein the $p_1$ and the $p_2$ are represented by a second equation and a third equation:

$$p_1 = \frac{x_1}{n_1}, \qquad p_2 = \frac{x_2}{n_2}$$

wherein the $\bar{p}$ is represented by a fourth equation:

$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

wherein the $\bar{q}$ is represented by a fifth equation:

$$\bar{q} = 1 - \bar{p}$$

wherein the $n_1$ and $n_2$ are large, e.g., about 300,000 to about 10,000,000 mRNA molecules. Further details on the use of the two-tailed normal approximation test are found in U.S. Patent Application No. 60/341,030, concurrently filed with the parent of the present application on December 11, 2001, the contents of which are incorporated by reference.
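For illustration only, a normal approximation test for two binomial proportions can be sketched in Python as follows. The sketch assumes the standard pooled form of the statistic, in which the difference between the two sample proportions is divided by a standard error computed from the pooled proportion. The function names `two_proportion_z` and `two_tailed_p` are hypothetical and do not come from the referenced application.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z statistic for abundance x1 out of n1 versus x2 out of n2."""
    p1 = x1 / n1
    p2 = x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)   # pooled proportion
    q_bar = 1.0 - p_bar
    se = math.sqrt(p_bar * q_bar * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:                   # sequence absent (or saturating) in both samples
        return 0.0
    return (p1 - p2) / se


def two_tailed_p(z):
    """Two-tailed p-value of a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical abundances: 50 versus 100 copies, each out of one million sequenced.
z = two_proportion_z(50, 1_000_000, 100, 1_000_000)
p = two_tailed_p(z)
```

A negative Z indicates lower abundance in the first sample; the magnitude is compared with the critical value for the chosen significance level.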
Additional Analysis
[0069] The methods of the present invention optionally include filtering the datasets. In one embodiment, the filtering step occurs before the data analysis step. For example, the methods can further include filtering the population of sequences obtained via the gene expression analysis by removing sequences that contain one or more ambiguous nucleotides.
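One way such a prefilter might look is sketched below in Python. It assumes, for illustration, that a dataset is a mapping from sequence to associated value and that any character other than A, C, G, or T counts as ambiguous; the name `remove_ambiguous` and the example signatures are hypothetical.

```python
def remove_ambiguous(dataset, alphabet="ACGT"):
    """Keep only sequences made up entirely of unambiguous nucleotides."""
    allowed = set(alphabet)
    return {
        seq: value
        for seq, value in dataset.items()
        if set(seq.upper()) <= allowed
    }


# Hypothetical 17-nucleotide signatures; the second contains an ambiguous "N".
raw = {"GATCACGTACGTACGTA": 12, "GATCNCGTACGTACGTA": 3}
clean = remove_ambiguous(raw)
```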
[0070] In one embodiment, the data can be filtered by removing sequences that fail to meet a minimum abundance test. In one embodiment, the minimum abundance test includes removing sequences whose abundance, when the samples are analyzed with a normal approximation test, cannot be distinguished from zero. For example, when a dataset is obtained via MPSS, an individual signature sequence whose highest abundance across all MPSS runs is less than about 4 per million is not significantly different from zero when a p<0.05 significance level is chosen; likewise, e.g., an abundance of 6 is not significantly different from zero when a p<0.01 significance level is chosen, and, e.g., an abundance of 11 is not significantly different from zero when a p<0.001 significance level is chosen.
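The minimum abundance test can be illustrated with the following Python sketch. It assumes that abundances are normalized per million signatures, that the comparison is made against an abundance of zero in a library of the same size, and that a pooled normal approximation statistic is used; the Z thresholds are those given elsewhere in this specification. The names `z_versus_zero` and `passes_minimum_abundance` are hypothetical.

```python
import math

# Z thresholds given in this specification for p < 0.05, 0.01, and 0.001.
Z_CRITICAL = {0.05: 1.96, 0.01: 2.58, 0.001: 3.57}


def z_versus_zero(count, n=1_000_000):
    """Pooled two-proportion Z for `count` in n versus 0 in n (assumed setup)."""
    if count == 0:
        return 0.0
    p_bar = count / (2 * n)
    se = math.sqrt(p_bar * (1.0 - p_bar) * (2.0 / n))
    return (count / n) / se


def passes_minimum_abundance(abundances, alpha=0.05):
    """abundances: per-million values of one sequence across all runs."""
    return z_versus_zero(max(abundances)) > Z_CRITICAL[alpha]
```

Under these assumptions the statistic is approximately the square root of the abundance, so an abundance of 3 fails at p<0.05 while 4 passes, and abundances of 6 and 11 fail at p<0.01 and p<0.001, respectively.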
[0071] In another embodiment, the filtering step removes sequences that do not match a known genome. For example, sequences can be matched to known genes by comparison with data available in genomic and/or EST sequence databases, e.g., the National Center for Biotechnology Information (NCBI). If no match is found, the sequence is removed from further analysis. Alternatively, if multiple matches are found, e.g., to a repeated sequence, the sequence is removed from further analysis. Typically, when using MPSS data, sequences that match a genome sequence at 16 or more nucleotides out of 17 and have 1-3 matches in the genome are retained for further analysis.
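The retention rule for MPSS signatures can be sketched in Python as shown below. This is a naive forward-strand scan intended only to make the rule concrete; a practical implementation would use an indexed search and would typically consider both strands. The names `count_hits` and `retain_signature` and the toy genome are hypothetical.

```python
def count_hits(signature, genome, max_mismatches=1):
    """Count genome positions matching the signature with few mismatches."""
    k = len(signature)
    hits = 0
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        mismatches = sum(1 for a, b in zip(signature, window) if a != b)
        if mismatches <= max_mismatches:
            hits += 1
    return hits


def retain_signature(signature, genome, min_hits=1, max_hits=3):
    """Retain a signature that has between min_hits and max_hits genome matches."""
    return min_hits <= count_hits(signature, genome) <= max_hits


# Hypothetical 17-nucleotide signature embedded once in a toy genome.
signature = "GATCACGTTGCAAGGCT"
toy_genome = "TTTTT" + signature + "CCCCC"
```

With a 17-nucleotide signature, allowing one mismatch corresponds to matching at 16 or more nucleotides out of 17.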
Computer Program Product
[0072] Computer program products are also provided by the invention. Any of the methods of the present invention can be performed on a computer. For example, the
that play a role in heterosis. In one embodiment, the invention includes a program that includes code that receives as input a dataset from each of a first inbred parent sample, a second inbred parent sample, a first hybrid sample, and a second hybrid sample, each dataset including a population of sequences and associated values; code that uses a comparison test with the dataset of the first hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating first comparison test results; code that analyzes the first comparison test results by a heterosis analysis thereby placing sequences in a first set of non-additively expressed sequences and assigning one of a group of heterosis parameters to each sequence in the first set; code that uses the comparison test with the dataset of the second hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating second comparison test results; code that analyzes the second comparison test results by the heterosis analysis thereby placing sequences in a second set of non-additively expressed sequences and assigning one of the group of heterosis parameters to each sequence in the second set; code that identifies a heterosis set of sequences, wherein each sequence in the heterosis set is present in and is assigned the same heterosis parameter in both the first and second sets of non-additively expressed sequences; and code that provides the heterosis set of sequences, wherein the codes are stored on a tangible medium.
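The final identification step recited above, in which a sequence is kept only if it is present in both sets of non-additively expressed sequences with the same heterosis parameter, can be sketched in Python as follows. The representation of each set as a mapping from sequence to parameter, the function name `heterosis_set`, and the parameter labels are assumptions made for illustration.

```python
def heterosis_set(first_set, second_set):
    """Return sequences assigned the same heterosis parameter in both hybrids.

    Each argument maps sequence -> heterosis parameter for one hybrid's set
    of non-additively expressed sequences.
    """
    return {
        seq: param
        for seq, param in first_set.items()
        if seq in second_set and second_set[seq] == param
    }


# Hypothetical sequences and parameter labels.
hybrid1 = {"seqA": "above high parent", "seqB": "below low parent", "seqC": "above high parent"}
hybrid2 = {"seqA": "above high parent", "seqB": "above high parent", "seqD": "below low parent"}
shared = heterosis_set(hybrid1, hybrid2)
```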
Systems
[0073] Systems for identifying a heterosis set of sequences are also a part of the present invention. These systems include a processor and a computer readable medium coupled to the processor, said computer readable medium storing a computer program. The computer program includes code for any of the analyses used in the present invention. For example, the code can include instructions that receive as input a dataset from each of a first inbred parent sample, a second inbred parent sample, a first hybrid sample, and a second hybrid sample, each dataset including a population of sequences and associated values; code that uses a comparison test with the dataset of the first hybrid sample and the datasets of the first parent sample and the second parent sample, thereby generating first comparison test results; code that analyzes the first comparison test results by a heterosis analysis thereby placing sequences in a first set of non-additively expressed sequences and assigning one of a group of heterosis parameters to each sequence in the first set; code that
first parent sample and the second parent sample, thereby generating second comparison test results; code that analyzes the second comparison test results by the heterosis analysis thereby placing sequences in a second set of non-additively expressed sequences and assigning one of the group of heterosis parameters to each sequence in the second set; code that identifies a heterosis set of sequences, wherein each sequence in the heterosis set is present in and is assigned the same heterosis parameter in both the first and second sets of non-additively expressed sequences; and code that provides the heterosis set of sequences. Other relevant codes are described, e.g., under the computer program products section herein.
[0074] Logic systems and methods such as those described herein can include a variety of different components and different functions assembled in a modular fashion. Different embodiments of the invention can include different mixtures of elements and functions and can group various functions as parts of various elements. The invention is described in terms of systems that include many different innovative components and innovative combinations of innovative components and known components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification. The functional aspects of the invention that are implemented on a computer can be implemented or accomplished using any appropriate implementation environment or programming language, such as Perl, C, C++, COBOL, Pascal, Java, JavaScript, HTML, XML, dHTML, assembly or machine code programming, etc.
[0075] The present invention encompasses a variety of specific embodiments for performing these steps. As further described below, the request for importing the data collected can be received in a variety of ways, including through one or more graphical user interfaces provided by the collection system to a database or by the collection system receiving an email or other digital message or communication from the client system connected to the database. Thus, according to specific embodiments of the present invention, data and/or indications can be transmitted to the database using any method for transmitting digital data, including HTML communications, FTP communications, email communications, wireless communications, etc. In various embodiments, indications of
computing device.
[0076] After the request is received, the collection system according to specific embodiments of the present invention accesses the requested data. As discussed further below, a collection system can hold data files prior to receiving a request for particular data or the collection system can create requested data while responding to a request from a user to receive the data. When the data is available at the collection system, the collection system transmits or imports the data to a client system. At the client system, a logic routine can be used to access the file that is transmitted.
[0077] Accessing the data can be done with or without the active participation of a human user. For example, a user can speak a voice command, depress a key, depress a button on a scientific device, or make a selection using any pointing device. In other embodiments, the request for data can be submitted by automated equipment at a client site. For example, a scientific device can be programmed to automatically request needed data sets from a collection system according to specific embodiments of the present invention.
EXAMPLES
[0078] The following examples are offered to illustrate, but not to limit the claimed invention.
EXAMPLE 1
[0079] Figure 1 is a flowchart illustrating one embodiment of a method of identifying a heterosis set of sequences. The flowchart illustrates a computer program written in Perl, which was used to analyze datasets generated by MPSS using samples from a heterotic organism.
[0080] The method includes prefiltering the data by removing ambiguous signatures (Step 101), calculating a midparent (MP) (Step 102), determining the high parent (HP) and low parent (LP) (Step 103), calculating hybrid 1 significance (Step 104), and calculating hybrid 2 significance (Step 105).
[0081] Figure 2 is a flowchart illustrating one embodiment of a method for determining the high parent (HP) and low parent (LP), using 2- and 4-stepper datasets
U.S. Patent Application No. 60/341,030, concurrently filed with the parent application on December 11, 2001, the contents of which are incorporated by reference. This embodiment was used in Step 103 of the method illustrated in Figure 1.
[0082] The method includes calculating the proportions of parent 1 and parent 2 for each stepper (Step 201) and evaluating whether or not the sum of the two-stepper proportions of parent 1 and parent 2 is greater than the sum of the four-stepper proportions of parent 1 and parent 2 (Step 202); if the result of Step 202 is False, determining whether or not the four-stepper proportion of parent 1 is greater than the four-stepper proportion of parent 2 (Step 203); if the result of Step 202 is True, determining whether or not the two-stepper proportion of parent 1 is greater than the two-stepper proportion of parent 2 (Step 204). The end of the method is the result that each of the inbred parents (parent 1 and parent 2) is identified as either the low parent (LP) or the high parent (HP) for a given sequence. This information was used in the steps for calculating hybrid significance.
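Steps 201 through 204 can be sketched in Python as follows. This is not the Perl program of the example; it assumes, for illustration, that a proportion is a sequence's count divided by the total count for that parent and stepper, and that parent 2 is treated as the high parent when the two proportions are equal. The function name `assign_high_low_parent` and the keyed-dictionary inputs are hypothetical.

```python
def assign_high_low_parent(counts, totals):
    """Identify the high parent (HP) and low parent (LP) for one sequence.

    counts and totals map (parent, stepper) -> value,
    with parent in {1, 2} and stepper in {2, 4}.
    """
    # Step 201: proportions of each parent for each stepper.
    prop = {key: counts[key] / totals[key] for key in counts}
    # Step 202: use the stepper whose summed parental proportion is larger.
    use_two = prop[(1, 2)] + prop[(2, 2)] > prop[(1, 4)] + prop[(2, 4)]
    stepper = 2 if use_two else 4
    # Step 203 (four stepper) or Step 204 (two stepper): compare the parents.
    parent1_higher = prop[(1, stepper)] > prop[(2, stepper)]
    return {
        "HP": 1 if parent1_higher else 2,
        "LP": 2 if parent1_higher else 1,
        "stepper": stepper,
    }


# Hypothetical counts for one signature, each run totaling one million.
totals = {(1, 2): 1_000_000, (2, 2): 1_000_000, (1, 4): 1_000_000, (2, 4): 1_000_000}
result = assign_high_low_parent({(1, 2): 30, (2, 2): 10, (1, 4): 5, (2, 4): 8}, totals)
```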
[0083] Figure 3A-3G is a flowchart illustrating one embodiment of a method for calculating hybrid significance, e.g., performing a heterosis analysis and assigning heterosis parameters to sequences using datasets generated by MPSS. This embodiment was used in Steps 104 and 105 of the method illustrated in Figure 1.
[0084] As shown in Figure 3A-3B, the method includes setting a significance level, e.g., Z-values of 1.96, 2.58, and 3.57 for corresponding p-values of 0.05, 0.01, and 0.001, respectively (Step 301), calculating proportions of hybrid and MP for each stepper (Step 302), comparing the sum of these proportions (Step 303), using a comparison test to examine the difference between the hybrid value and the midparent value (Step 304), connecting to C1 if the result of Step 304 gives a value that is less than the significance value and greater than the negative of the significance value (Step 305), connecting to B1 if the result of Step 303 gives a value that is less than the significance value and less than the negative of the significance value (Step 306), and connecting to A1 if the result of Step 303 gives a value that is greater than the significance value (Step 307).
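The branching among the three connectors can be sketched in Python as shown below, taking as input the Z value from the comparison of the hybrid value with the midparent value. The function name `route` is hypothetical, and sending a value exactly equal to the threshold to C1 is an assumption, since the text does not address that case.

```python
# Z thresholds given in the text for p-values of 0.05, 0.01, and 0.001.
Z_CRITICAL = {0.05: 1.96, 0.01: 2.58, 0.001: 3.57}


def route(z, alpha=0.05):
    """Choose the flowchart connector for a hybrid-versus-midparent Z value."""
    critical = Z_CRITICAL[alpha]
    if z > critical:
        return "A1"   # hybrid significantly above the midparent
    if z < -critical:
        return "B1"   # hybrid significantly below the midparent
    return "C1"       # not significantly different from the midparent
```

For example, a Z value of 2.5 is routed to A1 at the 0.05 level but to C1 at the 0.01 level.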
[0085] The method also includes the connections to A1, B1, and C1. The connection to A1 is shown in Figure 3C-3D, and includes the steps which assign heterosis parameters to sequences with hybrid values that differ significantly from the high parent.
heterosis parameters to sequences with hybrid values that differ significantly from the low parent. The connection to C1 is shown in Figure 3G, and includes the steps that determine whether or not an associated value, e.g., expression level, indicates low expression or additive expression.
EXAMPLE 2
[0086] A set of heterosis genes from Crassostrea gigas, a species of Pacific oyster, was identified. Samples were obtained from larvae of two inbred populations (35 and 51) of the Pacific oyster and two hybrids (Hybrid 35: 35 x 51 and Hybrid 53: 51 x 35) derived by reciprocally crossing individuals from the two inbred parent populations. Each reciprocal hybrid showed heterosis in shell size at day 5 and growth rate from day 2 to day 5. Gene expression analysis was performed by cloning 3J million cDNAs on plastic beads (Brenner et al. 2000 PNAS 97:1665) and sequencing by massively parallel signature sequencing technology (MPSS®; Brenner et al. 2000 Nature Biotechnology 18:630). The MPSS dataset generated was analyzed using the embodiment of Example 1.
[0087] The clone count, e.g., gene expression levels, for the majority of cDNAs either did not vary significantly between inbreds and hybrids or behaved additively, because there was no significant deviation in the hybrids from the expected mid-parent level. A small proportion of cDNAs (<5%), however, exhibited a pattern of expression that was significantly non-additive in each hybrid (p<0.001). Amongst the non-additively expressed cDNAs, expression in the hybrid was either different from both inbred parents or like one inbred parent but not the other. Figure 4 shows that the largest class of non-additively expressed cDNAs (>500 in each hybrid; p<0.001) was over-expressed relative to both inbred parents. As shown in Figure 5, about 150 cDNAs had the same pattern of non-additive expression in each reciprocal hybrid when compared to both inbred parents, e.g., the same heterosis parameter, and are likely to be associated with growth heterosis. This study illustrates how a comprehensive gene expression profiling technology such as MPSS® can be applied in an organism with a poorly characterized genome to address a problem of significant biological importance, e.g., heterosis.
[0088] It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be
this application and scope of the appended claims.
[0089] While the foregoing invention has been described in some detail for purposes of clarity and understanding, it will be clear to one skilled in the art from a reading of this disclosure that various changes in form and detail can be made without departing from the true scope of the invention. For example, all the techniques and apparatus described above can be used in various combinations. All publications, patents, patent applications, and/or other documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication, patent, patent application, and/or other document were individually indicated to be incorporated by reference for all purposes.