GB2365124A

GB2365124A - Analysis and identification of transcribed genes, and fingerprinting

Info

Publication number: GB2365124A
Application number: GB0018016A
Authority: GB
Inventors: Sten Linnarsson; Patrik Ernfors
Original assignee: GLOBAL GENOMICS AB; Karolinska Innovations AB
Current assignee: GLOBAL GENOMICS AB; Karolinska Innovations AB
Priority date: 2000-07-21
Filing date: 2000-07-21
Publication date: 2002-02-13
Anticipated expiration: 2020-07-21
Also published as: GB0018016D0; GB2365124B

Abstract

A method for identifying mRNA molecules present in a sample, and also for quantifying the expression levels of the mRNA molecules. A profile of gene identities and/or expression levels is produced by generating two independent patterns characteristic of the population of mRNA molecules expressed in the sample and analysing these patterns using a combinatorial algorithm. Gene expression by different cell types or of the same cell types under different conditions may be compared. In this way, genes may be identified which play a role in determining various cellular processes and states, including susceptibility to external factors, development, and disease.

Description

METHODS FOR ANALYSIS AND IDENTIFICATION OF TRANSCRIBED GENES, AND FINGERPRINTING

The present invention relates to methods for identifying genes and patterns of genes that are expressed. Specifically, the present invention allows for analysis of genes that are transcribed, and for comparison of patterns of transcription in different cells or the same cells under different conditions or stages of development, and further allows for quantitation of the level of expression in a pool of RNA from many different genes. In a few years the sequences of the human and rodent genomes will be complete. However, defining the role of each of the estimated 50,000-140,000 genes will be a major task. An even greater task will be to understand how expression of the genome functions as a whole in a living organism.

Only a fraction of the total number of genes present in the genome is expressed in any given cell. The relatively small fraction of the total number of genes that is expressed in a cell determines its life processes, e.g. intrinsic and extrinsic properties of the cell including development and differentiation, homeostasis, its response to insults, cell cycle regulation, aging, apoptosis, and the like. Alterations in gene expression decide the course of normal cell development and the appearance of diseased states, such as cancer. Because the profile of gene expression in any given cell has direct consequences for its nature, methods for analyzing gene expression on a global scale are of critical importance. Identification of gene-expression profiles will not only further understanding of normal biological processes in organisms but also provide a key to prognosis and treatment of a variety of diseases or condition states in humans, animals and plants associated with alterations in gene expression. In addition, since differential gene expression is associated with predisposition to diseases, infectious agents and responsiveness to external treatments (Alizadeh et al., 2000; Cho et al., 1998; Der et al., 1998; Iyer et al., 1999; McCormick, 1999; Szallasi, 1998), identification of such gene-expression profiles can provide a powerful diagnostic tool for diseases, and a tool to identify new drugs for treating or preventing such diseases. This technology will also be immensely powerful for gene discovery.

The only means of achieving this is to measure all genes expressed in particular tissues/cells at a particular time on a large scale, preferably in one experiment. Less than a decade ago the concept of being able to simultaneously measure the concentration of every transcript in a cell in a single experiment would have been deemed undoable. However, use of DNA microarrays and other technological advances in the past few years have stimulated an extraordinary surge of interest in this field (Bowtell, 1999; Brown and Botstein, 1999; Duggan et al., 1999; Lander, 1999; Southern et al., 1999).

DNA microarrays are based on a solid support. Pieces of known, identified DNA sequences (cDNA or synthetic oligonucleotides) are attached to the solid support in high-density grids and a pool of labeled RNA or cDNA from cell(s) or tissue(s) is hybridised (Duggan et al., 1999; Lipshutz et al., 1999). The intensity of the hybridisation signal at each grid position is measured and gives an estimate of the expression. This procedure requires prior knowledge of the genes under study. DNA microarrays based on oligonucleotides attached to a glass surface, covering around 30,000 unique gene sequences ordered in high density on small slides (i.e. approximately one third to one fourth of all genes), are now available from Affymetrix® (Lipshutz et al., 1999). Thus, microarrays are based on a high capacity system to monitor the expression of many genes in parallel with high sensitivity. cDNA microarrays are prepared by high speed robotic printing of cDNAs on glass, providing quantitative expression measurements of the corresponding genes following hybridisation of the query pool of RNA (Brown and Botstein, 1999; Duggan et al., 1999; Schena et al., 1995; Schena et al., 1996). Differential expression measurements of genes are made by means of simultaneous hybridization of different pools of RNA. Oligonucleotide arrays are based on high-density synthesis, on a solid support, of arrays of oligonucleotides corresponding to cDNA or expressed sequence tag sequences, to which a query pool of RNA is hybridized (Lipshutz et al., 1999).

Although very powerful, like any technology it has drawbacks: (i) it requires prior knowledge of the genes whose expression is studied, (ii) it is indirect as it relies on hybridisation of RNA or equivalents to the attached templates, (iii) the most reproducible method available, the synthetic oligonucleotide array, is very expensive (approximately 4000 USD for each determination of 30,000 genes), (iv) manufacturing the arrays in-house requires individual amplification and arraying of each gene to be studied, a major if not impossible task for each lab interested in the technology.

A number of alternative methods for detection and quantification of gene expression are available. These include for instance Northern blot analysis (Alwine et al., 1977), S1 nuclease protection assay (Berk and Sharp, 1977), serial analysis of gene expression (SAGE) (Velculescu et al., 1995) and sequencing of cDNA libraries (Okubo et al., 1992). However, all these are low-throughput approaches not suitable for global gene expression analysis. Differential display (Liang and Pardee, 1992) and related technologies contrast with microarray technology by not being based on a solid support. The advantage of these technologies over microarrays is that no prior sequence information is required to execute the experiment. However, differential display and related technologies have two shortcomings that make them unsuitable for large-scale gene expression analysis: (i) the identity of the genes which are under study in each experiment can only be determined following cloning and sequence analysis of each of the cDNAs in every experiment and (ii) the mRNAs are identified multiple times in every experiment.

A method for large scale restriction fragment length polymorphism analysis of genomic DNA (KeyGene EP0969102) involves enzymatic cleavage of genomic DNA with one or two restriction enzymes and ligation of specific adapters to the fragments. When two different restriction enzymes are used, two adapters are used (one for each enzyme). One of these two adapters is biotinylated. A solid phase support is then used to isolate the fragments that contain at least one restriction site to which the biotinylated adaptor is complementary. This procedure leads to an enrichment which improves resolution and reduces background in the PCR. Multiplex PCR is performed using primers directed against the adapters, with 3 or more nucleotides unique to each primer.

The Celera GeneTag technology is for quantitatively measuring the expression levels of virtually all RNA transcripts in a cell or tissue, whether previously known or unknown. This allows simultaneous monitoring of known genes and discovery of novel genes, saving significant time and costs relative to sequencing or chip-based strategies. GeneTag technology provides this information within a biological context, so the genes you discover are specific to the biological pathway, disease model, or drug response being investigated.

The GeneTag process is based on the principle that unique PCR fragments are generated for each cDNA. The fragments are separated by fluorescent capillary electrophoresis, then size-called and quantitated using Celera's proprietary algorithms. The amount of a specific mRNA is then determined by the fluorescent intensity of its cognate PCR fragment. Using Celera's proprietary GeneTag database, the cDNA fragment peaks are matched with their corresponding gene names.

In this methodology, total RNA is isolated from the cell line(s) or tissues of interest. The GeneTag process requires at least 200 µg of total RNA.

Complementary DNA is prepared from the total RNA samples then restricted twice in a stepwise fashion. 3'-end capture is used after each digest to isolate the fragment of interest. Using their method, adapters are ligated to both ends of the fragment to serve as PCR primer sites. Thus, multiple fragments are potentially prepared for each gene.

The adapter-ligated cDNA samples are amplified using a set of primers, which have two selective bases on each end (+2/+2). Combinations of these four bases yield a total of 128 unique PCR primer pairs.

The 128 PCR reactions from each sample are analyzed individually by capillary electrophoresis, one reaction per capillary plus an internal lane standard. Each gene presents one unique fragment that can be "binned" based on its size (bp) and the specific primer pair used to generate it. This binning process enables rapid data analysis and gene identification.

Celera's proprietary software assigns sizing and quantitation measures to each peak in the electropherogram. Internal size standards allow direct comparison of electropherograms from treated samples and controls.

All 128 electropherograms from both the treated samples and the control samples are analyzed and compared automatically. Peaks (cDNA fragments) exhibiting a statistically significant difference between sample and control are flagged and quantitated.

The two steps of purification lead to a requirement for large amounts of starting material (i.e. >200 µg). The small number of sub-reactions or subdivisions (128) leads to difficulties in assigning known sequences to fragments (since many genes will run as doublets with other genes, i.e. will appear as fragments with the same size). Increasing the number of frames would at the same time increase the required amount of starting material even more.

Another method (US patents 6010850 and 5712126) uses a Y-shaped adaptor to suppress non-3' fragments in the PCR. In this method, cDNA is digested with a restriction enzyme and ligated to a Y-shaped adapter. The Y-shaped adapter enables selective amplification of 3' fragments. However, since the entire pool of cDNA is present, there are numerous opportunities for primers to hybridize non-specifically. A protocol which purifies 3' fragments, yet does not require large amounts of starting material, is thus desirable.

Digital Gene Technologies (http://www.dgt.com/) provide display of unique 3' fragments, each representing a single gene and with each gene represented only once. The method (US patent 5459037) involves isolating and subcloning 3' fragments, growing the subcloned fragments as a library in E. coli, extracting the plasmids, converting the inserts to cRNA and then back to DNA, and then PCR amplifying. Both the above method and this one are based on the use of a multiplex PCR (i.e. specific primers each protruding a few bases into unknown sequence; those bases varied across multiple reactions; each such reaction analyzed separately on a gel or capillary) to split the reaction into enough parts to be able to separate most bands from each other.

The Digital Gene Technologies protocol achieves the objective of requiring a relatively small amount of starting material while still purifying 3' fragments, allowing a more stringent PCR. However, this is at the expense of a very elaborate and time-consuming protocol requiring subcloning, library production and re-purification of cDNA fragments from bacteria.

Various other techniques have been used in so-called DNA fingerprinting.

The profiles of gene expression in any given cell determine its life processes and thereby directly reflect the properties and functions of the cell alone or in a multicellular organism. A large scale analysis of the global expression pattern during development and in the adult in different tissues and cells provides expression atlases of all genes expressed in that cell/tissue. Such atlases provide important information on gene function and further our understanding of normal biological processes in organisms. They also provide information on what is necessary for driving cells to a particular fate (for example, the identification of all genes exclusively expressed during dopaminergic neuron specification and differentiation). They also provide a powerful tool for gene discovery.

It is generally believed that much disease behaviour is dictated by the altered expression of hundreds to thousands of genes and that global gene expression profiling will provide a powerful tool to characterise the disease behaviour and clinical consequences, responsiveness to different drugs, and predicted disease outcome. This has so far been proven correct for cancer (Alizadeh et al., 2000; Golub et al., 1999; Perou et al., 1999). A fast and cheap global gene expression profiling technique would provide the means to rapidly identify the critical diagnostic genes. Such information may thereafter be used for diagnosis by small scale analysis using, for instance, the real-time polymerase chain reaction (PCR).

Drugs are often identified in high throughput screens by selection for a single property or a few properties. Thus, a primary molecular target is identified but the full pathway as well as secondary targets of the drug are unknown. The other actions and consequences of the drug may be beneficial or harmful. The identification of the full biological pathway of action of drugs or drug candidates is therefore a problem of commercial and human importance. Global gene expression profiling would provide a fast and inexpensive approach to characterising drug activities and cellular pathways affected by drugs.

In the present invention, cDNA generated from mRNA in a sample is subjected to restriction enzyme digestion at one end, the other end being anchored to a solid support (such as beads, e.g. magnetic or plastic, or any other solid support that can be retained while washing, for instance by centrifugation or magnetism, or a microfabricated reaction chamber with sub-chambers for the subdivision procedure, where chemicals are washed through the chambers) by means of oligo dT at the 5' end of one strand complementary to polyA originally at the 3' end of the mRNA molecules. An adaptor is ligated to the free (digested) end of the cDNA molecules and PCR is performed using primers that anneal at the ends of the cDNA: one designed to anneal to the adaptor at the 3' end of one strand of the cDNA, the other containing oligo dT to anneal to polyA at the 3' end of the other strand of the cDNA (corresponding to the original polyA in the mRNA). For use with a Type II enzyme, each primer includes a variable nucleotide or sequence of nucleotides that will amplify a subset of cDNAs with complementary sequence, either adjacent to the adaptor for one strand or adjacent to the polyA for the other strand. For a Type IIS enzyme, adaptors are employed that will ligate with the possible different cohesive ends generated when the enzyme cuts the double-stranded DNA. Thus a population of adaptors may be employed to be complementary to all possible cohesive ends within the population of DNA after cutting/digestion by the Type IIS enzyme. Primers are used in the PCR that anneal with the adaptors.

Primers may be labelled, and the labels may correspond to the relevant A, T, C or G nucleotide at a corresponding position in the relevant primer variable region. This means that double-stranded DNA produced in the PCR is labelled, and that the combination of the label and the length of the product DNA provides a characteristic signal. Otherwise, the combination of length of the product and (i) PCR primer used for a Type II enzyme digest or (ii) adaptor used for a Type IIS digest, provides a characteristic signal.

For any sample containing a population of mRNA molecules (e.g. from a cell or in vitro expression system), a pattern can be produced that is characteristic of the sample. Patterns generated from different cells or the same cells under different conditions or stages of differentiation or cell cycle, or transformed (tumorigenic) cells and normal cells, can be compared and differences in the pattern identified. This allows for identification of sequences whose expression is involved in cellular processes that differ between cells or in the same cells under different conditions or stages of differentiation or cell cycle or between normal and tumorigenic cells.

A pattern of signals generated for a sample, or one or more individual signals identified as differing between samples, may be compared with a pattern generated from a database of known sequences to identify sequences of interest.

According to one aspect of the present invention there is provided a method according to claim 1.

The restriction enzyme is generally selected such that one obtains a size distribution which can be readily separated and length-determined with the fragment analysis method employed. The distribution of fragments obtained when cutting and holding on to the 3' end (by means of oligo dT on a solid support) is proportional to 1/x where x is the length. The scale of the distribution depends on the probability of cutting. If an enzyme cuts once in 4096 bp (six base pair recognition sequence), the distribution will extend too far for current capillary electrophoresis methods. 1/1024 or 1/512 is preferred. HaeII cuts 1/1024 because of its degenerate recognition motif. FokI cuts 1/512 because it recognizes five base pairs in either forward or reverse directions. A 4 bp cutter cuts 1/256, which creates an overly compressed distribution where doublets are more likely to occur. Thus enzymes like HaeII and FokI are preferred.
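The cutting frequencies quoted above can be reproduced with a short calculation. The following is a minimal sketch, assuming random sequence with equal base composition; the recognition motifs used (RGCGCY for HaeII, GGATG for FokI, GAATTC and GATC as generic 6 bp and 4 bp palindromes) are standard published motifs and are not taken from this text, and the function name is illustrative only.

```python
from fractions import Fraction

# Number of concrete bases matched by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

# Complement of each IUPAC code, used to test whether a motif is palindromic.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    return "".join(COMPLEMENT[base] for base in reversed(motif))


def site_frequency(motif):
    """Expected recognition sites per base pair (assumption: equal base composition)."""
    p = Fraction(1)
    for base in motif:
        p *= Fraction(IUPAC_DEGENERACY[base], 4)
    # A non-palindromic motif is also recognised on the opposite strand,
    # which doubles the expected frequency (overlaps are ignored here).
    if reverse_complement(motif) != motif:
        p *= 2
    return p


for name, motif in [("6 bp cutter", "GAATTC"), ("HaeII", "RGCGCY"),
                    ("FokI", "GGATG"), ("4 bp cutter", "GATC")]:
    print(name, motif, site_frequency(motif))
```

Under these assumptions the degenerate six-base HaeII motif behaves like a five-base cutter, and the asymmetric five-base FokI motif gains a factor of two from being read on either strand.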

Thus a restriction enzyme employed in preferred embodiments may cut double-stranded DNA with a frequency of cutting of 1/256 - 1/4096 bp, preferably 1/512 or 1/1024 bp.

Where the restriction enzyme is a Type II restriction enzyme, it is preferred to use HaeII, ApoI, XhoII or Hsp92I. Where the restriction enzyme is a Type IIS restriction enzyme, it is preferred to use FokI, BbvI or Alw26I. Other suitable enzymes are identified by REBASE (rebase.neb.com).

Preferably, the restriction enzyme digests double-stranded DNA to provide a cohesive end of 2-4 nucleotides. For a Type IIS restriction enzyme a cohesive end of 4 nucleotides is preferred. As discussed, more information can be obtained by generating an additional pattern for the sample using a second, or second and third, different Type II or Type IIS restriction enzyme or enzymes.

In each first primer used for PCR following digestion with a Type II enzyme, there may be a single variable nucleotide, or a variable nucleotide sequence of more than one nucleotide, e.g. two or three. At each position in a variable sequence, first primers may be provided such that each of A, C, G and T is represented in the population.

In each second primer (comprising oligo dT), n may be 0, 1 or 2. No variable nucleotide is needed in the primers used for PCR where a Type IIS restriction enzyme is employed because variability in the adaptor sequence is provided by the cohesive end. Generally, where a Type IIS restriction enzyme is employed a population of adaptors is provided such that all possible cohesive ends for the restriction enzyme are represented in the population, and each adaptor may be ligated to a fraction of the sample in a separate reaction vessel. The adaptor used in each reaction vessel will then be known and combination of this information with the length of double-stranded product DNA molecules provides the desired characteristic pattern.

For convenience, multiple adaptors may be combined in a single reaction vessel, in which case each different adaptor in a given vessel (with a different end sequence complementary to a cohesive end within the population of possible cohesive ends provided by the Type IIS restriction enzyme digestion) comprises a different primer annealing sequence. For instance three different adaptors may be combined in one reaction vessel. Corresponding first primers are then employed, and these may be labelled to distinguish between products arising from the respective different adaptor oligonucleotides.

Where a Type II enzyme is used, the first primers may be labelled, although where individual polymerase chain reaction amplifications are performed in separate reaction vessels there is already knowledge of which first primer is used. Otherwise, labelling provides convenient information on which first primer sequence is providing which double-stranded DNA product molecule. Conveniently, three different first primer PCR amplifications can be performed in each reaction vessel, with each first primer being labelled appropriately (optionally with employment of a labelled size marker).

Separation may employ capillary or gel electrophoresis. A single label may be employed per reaction, with four dyes per capillary or lane, one of which may carry a size marker.

As discussed elsewhere, a first pattern characteristic of a population of mRNA molecules present in a first sample may be compared with a second pattern characteristic of a population of mRNA molecules present in a second sample. A difference may be identified between said first pattern and said second pattern, and a nucleic acid whose expression leads to the difference between said first pattern and said second pattern may be identified and/or obtained.
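The comparison of two patterns can be illustrated with a small sketch. It assumes, purely for illustration, that each pattern is stored as a mapping from a (sub-reaction, fragment length) key to a peak intensity; the fold-change threshold, the noise floor and all names below are hypothetical choices and are not specified in this text.

```python
def differing_signals(first, second, min_fold=2.0, floor=1.0):
    """Return signals whose intensity differs by at least min_fold between two patterns.

    first, second: dicts mapping (sub_reaction, length) -> peak intensity.
    floor: assumed noise floor, so that absent peaks do not cause division by zero.
    """
    changed = {}
    for key in set(first) | set(second):
        a = max(first.get(key, 0.0), floor)
        b = max(second.get(key, 0.0), floor)
        fold = b / a
        if fold >= min_fold or fold <= 1.0 / min_fold:
            changed[key] = fold
    return changed


# Hypothetical example patterns for two samples.
control = {("TG/AGG", 210): 100.0, ("CA/GTC", 355): 50.0}
treated = {("TG/AGG", 210): 105.0, ("CA/GTC", 355): 200.0, ("CC/CAT", 120): 40.0}
print(differing_signals(control, treated))
```

Signals flagged in this way would then be candidates for identification against a database of known sequences, or for isolation of the corresponding nucleic acid.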

As a supplement or alternative, a signal provided for a double-stranded product DNA by combination of its length and first primer or adaptor oligonucleotide used may be compared with a database of signals for known expressed mRNAs. A known expressed mRNA in the sample may be identified.

Precautions and optimising steps can be taken by the ordinary skilled person in accordance with common practice. For example, in the restriction enzyme mix, calf intestinal phosphatase may be included to dephosphorylate the cohesive ends to prevent self-ligation in the next step.

Labels may conveniently be fluorescent dyes, allowing for the relevant signals (e.g. on a gel) following electrophoresis to separate double-stranded product DNA molecules on the basis of their length to be read using a normal sequencing machine.

A library of 3' end cDNA fragments can be prepared on a solid support, where each transcript is represented by a unique fragment. The library can be displayed on a capillary electrophoresis machine after PCR amplification with fluorescent primers. In order to reduce the number of bands in each electropherogram, the initial library may be subdivided, e.g. using one of the following two methods.

For libraries generated with an ordinary Type II enzyme, an adapter is ligated to the cohesive end of each fragment. The adaptor comprises a portion complementary to the cohesive end generated by the restriction enzyme and a portion to which a primer anneals. One primer annealing sequence may be used, or a small number, e.g. 2 or 3, of different sequences showing minimal cross-hybridisation, to allow that small number of independent reactions to proceed in a single reaction vessel. The library is then split into a number of different reaction vessels and a subset of the fragments in each vessel is PCR amplified using primers compatible with the 3' (oligo-T) and 5' (universal adapter) ends carrying a few extra bases protruding into unknown sequence. Thus in each reaction a different combination of protruding bases causes selective amplification of a subset of the fragments.

For libraries generated by Type IIS enzymes - which cleave outside their recognition sequence giving a gene-specific cohesive end - the library is split into a number of different reaction vessels. A set of adapters is designed containing a universal invariant part and a variable cohesive end such that all possible cohesive ends are represented in the set. In each reaction vessel a single such adapter is ligated. The subset of fragments in each vessel carrying adapters is then amplified with universal high-stringency primers.

In both methods, the resulting reactions may be run separately on a capillary electrophoresis machine which quantifies the fragment length and abundance, indicating the relative abundances of the corresponding mRNAs in the original sample.

For each fragment, the following are known:
- the restriction enzyme site used to generate it (e.g. 4-8 bases);
- its length;
- its sub-reaction (given by the subdivision method, but generally corresponding to an additional 4-6 bases).

If the subdivision is done judiciously, enough information is generated to identify each fragment with known sequences from a database. This may be performed by selecting a combination of fragment length distribution (given by the enzyme) and subdivision (given by the protruding bases and/or by the cohesive end (Type IIS)). As few as two bases (16 sub-reactions) or as many as 8 (65536 sub-reactions) can be used; if a small genome is being analyzed, a small number of sub-reactions may be enough; if a high-throughput analysis method is available, a large number of sub-reactions allows the separation of very large numbers of genes. In practice, between four and six bases are usually used.
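The relationship between the number of selective bases and the number of sub-reactions is simply four to the power of the number of bases. A minimal sketch of that arithmetic, assuming every selective position may take any of the four bases (the function name is illustrative only):

```python
def sub_reactions(selective_bases):
    """Number of sub-reactions when each selective base can be A, C, G or T."""
    return 4 ** selective_bases


for n in (2, 4, 6, 8):
    print(n, "bases ->", sub_reactions(n), "sub-reactions")
```

Where a position is restricted to fewer than four bases (for example a wobble position in the recognition site), the count is correspondingly smaller than this upper bound.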

Brief Description of the Figures

Figure 1 outlines an approach according to one embodiment of the present invention employing a Type II restriction enzyme (HaeII).

Figure 2 outlines an approach according to another embodiment of the present invention employing a Type IIS restriction enzyme (FokI).

EXAMPLE 1

Method I, using PCR primers with one or more bases protruding into unknown sequence to generate subsets (frames)

RNA was purified according to standard techniques. The RNA was denatured at 65 °C for 10 minutes, added to Oligotex beads (Qiagen) and annealed to the oligo dT template covalently bound to the beads. A first strand cDNA synthesis was carried out using the mRNA attached to the Oligotex beads as template. This first strand cDNA therefore becomes covalently attached to the Oligotex beads (Hara et al. (1991) Nucleic Acids Res. 19, 7097). Second strand synthesis was performed as described in Hara et al. above. Briefly, the first strand was synthesized by reverse transcriptase (RT) from mRNA primed with oligo-dT. The second strand was produced by an RNase, which cleaves the mRNA, and a DNA polymerase, which primes off small RNA fragments left by the RNase, displacing other RNA fragments as it goes along. The double stranded cDNA attached to the Oligotex beads was purified and restriction digested with HaeII. Alternative enzymes include ApoI, XhoII and Hsp92I (Type II) and FokI, BbvI and Alw26I (Type IIS). The cDNA was again purified, retaining the fraction of cDNA attached to the Oligotex.

An adaptor was ligated to the HaeII site of the cDNA. The adaptor contained sequences complementary to the HaeII site and extra nucleotides to provide a universal template for PCR of all cDNAs. The cDNA was then again purified to remove salt, protein and unligated adaptors.

The cDNA was divided into 96 equal pools in a 96 well dish. In order to PCR amplify only a subset of the purified fragments in each well, a multiplex PCR was designed as follows.

The 5' primers were complementary to the universal template but extended two bases into the unknown sequence. The first of these bases was either thymine or cytosine, corresponding to a wobbling base in the HaeII site, while the second was any of guanine, cytosine, thymine or adenine. Each 5' primer was coupled by a carbon spacer to a fluorochrome detectable by the ABI Prism capillary sequencer. The fluorochrome was matched to the second base. Each well received four primers with all four fluorochromes (and hence all four second bases); half of the wells received primers with a thymine first base, half with a cytosine first base.

The 3' primers were oligo dT and therefore complementary to the polyadenylation sequence of the original mRNA. Each primer was designed with three bases extending into unknown sequence, the first of which was either guanine, adenine or cytosine, while the other two were any of the four bases. Each well received a single 3' primer. Thus, the PCR reaction was multiplexed into 384 sub-reactions: 96 wells with four fluorochrome channels in each. A standard PCR reaction mix was added, including buffer, nucleotides and polymerase. The PCR was run on a Peltier thermal cycler (PTC-200). Each primer pair used in this experiment recognises and amplifies only genes containing the unique nucleotide combination of that primer pair. The size of the PCR fragment of each of these genes corresponds to the length between the polyadenylation sequence and the closest HaeII site.
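The plate layout described above can be checked by enumerating the selective bases. This is a sketch of the counting only, assuming the primer design as stated (two 5' first bases, four 5' second bases read as fluorochrome channels, and 3 x 4 x 4 selective triplets on the 3' primer); the variable names are illustrative and no actual primer sequences are implied.

```python
from itertools import product

BASES = "ACGT"
FIVE_PRIME_FIRST = "TC"    # wobble base of the HaeII site
FIVE_PRIME_SECOND = BASES  # one fluorochrome channel per second base
THREE_PRIME_FIRST = "GAC"  # first selective base after oligo dT (not T)

# 3' primers: three selective bases, 3 x 4 x 4 = 48 variants.
three_prime_primers = [
    first + "".join(rest)
    for first in THREE_PRIME_FIRST
    for rest in product(BASES, repeat=2)
]

# Each well holds one 5' first base and one 3' primer: 2 x 48 = 96 wells.
wells = [(first5, p3) for first5 in FIVE_PRIME_FIRST for p3 in three_prime_primers]

# Each well is read in four fluorochrome channels: 96 x 4 = 384 frames.
frames = [
    (first5 + second5, p3)
    for first5, p3 in wells
    for second5 in FIVE_PRIME_SECOND
]

print(len(three_prime_primers), len(wells), len(frames))
```

The count of 96 wells matches a standard 96 well dish, and each frame corresponds to one (well, fluorochrome) combination.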

The resulting PCR products were isopropanol precipitated and loaded onto an ABI Prism capillary sequencer. The PCR fragments representing the expressed genes were thus separated according to size and the fluorescence of each fragment quantitated using the detector and software supplied with the ABI Prism.

The combination of primers used leads to a theoretical mean of ~70 PCR products in each fluorescent channel and sample (based on 20% of genes being expressed in a given sample and a total of 140,000 genes). Analysis of the statistical size distribution of 3' fragments including the polyadenylation sequence, generated from known genes following HaeII restriction digestion, showed that an estimated 80% can be uniquely identified based on frame and length of fragment alone. The ABI Prism has 0.5% resolution between 1 and 2,000 nucleotides. Allowing for this uncertainty, ~60% of the expressed genes can be uniquely identified. An additional parallel experiment using the same protocol but replacing the HaeII enzyme with another 5 base cutting restriction enzyme increases the theoretical limit to ~96% and the practical limit (given the resolution of the ABI Prism) to ~85% of all transcripts in the genome.

The level of each mRNA in the sample corresponds to the signal strength in the ABI Prism. Combining the information unique to each fragment in this analysis, i.e. 8.5 nucleotides (including the HaeII recognition sequence) and the size from the polyadenylation sequence to the HaeII restriction site, the identity (EST, gene or mRNA identity) of each mRNA can thus be established. A searchable database of all known genes and UniGene EST clusters was constructed as follows.

Unigene, a public database containing clusters of partially homologous fragments, was downloaded (although the algorithm will work with any set of single or clustered fragments). For each cluster, all fragments containing a polyA signal and a polyA sequence were scanned for an upstream HaeII site. If no HaeII site was found, then the fragments were extended towards 5' using sequences from the same cluster until a HaeII site was found. Then, the frame was determined from the base pairs adjacent to the HaeII and the polyA sequences and the length of a HaeII digest was calculated. The frame and length were used as indexes in the database for quick retrieval.
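The indexing step can be illustrated with a short sketch. It is not the program used in the study: the function names, the minimum tail length, and the exact definition of the frame (two bases after the HaeII site and three bases before the polyA tail) are assumptions made for illustration. The HaeII recognition sequence is taken as RGCGCY, and the extension of fragments using other cluster members is omitted.

```python
import re
from collections import defaultdict

HAEII = re.compile(r"[AG]GCGC[CT]")  # HaeII recognition sequence RGCGCY


def index_entry(seq, min_tail=10):
    """Return ((frame_5p, frame_3p), length) for one 3' sequence, or None.

    The frame definition and min_tail are illustrative assumptions.
    """
    tail = re.search(r"A{%d,}$" % min_tail, seq)
    if tail is None:
        return None                      # no polyA sequence at the 3' end
    body = seq[:tail.start()]
    sites = list(HAEII.finditer(body))
    if not sites:
        return None                      # would need extension from the cluster
    site = sites[-1]                     # HaeII site closest to the polyA tail
    frame_5p = body[site.end():site.end() + 2]
    frame_3p = body[-3:]
    length = len(body) - site.start()    # site start to beginning of polyA
    return (frame_5p, frame_3p), length


def build_index(sequences):
    """Map (frame, length) to the ids of all sequences giving that fragment."""
    index = defaultdict(list)
    for gene_id, seq in sequences.items():
        entry = index_entry(seq)
        if entry is not None:
            index[entry].append(gene_id)
    return index


example = {"geneA": "TTTT" + "AGCGCT" + "ACGTACGTAC" + "GGC" + "A" * 15}
index = build_index(example)
print(dict(index))
```

A measured peak is then looked up by its frame and length; a key that maps to more than one id corresponds to a fragment that cannot be assigned to a single gene from one experiment alone.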

The output from the ABI Prism was run against the database, thus allowing the identification, and determination of the expression level, of all known genes and ESTs expressed in the RNA of this study. The identification in a cell or tissue of virtually all genes expressed, as well as quantification of their expression levels, was accomplished by a simple double strand cDNA reaction and a 3 hour run on a 96 capillary sequencer.

EXAMPLE 2

Ligation of multiple adapters to cohesive ends generated by a Type IIS enzyme to generate subsets (frames), followed by PCR with universal primers

In another set of experiments the method was simplified and an increased resolution was achieved. cDNA was synthesized on solid support as described in Example 1. The cDNA was then cleaved with a class IIS endonuclease with a recognition sequence of 4 or 5 nucleotides.

Class IIS restriction endonucleases cleave double-stranded DNA at precise distances from their recognition sequences (at 9 and 13 nucleotides from the recognition sequence in the example of the class IIS restriction endonuclease FokI). Other examples of class IIS restriction endonucleases include BbvI, SfaNI and Alw26I and others described in Szybalski et al. (1991) Gene, 100, 13-26. The 3' parts of the cDNA were then purified using the solid support as described above. The cDNA was then divided into 256 fractions and a different adaptor was ligated to the fragments in each fraction.

For example, FokI cleavage leads to a four-nucleotide 5' overhang, with each overhang consisting of a gene-specific but arbitrary combination of bases. One adaptor carrying a single possible nucleotide combination in these four positions was used in each fraction, i.e. a total of 256 adapters and fractions. Again taking advantage of the solid support, the cDNA was then purified to remove excess non-ligated adaptor. PCR was performed on the 256 fractions using one universal primer complementary to the constant part of the adapter sequence and one complementary to the poly-A tail.
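The 256 frames follow directly from the four variable positions of the overhang. A minimal sketch, in which the ordering of the overhangs and the resulting fraction numbers are arbitrary illustrative choices rather than anything specified in the protocol:

```python
from itertools import product

BASES = "ACGT"

# Every possible four-base overhang, in lexicographic order (4**4 = 256).
OVERHANGS = ["".join(p) for p in product(BASES, repeat=4)]

# Hypothetical numbering of the fractions: one fraction per overhang.
FRACTION_OF = {overhang: i for i, overhang in enumerate(OVERHANGS)}

print(len(OVERHANGS))
print(FRACTION_OF["AACG"])
```

Each fragment is amplified only in the fraction whose adaptor matches its overhang, so the fraction number serves as the frame for that fragment.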

The 3' primers were oligo dT and therefore complementary to the polyadenylation sequence of the original mRNA. Each primer was designed with a base extending into unknown sequence: guanine, adenine or cytosine. (A second or still further base may be included, being any of guanine, adenine, thymine or cytosine.) Each well received a mixture of the three possible 3' primers. This ensured that the 3' primer would always direct the polymerase to the beginning of the poly-A tail, giving a defined and reproducible fragment length.

The advantage of this second protocol is that the splitting into multiple frames occurs at the ligation step, not the PCR, allowing the use of high-stringency universal primers in the PCR. This leads to improved specificity and reproducibility. Another advantage is that a set of 256 adapters compatible with any 4-base overhang can be reused in multiple experiments with Type IIS enzymes which recognize different sequences but still give four base overhangs. Thus for each length of overhang, a single set of adapters will suffice.

The resulting PCR products were purified and loaded onto an ABI Prism capillary sequencer. The PCR fragments representing the expressed genes were thus separated according to size, and the fluorescence of each fragment was quantified using the detector and software supplied with the ABI Prism.

Four separate frames may be run in each reaction vessel using different fluorophores because the ABI Prism has four detection channels. Four different universal forward primers (5' end) have been designed with no cross-hybridization between them. The use of these primers allowed the 256 reactions to be reduced to 64. In an alternative embodiment, three primers and three adaptors are employed, allowing for one channel in the ABI Prism to be used for a size reference. The total number of reactions is then 86.
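The reaction counts quoted above follow from dividing the 256 adapters among the available dye channels. A minimal sketch, with illustrative names and an arbitrary grouping of adapter numbers into vessels:

```python
import math

N_ADAPTERS = 4 ** 4  # one adapter per four-base overhang


def vessels_needed(sample_channels):
    """Reaction vessels required when each vessel holds `sample_channels` frames."""
    return math.ceil(N_ADAPTERS / sample_channels)


def vessel_layout(sample_channels):
    """Group adapter numbers 0..255 into vessels (grouping is illustrative)."""
    return [
        list(range(start, min(start + sample_channels, N_ADAPTERS)))
        for start in range(0, N_ADAPTERS, sample_channels)
    ]


print(vessels_needed(4))  # all four channels carry sample frames
print(vessels_needed(3))  # one channel reserved for a size reference
```

With three sample channels, 256 is not divisible by three, so the last of the 86 vessels holds a single frame.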

It is also desirable to increase the annealing temperature of the oligo-dT primer. This was enabled by adding a tail with an arbitrary sequence (not cross-hybridizing with any of the forward primers) and mixing the long primer containing oligo-dT with a short primer identical with the arbitrary sequence and having a high melting point. The first few cycles were then performed at low temperature, at which only the oligo-dT primers anneal, after which all fragments had the tail added. This then allowed for subsequent cycles to be performed at higher temperature (at which only the short primer anneals), relying on the longer tail being present. This approach increases specificity of PCR and reduces background.

The combination of primers used leads to a theoretical mean of ~80 PCR products in each fluorescent channel and sample (based on 20% of genes being expressed in a given sample and a total of 100 000 genes). Analysis of the statistical size distribution of 3' fragments including the polyadenylation sequence, generated from known genes following FokI restriction digestion, indicates that an estimated 67% can be uniquely identified based on frame and length of fragment alone. A parallel experiment using the same protocol but replacing the FokI enzyme with another 5 base cutting class IIS restriction enzyme increases the theoretical limit to ~89%; a third experiment yields ~99% of all transcripts in the genome.

These numbers are underestimates, since in practice a gene that runs as a doublet in two experiments can still be identified as unique if at least one of its doublet partners is not expressed (a 96% chance). This and similar effects have been disregarded in the above calculations.

Combining the information unique to each fragment in this analysis, i.e. 9 nucleotides (including the FokI recognition sequence and cleavage site) and the size from polyadenylation to the FokI restriction site obtained from the capillary sequencer, the identity (EST, gene or mRNA identity) of each mRNA can thus be established. A searchable database of all known genes and Unigene EST clusters was constructed as described above.

Fragment identification

Following construction of a database of calculated fragment lengths for known gene sequences, using a particular restriction enzyme, fragments found to be present in an experimental sample can be identified as follows.

If a single experiment was performed, the fragments obtained can simply be looked up in a database of calculated fragment lengths. However, depending on the number of subdivisions of the reaction there may be cases where a fragment cannot be uniquely identified. This can be alleviated by adding more subdivisions (more protruding bases for example or longer cohesive ends, if possible, in the case of Type IIS enzymes).

Alternatively a second experiment can be performed with a different enzyme.

When two or more experiments have been performed the data analysis can be much improved. First, the more experiments are performed the likelier it is that a given gene runs as a singlet fragment in at least one of them and can thus be unambiguously identified. Second, even if a given gene runs as a doublet in all experiments, it can still be identified if one of its doublet partners in one of the experiments should run as a singlet in another experiment and is absent there.

For example, if there is a fragment in experiment I at 162 bp corresponding to genes A and B, and one in experiment II at 367 bp corresponding to A and C, then one can look up C in experiment I (if it should run as a singlet there, say at 214 bp, and it is absent, i.e. there is no peak at 214 bp, then the peak at 162 bp in I can be identified as A) and B in experiment II. This simple procedure greatly increases the number of genes which can be unambiguously identified even when only two experiments have been performed.
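The cross-experiment look-up in this example can be written as a short routine. The sketch below is illustrative only: it assumes error-free fragment lengths and that every gene yields a detectable fragment in every experiment, and the predicted length of gene B in experiment II (290 bp) is a hypothetical value added to complete the example.

```python
def identify(predicted, observed):
    """Resolve peaks using absences seen in other experiments.

    predicted: {experiment: {gene: predicted fragment length}}
    observed:  {experiment: set of observed peak lengths}
    Returns {(experiment, length): set of remaining candidate genes}.
    """
    # A gene is definitely unexpressed if its predicted peak is missing
    # in at least one experiment.
    absent = {
        gene
        for exp, genes in predicted.items()
        for gene, length in genes.items()
        if length not in observed[exp]
    }
    calls = {}
    for exp, genes in predicted.items():
        for peak in observed[exp]:
            candidates = {g for g, length in genes.items() if length == peak}
            calls[(exp, peak)] = candidates - absent
    return calls


predicted = {
    "I": {"A": 162, "B": 162, "C": 214},
    "II": {"A": 367, "C": 367, "B": 290},  # 290 bp for B is hypothetical
}
observed = {"I": {162}, "II": {367}}

calls = identify(predicted, observed)
print(calls)
```

Here both peaks resolve to gene A alone: C is excluded by the missing 214 bp peak in experiment I, and B by the missing 290 bp peak in experiment II.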

Computer simulations using estimated error rates from an ABI Prism capillary electrophoresis machine indicate that 85-99% of all genes can be correctly identified even in the presence of normal fragment length errors.

The procedure can be systematically performed by a computer program, as follows.

1. All the genes in the database which correspond to a fragment in each experiment are listed. This forms a list of possibly expressed genes for each experiment.

2. Then for each experiment, the genes which definitely do not correspond to a fragment are listed (i.e. those which should give a fragment of a length which was not found in the experiment). This forms a list of definitely unexpressed genes for each experiment.

3. The unexpressed genes in each experiment are then removed from the list of possibly expressed genes in each other experiment.

4. The result is a list for each experiment where in most cases each fragment retains a single candidate gene identification.

An alternative method which may be especially suitable when all or most genes in an organism have been identified is as follows:

1. All the genes in the database which correspond to a fragment in each experiment are listed. This forms a list of possibly expressed genes for each experiment. For each fragment in each experiment an equation is written of the form Fi = m1 + m2 + m3, where 1, 2, 3 etc are the ids of the genes and Fi is the intensity of the signal from the fragment. Each gene which may correspond to a fragment peak in the electrophoresis appears as a term on the right-hand side.

For example, if a peak at 162 bp corresponds to genes 234, 647 and 78 in the database, and it has intensity 2546, then the corresponding equation is written: 2546 = m234 + m647 + m78

2. Then for each experiment, the genes which definitely do not correspond to a fragment are listed (i.e. those which should give a fragment of a length which was not found in the experiment). This forms a list of definitely unexpressed genes for each experiment. For each gene on that list, an equation is written of the form: 0 = m657, where 657 is the gene id, as above.

3. A system of simultaneous equations is thus obtained with m (= the number of genes in the organism) unknowns and n <= km equations (where k is the number of experiments). If all genes run as singlets in all experiments then n = km because each gene will appear in its own equation. The more they run as doublets or multiplets the smaller n will be. As long as n > m, however, the system is over-determined and can thus be solved using standard numerical methods to find a least-squares solution. For example, the backslash operator in MATLAB can be used.

4. The least-squares solution of the system gives for each gene the best approximation of its expression level. The more experiments that are performed, the better the approximation will be. Errors can be estimated by computing residuals (that is, by inserting the estimated gene activities in the equations to obtain calculated peak intensities and comparing those to the measured intensities). Simulations show that a system of 100 000 equations in 50 000 unknowns can be solved in 16 hours on a regular PC.
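A toy version of the equation system can be solved as follows. This sketch stands in for the MATLAB backslash operator mentioned above: it uses the normal equations with Gaussian elimination in plain Python, which is adequate for a three-gene illustration but not for the large sparse systems discussed in the text. The gene ids, peak intensities and function names are hypothetical, and the matrix is assumed to have full column rank.

```python
def least_squares(rows, rhs):
    """Least-squares solution of rows * x = rhs via the normal equations."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(n)]
    aug = [ata[i] + [atb[i]] for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


# Unknowns: expression levels m1, m2, m3 of three hypothetical genes.
rows = [
    [1, 1, 0],  # experiment I: one peak shared by genes 1 and 2
    [0, 0, 1],  # experiment I: predicted peak of gene 3 absent, so 0 = m3
    [1, 0, 1],  # experiment II: one peak shared by genes 1 and 3
    [0, 1, 0],  # experiment II: singlet peak of gene 2
]
intensities = [500.0, 0.0, 300.0, 200.0]

levels = least_squares(rows, intensities)
residuals = [
    sum(a * m for a, m in zip(row, levels)) - f
    for row, f in zip(rows, intensities)
]
print(levels)
print(residuals)
```

The residuals correspond to the error estimate described in step 4: each is the difference between a peak intensity calculated from the estimated expression levels and the measured intensity.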

The optimum number of subdivisions can be determined.

The purpose of subdividing the reaction is to reduce the number of fragment peaks which correspond to multiple genes.

Two factors determine the number of doublets: the number of sub-reactions and the size distribution of fragments.

The optimal size distribution depends on the detection method. Capillary electrophoresis has single-basepair resolution up to 500 bp and about 0.15% resolution after that. Thus a distribution extending too far would not be useful. But a narrow distribution may present difficulties as well, because then genes will begin to run as true doublets (with the exact same length) which cannot be resolved no matter what the resolution.

The probability of finding a fragment of length n if you cut with an enzyme which cuts with a probability 1/512 is

$$P_1(n) = \left(\frac{511}{512}\right)^{n} \frac{1}{512}$$

If the reaction is divided in 192 sub-reactions, the probability of finding a fragment of length n in a given subreaction is

$$P_2(n) = \left(\frac{511}{512}\right)^{n} \frac{1}{512} \cdot \frac{1}{192}$$

The probability of this fragment corresponding to a single gene from M possible genes is

$$P_{\mathrm{unique}}(n) = P_2(n)\,\bigl(1 - P_2(n)\bigr)^{M-1}$$

In other words, this is the probability that one gene gives a fragment of that length and all others do not.

The total number of genes which can be uniquely identified in a single experiment can be obtained by summing over all detectable lengths.

Taking instrument imprecision into account, P_unique becomes

$$P_{\mathrm{unique}}(n) = P_2(n)\,\Bigl(\bigl(1 - P_2(n)\bigr)^{M-1}\Bigr)^{1 + 2En}$$

where E is the magnitude of the imprecision. This states that a unique gene can be identified if no other gene has the same length +/- a factor E.

For example, if there are 50 000 genes in the human, our instrument has an error of 0.2% and can detect fragments up to 1000 bp, and we cut with an enzyme which cuts 1/512 of all sequences, subdividing in 192 subreactions, then we can identify 56% of all genes uniquely in a single experiment, 80% in two and 96% in three.

In Mathematica, the number of uniquely identifiable genes can be calculated as follows:

    Prob[n_] := (511/512)^n * 1/512 * 1/192
    Sum[50000 * Prob[n] ((1 - Prob[n])^50000)^(1 + 0.002 n), {n, 1, 1000}] * 192

By varying the parameters one can quickly see the effects on identification probabilities.
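For readers without Mathematica, the same sum can be evaluated in Python. The sketch below mirrors the Mathematica expression term by term, including its use of the exponent 50000 (rather than M-1) and the factor 0.002 n; the function and parameter names are illustrative.

```python
def unique_fraction(genes=50_000, cut_prob=1 / 512, subreactions=192,
                    error=0.002, max_len=1000):
    """Fraction of genes expected to be uniquely identifiable in one experiment."""
    total = 0.0
    for n in range(1, max_len + 1):
        # Probability that a given gene yields a fragment of length n
        # in one particular sub-reaction.
        p = (1 - cut_prob) ** n * cut_prob / subreactions
        # No other gene may fall within the length uncertainty around n.
        total += genes * p * ((1 - p) ** genes) ** (1 + error * n)
    return total * subreactions / genes


single = unique_fraction()
print(f"{single:.2f}")                        # fraction unique in one experiment
print(f"{unique_fraction(error=0.0):.2f}")    # same, with a perfect instrument
```

Changing `subreactions`, `error` or `max_len` shows how the number of sub-reactions, the instrument precision and the detectable size range each affect the fraction of singlet genes.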

As noted above, if more experiments are performed, more powerful identification methods can be used, but they all benefit from an increased number of singleton genes.

DISCUSSION

Most microarrays (except Affymetrix) are based on hybridisation to spotted cDNAs on a glass or membrane surface. This requires cloning, amplification and spotting of the cDNA of each gene in the genome for a comparable analysis to what can be performed in under one day using embodiments of the present invention.

All microarrays require prior knowledge of each gene, such as the cloning and sequencing of cDNAs or an expressed sequence tag. Embodiments of the present invention allow identification and quantification of all genes expressed in the genome without any prior information on their existence.

The Affymetrix microarray, which at present allows quantification of expression of the largest number of genes in mammals, covers at most 32,000 genes. Embodiments of the present invention can be applied to all genes in the genome.

All microarray-based technologies are limited to the species the array is generated from and depend on the availability of sequence information for the species of interest. Embodiments of the present invention can be applied to all species from plants to mammals without any prior cDNA or DNA sequence information. Microarrays are often unable to differentiate between splice variants, and are always unable to detect rare alleles. Embodiments of the present invention allow for detection of the actual transcripts present in the sample.

All microarray-based technologies are based on indirect measurement of quantities following DNA hybridisation. Real copy numbers can be quantitated using the present invention. Hybridization-based technologies depend on the highly unpredictable and non-linear nature of hybridization kinetics; embodiments of the present invention employ the exponential, reproducible competitive polymerase chain reaction.

Because embodiments of the present invention are based on a kind of competitive PCR, i.e. all fragments in a reaction are amplified by the same primer pair (or a small number of very similar primer pairs), errors are minimized. The invention allows the skilled worker to reproducibly detect about 2-fold differences in gene expression across a wide dynamic range (about 2.5 orders of magnitude); very competitive with other technologies.

Because embodiments of the present invention are PCR-based, sensitivity can be traded for starting material. In other words, it is possible to start with a smaller amount of RNA and run a few extra PCR cycles. Because PCR is exponential, an extra cycle will cut the material requirement in half while adding only about 2-3% to the experimental variation. Useful data can thus be produced from as little as a few or even single cells, while accuracy can be increased using larger samples.

Microarray technology allowing quantification of gene expression of a significant percentage of the genes is very expensive. Affymetrix microarrays covering a claimed 32,000 unique ESTs cost 4000 USD/experiment. Estimated cost for performance of embodiments of the present invention is USD 60/experiment.

Aspects and embodiments of the present invention will now be illustrated with reference to the following experimentation. Further aspects and embodiments of the present invention will be apparent to those skilled in the art.

MATERIALS AND METHODS

Section 1 - employing Type II restriction enzyme

Isolating mRNA from total RNA

Isolate mRNA from 20 µg total RNA according to the Oligotex protocol until pure mRNA is bound to the beads and washed clean. Spin down and resuspend in 20 µl distilled water. The suspension should contain 0.5 mg Oligotex.

Split the reaction in 2x 10 µl. Heat denature at 70 C for 10 min, then chill quickly on ice. Synthesize first strand cDNA using each of the protocols below:

First strand cDNA synthesis using AMV

Add first-strand buffer: 5 µl 5x AMV buffer, 2.5 µl 10 mM dNTP, 2.5 µl 40 mM NaPyrophosphate, 0.5 µl RNase inhibitor, 2 µl AMV RT, 2.5 µl 5 mg/ml BSA.

Incubate at 42 C for 60 min. Total volume: 25 µl.

[Note: it may be better to run in 100 µl, to get a more dilute Oligotex suspension]

Second strand cDNA synthesis using AMV

Add 12.5 µl 10x AMV second-strand buffer (500 mM Tris pH 7.2, 900 mM KCl, 30 mM MgCl2, 30 mM DTT, 5 mg/ml BSA), 29 U E. coli DNA Polymerase I, 1 U RNase H to a final volume of 125 µl with dH2O. Incubate at 14 C for 2 hours.

Restriction enzyme cleavage and dephosphorylation

Spin down Oligotex/cDNA complexes and resuspend in 1.8 µl 10x FokI buffer, 16.2 µl H2O, 2 µl FokI, 1 u Calf Intestinal Phosphatase (included to dephosphorylate cohesive ends to prevent self-ligation in the next step).

Incubate at 37 C for 1 hour.

Spin down and remove supernatant for quality-control.

Phosphatase deactivation

Add 70 µl TE. Heat to 70 C for 10 minutes. Cool down to room temperature and leave for 10 minutes.

Ligation

Resuspend in 2 µl 10x ligation buffer, 100X adaptor, 2 µl ligase, H2O to 20 µl.

Incubate at RT for 2 hours.

Spin down and wash with 10 mM Tris (pH 7.6).

Primer and adaptor design

The adaptor is as follows (shown 5' to 3'). It consists of a long and a short strand which are complementary. The long strand has four extra bases complementary to the GCGC cohesive end generated by the HaeII enzyme cleavage.

5'-GTCCTCGATGTGCGC-3'
5'-ACATCGAGGAC-3'

The 5' primers are 5'-GTCCTCGATGTGCGCWN-3', where W is A or T and N is A, C, G or T. There are 8 different 5' primers, labelled with a fluorochrome corresponding to the last base.

The 3' primers are T25VNN, where V is A, G or C and N is A, G, C or T. That is, 25 thymines followed by three bases as shown. There are 48 different 3' primers.

All combinations of 3' and 5' primers are used, or 384 in total.

The 5' primers are pooled with respect to the last base (i.e. all four fluorochromes are run in the same reaction), giving a total of 96 reactions.
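The primer counts above can be checked by enumerating the primer sets. A minimal sketch, in which the variable names are illustrative and each reaction is represented as a pool of four 5' primers (sharing the same W base) together with a single 3' primer, as described in the text:

```python
from itertools import product

ADAPTOR_PRIMER = "GTCCTCGATGTGCGC"

# 5' primers: adaptor sequence + W (A or T) + N (any base).
five_prime = [ADAPTOR_PRIMER + w + n for w in "AT" for n in "ACGT"]

# 3' primers: 25 thymines + V (A, G or C) + two further bases.
three_prime = ["T" * 25 + v + n1 + n2
               for v in "AGC" for n1 in "ACGT" for n2 in "ACGT"]

# Every 5'/3' primer combination defines one frame.
combinations = list(product(five_prime, three_prime))

# Pooling the four 5' primers that differ only in the last (labelled) base
# leaves one reaction per W base and 3' primer.
reactions = [
    ([ADAPTOR_PRIMER + w + n for n in "ACGT"], tp)
    for w in "AT" for tp in three_prime
]

print(len(five_prime), len(three_prime), len(combinations), len(reactions))
```

The four primers in each pool carry different fluorochromes, so the four frames amplified in one well remain distinguishable on the sequencer.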

The primer combinations are predispensed into 96-well PCR plates.

PCR amplification

Resuspend in 768 µl PCR buffer (buffer, enzyme, dNTP), add 8 µl to each well of a premade primer-plate containing 2 µl primer-mix (four 5' primers and one 3' primer) per well.

Using hot-start touchdown PCR, amplify each fraction as follows:

Hot start: heat to 70 C, add Taq polymerase
10 cycles: 94 C 30 s; 60 C 30 s, reduced by 0.5 C each cycle; 72 C 1 min
25 cycles: 94 C 30 s; 55 C 30 s; 72 C 1 min
Finally: 72 C 5 min
Cool down to 4 C

The touchdown ramp annealing temperature may have to be adjusted up or down. The reaction should only proceed until the plateau phase has been reached; the 25 cycles may have to be adjusted. A rotating real-time PCR apparatus is preferred, to minimize temperature variation and to allow monitoring the plateau phase. With such a machine, Taq polymerase is loaded in the cap of each tube and the hot start is performed before the rotor is started, melting away the second strand from the Oligotex. When the rotor starts, the beads and the first strand are pelleted and Taq drops into the reaction mix at the same time.
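The annealing temperatures of the touchdown programme can be listed cycle by cycle, which is convenient when the ramp has to be adjusted. This is a minimal sketch; it assumes that the first touchdown cycle anneals at the full 60 C and that each later cycle is 0.5 C lower, and the function and parameter names are illustrative.

```python
def touchdown_schedule(start=60.0, step=0.5, touchdown_cycles=10,
                       final=55.0, final_cycles=25):
    """Annealing temperature (in C) for each PCR cycle, in order."""
    ramp = [start - step * i for i in range(touchdown_cycles)]
    return ramp + [final] * final_cycles


schedule = touchdown_schedule()
print(schedule[:10])   # touchdown phase
print(len(schedule))   # total number of cycles
```

Under this reading the last touchdown cycle anneals at 55.5 C, so the ramp leads directly into the 55 C cycles that follow.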

Quantification by capillary electrophoresis

Load the 96-well plate on an ABI Prism 3700 set up for fragment analysis with a long capillary and long run time. The output is a table of fragment length (in base pairs) and peak height/area for each peak detected.

Proceed to identification, e.g. as described above with reference to a database.

Section 2 - employing Type IIS restriction enzyme

Isolating mRNA from total RNA

Isolate mRNA from 20 µg total RNA according to the Oligotex protocol until pure mRNA is bound to the beads and washed clean. Spin down and resuspend in 20 µl distilled water. The suspension should contain 0.5 mg Oligotex.

Split the reaction in 2x 10 µl. Heat denature at 70 C for 10 min, then chill quickly on ice. Synthesize first strand cDNA using each of the protocols below:

First strand cDNA synthesis using AMV

Add first-strand buffer: 5 µl 5x AMV buffer, 2.5 µl 10 mM dNTP, 2.5 µl 40 mM NaPyrophosphate, 0.5 µl RNase inhibitor, 2 µl AMV RT, 2.5 µl 5 mg/ml BSA.

Incubate at 42 C for 60 min. Total volume: 25 µl.

Restriction enzyme cleavage and dephosphorylation

Spin down Oligotex/cDNA complexes and resuspend in 1.8 µl 10x FokI buffer, 16.2 µl H2O, 2 µl FokI, 1 u Calf Intestinal Phosphatase.

Incubate at 37 C for 1 hour.

Spin down and remove supernatant for quality-control.

Phosphatase deactivation

Ligation (in 86 separate vessels)

Spin down and remove supernatant. Resuspend in 0.2 µl 10x ligation buffer, 100X adaptor, 0.2 µl E. coli ligase, H2O to 2 µl.

Incubate at RT for 2 hours.

Primer design

The adaptors are as follows (shown 5' to 3'). Each pair is composed of a short and a long strand, which are complementary. The long strands have four nucleotides complementary to the cohesive ends generated by the FokI cleavage (a total of 4*4*4*4 = 256 possible adapters).

Labelled versions of the upper, shorter strands also serve as forward PCR primers.

5'-CCAAACCCGCTTATTCTCCGCAGTA-3'
5'-NNNNTACTGCGGAGAATAAGCGGGTTTGG-3'

5'-GTGCTCTGGTGCTACGCATTTACCG-3'
5'-NNNNCGGTAAATGCGTAGCACCAGAGCAC-3'

5'-CCGTGGCAATTAGTCGTCTAACGCT-3'
5'-NNNNAGCGTTAGACGACTAATTGCCACGG-3'
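The complementarity of each adaptor pair can be verified mechanically, which is a useful check when the sequences are retyped or reordered. A minimal sketch with illustrative function names; `concrete_adaptor` is a hypothetical helper that fills the NNNN positions with one specific four-base end.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# (short strand, long strand), both 5' to 3', as listed above.
ADAPTORS = [
    ("CCAAACCCGCTTATTCTCCGCAGTA", "NNNNTACTGCGGAGAATAAGCGGGTTTGG"),
    ("GTGCTCTGGTGCTACGCATTTACCG", "NNNNCGGTAAATGCGTAGCACCAGAGCAC"),
    ("CCGTGGCAATTAGTCGTCTAACGCT", "NNNNAGCGTTAGACGACTAATTGCCACGG"),
]


def pairs_with(short, long, overhang_len=4):
    """True if the long strand, after its variable end, pairs with the short strand."""
    return long[overhang_len:] == reverse_complement(short)


def concrete_adaptor(long, end):
    """Replace the leading N positions of a long strand with a specific end."""
    return end + long[len(end):]


print([pairs_with(short, long) for short, long in ADAPTORS])
print(concrete_adaptor(ADAPTORS[0][1], "ACGT"))
```

All three listed pairs pass this check: in each case the constant part of the long strand is the exact reverse complement of the short strand.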

The reverse primers are as follows:

5'-CTGGGTAGGTCCGATTTAGGCTTTTTTTTTTTTTTTTTTTTTV-3'
5'-CTGGGTAGGTCCGATTTAGGC-3'

where V = A, C or G, for a total of three long reverse primers.

Universal PCR

Add 18 µl PCR buffer (buffer, enzyme, dNTP, three universal adapter primers, anchored oligo-T primers).

Amplify each fraction as follows:

Hot start: heat, add Taq at 70 C (or use heat-activated Taq)
2 cycles: 94 C 30 s; 50 C 30 s; 72 C 1 min
25 cycles: 94 C 30 s; 61 C 30 s; 72 C 1 min
Finally: 72 C 5 min
Cool down to 4 C

A rotating real-time PCR apparatus is preferred, to minimize temperature variation and to allow monitoring the plateau phase. With such a machine, Taq polymerase is loaded in the cap of each tube and the hot start is performed before the rotor is started, melting away the second strand from the Oligotex. When the rotor starts, the beads and the first strand are pelleted and Taq drops into the reaction mix at the same time.

Quantification by capillary electrophoresis

Load the 96-well plate on an ABI Prism 3700 set up for fragment analysis with a long capillary and long run time. The output will be a table of fragment length (in base pairs) and peak height/area for each peak detected.


Claims

CLAIMS:

1. A method of providing a pattern characteristic of a population of mRNA molecules present in a sample, the method comprising:

immobilising mRNA molecules in the sample on a solid support by annealing of a polyA tail of each mRNA molecule to polyT oligonucleotides attached to the support;

synthesizing a cDNA strand complementary to each mRNA attached to the support using the mRNA as template, thereby providing a population of first cDNA strands attached to the support;

removing the mRNA;

synthesizing a second cDNA strand complementary to each first strand attached to the support, thereby providing a population of double-stranded cDNA molecules attached to the support;

digesting the double-stranded cDNA molecules attached to the support with a Type II or Type IIS restriction enzyme to provide a population of digested double-stranded cDNA molecules attached to the support, each digested double-stranded cDNA molecule having a cohesive end provided by the restriction enzyme digestion;

purifying the digested double-stranded cDNA molecules attached to the support by washing away material not attached to the support;

ligating a population of adaptor oligonucleotides to the cohesive end of each of the digested double-stranded cDNA molecules, the adaptor oligonucleotides each comprising an end sequence complementary to a cohesive end and a primer annealing sequence, thereby providing double-stranded template cDNA molecules each comprising a first strand and a second strand wherein the first strand of the double-stranded template cDNA molecules each comprise a 3' terminal adaptor oligonucleotide and the second strand of the double-stranded template cDNA molecules each comprise a 3' terminal polyA sequence;

purifying the double-stranded template cDNA molecules by washing away material not attached to the support;

performing polymerase chain reaction amplification on the double-stranded template cDNA molecules using a population of first primers and a population of second primers, wherein the first primers each comprise a sequence which anneals to a primer annealing sequence of an adaptor oligonucleotide; and

where the restriction enzyme is a Type II enzyme the first primers each comprise at least one 3' terminal variable nucleotide and optionally more than one 3' terminal variable nucleotides wherein the variable nucleotide is, or at a corresponding position within the variable nucleotides each first primer has, a nucleotide selected from A, T, C and G, whereby the population of first primers primes synthesis in the polymerase chain reaction of first strand product DNA molecules each of which is complementary to the first strand of a template cDNA molecule that comprises adjacent to the primer annealing sequence within the first strand of the template cDNA molecule a nucleotide or sequence of nucleotides complementary to the variable nucleotide or nucleotides of a first primer within the population of first primers; or

where the restriction enzyme is a Type IIS enzyme the first primers prime synthesis in the polymerase chain reaction of first strand product DNA molecules each of which is complementary to the first strand of a template cDNA molecule that comprises within the first strand of the template cDNA molecule a sequence of nucleotides complementary to an end sequence of an adaptor oligonucleotide in the population of adaptor oligonucleotides;

the second primers comprise an oligoT sequence and a 3' variable portion conforming to the following formula: (G/C/A)(X)n wherein X is any nucleotide, n is zero, at least one or more than one; whereby the population of second primers primes synthesis in the polymerase chain reaction of second strand product DNA molecules each of which is complementary to the second strand of a template cDNA molecule that comprises adjacent to polyA within the second strand of the template cDNA molecule a nucleotide or nucleotides complementary to the variable portion of a second primer within the population of second primers;

whereby the polymerase chain reaction amplification provides a population of double-stranded product DNA molecules each of which comprises a first strand product DNA molecule and a second strand product DNA molecule;

separating double-stranded product DNA molecules on the basis of length; and

detecting said double-stranded product DNA molecules;

whereby a pattern characteristic of the population of mRNA molecules present in the sample is provided by combination of length of said double-stranded product DNA molecules and (i) first primer variable nucleotide or nucleotides, where a Type II restriction enzyme is employed, or (ii) adaptor oligonucleotide end sequence, where a Type IIS restriction enzyme is employed.
2. A method according to claim 1 wherein the restriction enzyme cuts double-stranded DNA with a frequency of cutting of 1/256 - 1/4096 bp.
3. A method according to claim 2 wherein the frequency of cutting is 1/512 or 1/1024 bp.
4. A method according to any one of the preceding claims wherein the restriction enzyme is a Type II restriction enzyme.
5. A method according to claim 4 wherein the restriction enzyme digests double-stranded DNA to provide a cohesive end of 2-4 nucleotides.
6. A method according to claim 5 wherein the restriction enzyme is selected from the group consisting of HaeII, ApoI, XhoII and Hsp92I.
7. A method according to any one of claims 4 to 6 wherein the first primers each have one variable nucleotide.
8. A method according to any one of claims 4 to 6 wherein the first primers each have two variable nucleotides, each of which may be A, T, C or G.
9. A method according to any one of claims 4 to 6 wherein the first primers each have three variable nucleotides, each of which may be A, T, C or G.
10. A method according to any one of claims 4 to 9 wherein each first primer is labelled with a label to indicate which of A, T, C and G is said variable nucleotide or is present at said corresponding position within the variable nucleotides of the first primer.
11. A method according to any one of claims 1 to 3 wherein the restriction enzyme is a Type IIS restriction enzyme.
12. A method according to claim 11 wherein the restriction enzyme digests double-stranded DNA to provide a cohesive end of 2-4 nucleotides.
13. A method according to claim 12 wherein the restriction enzyme is selected from the group consisting of FokI, BbvI, SfaNI and Alw26I.
14. A method according to any one of claims 10 to 13 wherein adaptor oligonucleotides in the population of adaptor oligonucleotides are ligated to cohesive ends of digested double-stranded cDNA molecules in separate reaction vessels from different adaptor oligonucleotides with different end sequences.

<Desc/Clms Page number 48>
15. A method according to claim 14 wherein each reaction vessel contains a single adaptor oligonucleotide end sequence.
16. A method according to claim 14 wherein each reaction vessel contains multiple adaptor oligonucleotide end sequences, each adaptor oligonucleotide sequence in a reaction vessel comprising a different end sequence and primer annealing sequence from the end sequence and primer annealing sequence of other adaptor oligonucleotide sequences in the same reaction vessel, corresponding multiple first primers being employed in the polymerase chain reaction amplification in each reaction vessel.
17. A method according to any one of the preceding claims wherein n is 0.
18. A method according to any one of claims 1 to 16 wherein n is 1.
19. A method according to any one of claims 1 to 16 wherein n is 2.
20. A method according to any one of the preceding claims wherein first primers are labelled.
21. A method according to claim 20 wherein the labels are fluorescent dyes readable by a sequencing machine.
22. A method according to any one of the preceding claims wherein an additional pattern is generated for the sample using a second, different Type II or Type IIS restriction enzyme.
23. A method according to any one of claims 1 to 22 wherein double-stranded DNA molecules are separated on the basis of length by electrophoresis on a sequencing gel or capillary, and the pattern is generated as an electropherogram.
24. A method according to any one of the preceding claims wherein a first pattern characteristic of a population of mRNA molecules present in a first sample is compared with a second pattern characteristic of a population of mRNA molecules present in a second sample.
25. A method according to claim 24 wherein a difference is identified between said first pattern and said second pattern.
26. A method according to claim 25 wherein a nucleic acid whose expression leads to the difference between said first pattern and said second pattern is identified and/or obtained.
27. A method according to any one of claims 1 to 23 wherein a signal provided for a double-stranded product DNA by said combination of length of said double-stranded product DNA molecules and (i) first primer variable nucleotide or nucleotides, where a Type II restriction enzyme is employed, or (ii) adaptor oligonucleotide end sequence, where a Type IIS restriction enzyme is employed, is compared with a database of signals determined or predicted for known mRNA's.
28. A method according to claim 27 wherein patterns generated for a sample using at least two different Type II or Type IIS restriction enzymes in separate experiments are compared with a database of signals determined or predicted for known mRNA's, by: (i) listing all mRNA's in the database which may correspond to a double-stranded product DNA in each experiment, forming a list of mRNA molecules possibly present for each experiment, and (ii) for each experiment listing mRNA's which definitely do not correspond to a double-stranded product DNA molecule, forming a list of mRNA molecules definitely not present for each experiment, then (iii) removing the mRNA molecules definitely not present from the list of mRNA molecules possibly present for each experiment, and (iv) generating a list of mRNA molecules possibly present and mRNA molecules definitely not present by combining each list generated for each experiment in (iii).
29. A method according to claim 27 or claim 28 wherein the presence in the sample of a known mRNA is identified.
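By way of illustration only, the list-combining steps (i) to (iv) of claim 28 might be sketched in Python as below. The data layout is an assumption made for this sketch and is not taken from the specification: each predicted or observed signal is represented as a tuple of fragment length and primer or adaptor key, and the names `combine_experiments`, `predicted` and `observed` are hypothetical. The sketch also assumes that an mRNA is treated as definitely not present in an experiment when its predicted signal is not among the observed signals.

```python
def combine_experiments(predicted, observed):
    """Combine per-experiment lists of possibly present and absent mRNAs.

    predicted: {experiment: {mrna_id: signal}}, the database of predicted signals
               (hypothetical layout; a signal is a (length, key) tuple).
    observed:  {experiment: set of signals actually detected}.
    Returns (possibly_present, definitely_absent) as two sets of mRNA ids.
    """
    possibly_present = set()
    definitely_absent = set()
    for experiment, signals in predicted.items():
        seen = observed[experiment]
        # (i) mRNAs whose predicted signal matches an observed product
        possibly_present |= {m for m, s in signals.items() if s in seen}
        # (ii) mRNAs whose predicted signal was not observed
        definitely_absent |= {m for m, s in signals.items() if s not in seen}
    # (iii)/(iv) an mRNA excluded by any experiment is removed from the
    # combined list of possibly present mRNAs
    possibly_present -= definitely_absent
    return possibly_present, definitely_absent


# Made-up example with two enzymes and four database mRNAs.
predicted = {
    "enzyme_A": {"m1": (120, "A"), "m2": (120, "A"), "m3": (245, "G"), "m4": (310, "C")},
    "enzyme_B": {"m1": (88, "T"), "m2": (199, "C"), "m3": (150, "A"), "m4": (402, "G")},
}
observed = {
    "enzyme_A": {(120, "A"), (245, "G")},
    "enzyme_B": {(88, "T"), (150, "A")},
}
present, absent = combine_experiments(predicted, observed)
```

In this made-up example m1 and m2 give the same signal with the first enzyme, so the first experiment alone cannot tell them apart; the second experiment excludes m2 because its predicted fragment is not observed.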


CLAIMS:
1. A method of providing a pattern characteristic of a population of mRNA molecules present in a sample, the method comprising:
immobilising mRNA molecules in the sample on a solid support by annealing of a polyA tail of each mRNA molecule to polyT oligonucleotides attached to the support;
synthesizing a cDNA strand complementary to each mRNA attached to the support using the mRNA as template, thereby providing a population of first cDNA strands attached to the support;
removing the mRNA;
synthesizing a second cDNA strand complementary to each first strand attached to the support, thereby providing a population of double-stranded cDNA molecules attached to the support;
digesting the double-stranded cDNA molecules attached to the support with a Type II or Type IIS restriction enzyme to provide a population of digested double-stranded cDNA molecules attached to the support, each digested double-stranded cDNA molecule having a cohesive end provided by the restriction enzyme digestion;
purifying the digested double-stranded cDNA molecules attached to the support by washing away material not attached to the support;
ligating a population of adaptor oligonucleotides to the cohesive end of each of the digested double-stranded cDNA molecules, the adaptor oligonucleotides each comprising an end sequence complementary to a cohesive end and a primer annealing sequence, thereby providing double-stranded template cDNA molecules each comprising a first strand and a second strand wherein the first strand of the double-stranded template cDNA molecules each comprise a 3' terminal adaptor oligonucleotide and the second strand of the double-stranded template cDNA molecules each comprise a 3' terminal polyA sequence;
purifying the double-stranded template cDNA molecules by washing away material not attached to the support;
performing polymerase chain reaction amplification on the double-stranded template cDNA molecules using a population of first primers and a population of second primers, wherein the first primers each comprise a sequence which anneals to a primer annealing sequence of an adaptor oligonucleotide; and
where the restriction enzyme is a Type II enzyme the first primers each comprise at least one 3' terminal variable nucleotide and optionally more than one 3' terminal variable nucleotides wherein the variable nucleotide is, or at a corresponding position within the variable nucleotides each first primer has, a nucleotide selected from A, T, C and G, whereby the population of first primers primes synthesis in the polymerase chain reaction of first strand product DNA molecules each of which is complementary to the first strand of a template cDNA molecule that comprises adjacent to the primer annealing sequence within the first strand of the template cDNA molecule a nucleotide or sequence of nucleotides complementary to the variable nucleotide or nucleotides of a first primer within the population of first primers; or
where the restriction enzyme is a Type IIS enzyme the first primers prime synthesis in the polymerase chain reaction of first strand product DNA molecules each of which is complementary to the first strand of a template cDNA molecule that comprises within the first strand of the template cDNA molecule a sequence of nucleotides complementary to an end sequence of an adaptor oligonucleotide in the population of adaptor oligonucleotides;
the second primers comprise an oligoT sequence and a 3' variable portion conforming to the following formula: (G/C/A)(X)n wherein X is any nucleotide, n is zero, at least one or more than one;
whereby the population of second primers primes synthesis in the polymerase chain reaction of second strand product DNA molecules each of which is complementary to the second strand of a template cDNA molecule that comprises adjacent to polyA within the second strand of the template cDNA molecule a nucleotide or nucleotides complementary to the variable portion of a second primer within the population of second primers;
whereby the polymerase chain reaction amplification provides a population of double-stranded product DNA molecules each of which comprises a first strand product DNA molecule and a second strand product DNA molecule;
separating double-stranded product DNA molecules on the basis of length; and
detecting said double-stranded product DNA molecules;
whereby a pattern characteristic of the population of mRNA molecules present in the sample is provided by combination of length of said double-stranded product DNA molecules and (i) first primer variable nucleotide or nucleotides, where a Type II restriction enzyme is employed, or (ii) adaptor oligonucleotide end sequence, where a Type IIS restriction enzyme is employed;
generating an additional pattern for the sample using a second, different Type II or Type IIS restriction enzyme, and comparing the patterns generated using at least two different Type II or Type IIS restriction enzymes in separate experiments with a database of signals determined or predicted for known mRNA's, by
(i) listing all mRNA's in the database which may correspond to a double-stranded product DNA in each experiment, forming a list of mRNA molecules possibly present for each experiment, and
(ii) for each experiment listing mRNA's which definitely do not correspond to a double-stranded product DNA molecule, forming a list of mRNA molecules definitely not present for each experiment, then
(iii) removing the mRNA molecules definitely not present from the list of mRNA molecules possibly present for each experiment, and
(iv) generating a list of mRNA molecules possibly present and mRNA molecules definitely not present by combining each list generated for each experiment in (iii).
2. A method according to claim 1 which comprises comparing the patterns generated using at least two different Type II or Type IIS restriction enzymes in separate experiments with a database of signals determined or predicted for known mRNA's; by
(i) listing all mRNA's in the database which may correspond to a double-stranded product DNA in each experiment, and forming a set of equations of the form Fi = m1 + m2 + m3, wherein Fi is the intensity of the signal from the fragment, the numerals are the mRNA identity and wherein each mRNA which may correspond to a double-stranded product DNA appears as a term on the right-hand side;
(ii) for each experiment listing mRNA's which definitely do not correspond to double-stranded product DNA in each experiment, and writing for each gene which definitely does not correspond to a double-stranded product DNA in each experiment an equation of the form 0 = m4, wherein the numeral is the mRNA identity;
(iii) combining the sets of equations to form a system of simultaneous equations wherein the number of equations is greater than the number of genes in the organism;
(iv) determining an estimate of the expression level of each gene by finding the least squares solution for the system of simultaneous equations.
3. A method according to claim 1 or claim 2 wherein the restriction enzyme cuts double-stranded DNA with a frequency of cutting of 1/256 - 1/4096 bp.
4. A method according to claim 3 wherein the frequency of cutting is 1/512 or 1/1024 bp.
5. A method according to any one of the preceding claims wherein the restriction enzyme is a Type II restriction enzyme.
6. A method according to claim 5 wherein the restriction enzyme digests double-stranded DNA to provide a cohesive end of 2-4 nucleotides.
7. A method according to claim 6 wherein the restriction enzyme is selected from the group consisting of HaeII, ApoI, XhoII and Hsp92I.
8. A method according to any one of claims 5 to 7 wherein the first primers each have one variable nucleotide.
9. A method according to any one of claims 5 to 7 wherein the first primers each have two variable nucleotides, each of which may be A, T, C or G.
10. A method according to any one of claims 5 to 10 wherein the first primers each have three variable nucleotides, each of which may be A, T, C or G.
11. A method according to any one of claims 5 to 10 wherein each first primer is labelled with a label to indicate which of A, T, C and G is said variable nucleotide or is present at said corresponding position within the variable nucleotides of the first primer.
12. A method according to any one of claims 1 to 4 wherein the restriction enzyme is a Type IIS restriction enzyme.
13. A method according to claim 12 wherein the restriction enzyme digests double-stranded DNA to provide a cohesive end of 2-4 nucleotides.
14. A method according to claim 13 wherein the restriction enzyme is selected from the group consisting of FokI, BbvI, SfaNI and Alw26I.
15. A method according to any one of claims 11 to 14 wherein adaptor oligonucleotides in the population of adaptor oligonucleotides are ligated to cohesive ends of digested double-stranded cDNA molecules in separate reaction vessels from different adaptor oligonucleotides with different end sequences.
16. A method according to claim 15 wherein each reaction vessel contains a single adaptor oligonucleotide end sequence.
17. A method according to claim 15 wherein each reaction vessel contains multiple adaptor oligonucleotide end sequences, each adaptor oligonucleotide sequence in a reaction vessel comprising a different end sequence and primer annealing sequence from the end sequence and primer annealing sequence of other adaptor oligonucleotide sequences in the same reaction vessel, corresponding multiple first primers being employed in the polymerase chain reaction amplification in each reaction vessel.
18. A method according to any one of the preceding claims wherein n is 0.
19. A method according to any one of claims 1 to 17 wherein n is 1.
20. A method according to any one of claims 1 to 17 wherein n is 2.
21. A method according to any one of the preceding claims wherein first primers are labelled.
22. A method according to claim 21 wherein the labels are fluorescent dyes readable by a sequencing machine.
23. A method according to any one of claims 1 to 22 wherein double-stranded DNA molecules are separated on the basis of length by electrophoresis on a sequencing gel or capillary, and the pattern is generated as an electropherogram.
24. A method according to any one of the preceding claims wherein a first pattern characteristic of a population of mRNA molecules present in a first sample is compared with a second pattern characteristic of a population of mRNA molecules present in a second sample.
25. A method according to claim 24 wherein a difference is identified between said first pattern and said second pattern.
26. A method according to claim 25 wherein a nucleic acid whose expression leads to the difference between said first pattern and said second pattern is identified and/or obtained.
27. A method according to any one of the preceding claims wherein the presence in the sample of a known mRNA is identified.
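By way of illustration only, the least squares estimate recited in claim 2 might be sketched in Python as below. The sketch assumes that every coefficient is 0 or 1, as in equations of the form Fi = m1 + m2 + m3 and 0 = m4, and that the combined system has full column rank; it solves the normal equations by Gauss-Jordan elimination, which is one possible way of obtaining a least squares solution and is not stated in the claims. The function name `least_squares` and the example intensities are hypothetical.

```python
def least_squares(equations, n_unknowns):
    """Least squares solution of equations of the form sum(m[i]) = value.

    equations: list of (indices, value) pairs, where indices lists the
               distinct mRNA indices appearing on the right-hand side.
    Returns a list of estimated expression levels, one per mRNA.
    """
    n = n_unknowns
    # Build the normal equations (A^T A) x = A^T b for a 0/1 matrix A.
    ata = [[0.0] * n for _ in range(n)]
    atb = [0.0] * n
    for indices, value in equations:
        for i in indices:
            atb[i] += value
            for j in indices:
                ata[i][j] += 1.0
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, atb)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Made-up example: four mRNAs (indices 0..3), five equations.
equations = [
    ([0, 1], 5.0),  # experiment 1: F1 = m1 + m2
    ([2], 2.0),     # experiment 1: F2 = m3
    ([0], 3.0),     # experiment 2: F3 = m1
    ([1, 2], 4.0),  # experiment 2: F4 = m2 + m3
    ([3], 0.0),     # absent fragment: 0 = m4
]
levels = least_squares(equations, 4)
```

The example uses more equations than unknowns, in line with step (iii) of claim 2, and the made-up intensities are mutually consistent so that the estimate reproduces them exactly.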