WO2004066183A2

WO2004066183A2 - Microrna

Info

Publication number: WO2004066183A2
Application number: PCT/IB2004/000620
Authority: WO
Inventors: Stephen Cohen; Julius Brennecke; Robert B. Russell; Alexander Stark
Original assignee: European Molecular Biology Laboratory
Priority date: 2003-01-22
Filing date: 2004-01-22
Publication date: 2004-08-05
Also published as: WO2004066183A3

Abstract

The invention relates to computational methods of identifying novel microRNA (miRNA) molecules and novel targets for miRNA molecules and the microRNA molecules and targets identified by such methods.

Description

MICRORNA

The present invention relates to methods of identifying new microRNA molecules and their targets and microRNA (miRNA) molecules and targets identified by such methods.

BACKGROUND

miRNAs are short 21-23 nucleotide RNAs originally identified in metazoans. Known miRNAs are transcribed as precursor RNAs, containing an RNA stem loop of approximately 80 nucleotides from which the mature single stranded molecule is excised. miRNAs can be subdivided into two groups based on their mechanism of gene regulation. miRNAs of the first class are complementary to their target sequences and direct RNA cleavage (RNA interference, or RNAi). Prediction of target RNAs for this first class of miRNA is possible using sequence similarity searches (Rhoades et al., 2002, Cell, 110, 513-520). The second class of miRNAs, exemplified by lin-4 and let-7, match their target sequences imperfectly and do not direct RNA cleavage. For the C. elegans miRNAs, lin-4 and let-7, this binding has been shown to allow for bulges, mismatches and non-canonical G:U pairing in the middle of the mRNA target. The lin-4 and let-7 miRNAs regulate translation of target mRNAs. Alignment of these miRNAs to their targets requires allowing for gaps of variable length at variable positions and sequence mismatches. This makes target prediction a difficult computational problem - the known targets of lin-4 and let-7 miRNAs were in fact identified genetically. The short length of miRNAs and their targets makes detection of matches difficult by conventional sequence comparison methods. Even exact or near exact matches have sequence alignment scores that are similar to those found between functionally unrelated sequences. Also, the fast, widely-used BLAST program requires a minimal length of ungapped alignments that makes it difficult to detect short matches containing mismatches, insertions or deletions, which are found when aligning known miRNAs to their target sequences.
As a result, existing methods based on BLAST searches have failed to detect miRNA targets in animal genomes, though some success has been reported in plants due to the very high sequence identity between miRNAs and their targets (Rhoades et al., 2002 [supra]). There are thought to be hundreds of miRNAs in the human genome. Functions are known for hardly any of these, but they are likely to be involved in most, if not all, areas of cell regulation. Some of the same problems - gaps, sequence mismatch - make prediction of targets to be regulated by miRNAs a difficult problem. This is further exacerbated by G:U base pairing, which is difficult to account for using sequence-alignment based search methods. No computational method has been described previously that can predict novel miRNAs or their targets.

A novel approach that solves the problem of target gene prediction and miRNA homologue identification would therefore be very useful. A method to define the spatial expression of miRNAs in animals would also be useful. Novel miRNAs sequences and their target sequences obtainable by such an approach would also be useful as a means of gene regulation, such as regulating the translation of mRNAs.

SUMMARY OF THE INVENTION

The invention is based on the development of a computational method for predicting homologs of miRNAs and also prediction of target genes regulated by miRNAs.

In a first aspect of the invention, there is provided a method for identifying an miRNA molecule, comprising the steps of: a) generating a sequence profile for the miRNA molecule, wherein said sequence profile defines a continuous nucleotide sequence that is 20-30 nt in length, that specifies higher sequence conservation at the 5' and 3' termini of the miRNA than the sequence conservation that is specified in the middle region of the miRNA molecule; b) using the profile as a query sequence to search a database of nucleic acid sequences to identify a putative miRNA sequence that satisfies the sequence profile; c) extending the putative miRNA sequence of step b) to include a region of contiguous nucleotides of genomic sequence immediately upstream and a region of contiguous nucleotides of genomic sequence immediately downstream of the putative miRNA sequence, to generate the predicted precursor of the miRNA molecule; d) assessing the ability of said precursor sequence to fold into a secondary structure; e) selecting as the candidate miRNA molecule, one whose precursor sequence generates a secondary structure with a low predicted energy of folding and which forms a stem loop structure, wherein the sequence of the miRNA molecule itself is fully paired with the other arm of the stem in the precursor sequence and forms no part of the loop. Preferably, the method is a computer-implemented method. New miRNAs cannot be easily identified by carrying out homology searches using public tools such as BLAST since functional miRNA homologs need not be perfectly conserved at the sequence level, particularly in the middle region of the miRNA molecule. One limitation of such methods is thus the fact that target sequences can be interrupted by mismatches and loops, and these have severely detrimental effects on searches that input short sequence queries. 
A second problem is that G:U base pairs are allowed in RNA heteroduplexes and have been observed in miRNA-target complexes, and these are not permitted in algorithms such as BLAST. Searches using such public domain tools are therefore likely to fail owing to the limitations of the algorithm. Homolog prediction is thus a challenging computational problem for which no solution has been described previously. The method of the first aspect of the invention, which takes into account not only the sequence, but also the structure of the precursor of the miRNA molecule, allows for identification of miRNA homologs that are conserved at the ends of the miRNA precursor but may diverge in sequence internally. In one embodiment of the first aspect of the invention, the sequence profile of step a) can be generated from a single molecule and/or its reverse complement (see later description regarding the exact model for further details). For example, a profile can be generated from exact copies of the miRNA in a single species if no other information is available. Also, a sequence profile can be generated by aligning multiple copies of a single molecule and varying the nucleotides in the middle region to make hypothetical miRNA homologs that are in effect approximations of potential miRNA homologs.

In another embodiment of the first aspect of the invention, the sequence profile of step a) of the method is generated by aligning homologs of the miRNA sequence of interest together to give a sequence profile that is a characteristic statistical description of the consensus sequence that is representative of the miRNA molecule. Preferably, the sequence profile generated, such as by aligning the homologs, is a profile hidden Markov model (profile HMM) of which examples are shown herein. Profile HMMs (see Durbin et al., "Biological sequence analysis: probabilistic models of proteins and nucleic acids", Cambridge University Press, 1998) can be used to perform sensitive database searching using statistical descriptions of a particular consensus. The homologs that are aligned according to this step of the method may be derived from one distinct species, from several related species or from a variety of different species. Preferably, there are at least two or more species. Such species may include vertebrate species. The sequence profile generated by the multiple alignment is scored such that a higher degree of sequence conservation is required at the 5' and 3' termini of the miRNA, than is required in the middle region of the miRNA. Generally, a slightly higher degree of sequence conservation is required at the 5' terminus of the miRNA molecule than is required at the 3' terminus. The middle region as defined herein preferably refers to the central 2-10 nucleotides, more preferably, the central 3-6 nucleotides of the miRNA molecule, since known miRNA molecules are often found to have the central nucleotides of the molecule forming a loop in the stem loop structure generated in the miRNA precursor molecule. In generating the sequence profile, account may thus be taken of specific features that are characteristic of certain miRNA species or their targets, such as the existence of insertions or deletions in the central loop. 
If desirable, "hypothetical homologs" may also be used in generating the sequence profile. Such "hypothetical homologs" can be generated by randomly varying the middle region of a known miRNA molecule sequence. Preferably, the continuous nucleotide sequence defined in the sequence profile is between 20-28 nucleotides in length, more preferably, between 21 and 25 nucleotides in length, even more preferably, between 21 and 23 nucleotides in length. These lengths of sequence appear to be most common in miRNA molecules that have been identified to date.
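The generation of "hypothetical homologs" described above can be illustrated with a minimal sketch. It assumes an RNA alphabet, a middle region of four nucleotides centred on the molecule, and a uniform random choice at each varied position; the function name, parameters and the example sequence are hypothetical and are not taken from the present description.

```python
import random

def hypothetical_homologs(mirna, n_variants=5, middle_len=4, seed=0):
    """Return copies of `mirna` whose central `middle_len` nucleotides are randomised.

    The 5' and 3' termini are left untouched, mirroring the requirement that
    conservation is higher at the termini than in the middle region.
    """
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    start = (len(mirna) - middle_len) // 2
    end = start + middle_len
    variants = []
    for _ in range(n_variants):
        middle = "".join(rng.choice("ACGU") for _ in range(middle_len))
        variants.append(mirna[:start] + middle + mirna[end:])
    return variants

# Arbitrary 21 nt example sequence, not a real miRNA.
example = "ACGUACGUACGUACGUACGUA"
variants = hypothetical_homologs(example, n_variants=3)
```

The resulting variants could then be aligned with the original sequence to build a profile by whichever profile-building tool is preferred; that step is not shown here.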

In step b) of the method, the profile is used as a query sequence to search a database of nucleic acid sequences to identify a putative miRNA sequence that satisfies the sequence profile. As the skilled reader will be aware, a number of different methodologies that are capable of searching a database using a profile HMM might be utilised and any one of these methods may be utilised in the method of the present invention. A preferred methodology is that provided by the HMMER tool (Eddy, 1995, Proc. Third Int. Conf. Intelligent Systems for Molecular Biology, C. Rawlings et al., eds. AAAI Press, Menlo Park. pp. 114-120; Eddy, S.R. (2001) HMMER: Profile hidden Markov models for biological sequence analysis [http://hmmer.wustl.edu/]; Eddy S.R., (1998): Profile hidden Markov models, Bioinformatics, 1998; 14(9):755-63) which is an example of a freely distributable implementation of profile HMM software for sequence analysis. The database that is searched in step b) of the above-described method may be a database of cDNAs, ESTs, mRNAs or the whole genome. Preferably, the database is a genomic DNA database. Screening a whole genome provides the maximum opportunity to identify all the putative miRNAs present in an organism. Putative miRNA sequences are identified as those which satisfy the sequence profile used as the input sequence in the database search. In step c) of the method, an identified putative miRNA sequence is extended to include a region of contiguous nucleotides of genomic sequence immediately upstream and a region of contiguous nucleotides immediately downstream of the putative miRNA sequence, to generate the predicted precursor of the miRNA molecule. Preferably, around 80 nucleotides are excised around the putative miRNA sequence, including between around 40 and 60 nucleotides upstream (preferably around 50) and around 5-15 nucleotides downstream (preferably around 10) and vice versa. 
In step d) of the method, the ability of said precursor sequence to fold into a secondary structure is assessed. A number of techniques are available for the prediction of RNA secondary structure. The quickest and easiest route to RNA structure prediction is through the use of simple energy rules or energy minimization criteria (for review, see Serra et al., 1995, Meth. Enzymol., 259, 243-261). Any predicted "optimal" secondary structure for an RNA or DNA molecule depends on the model of folding and the specific folding energies used to calculate that structure. Generally, simple energy rules are insufficient to capture the destabilizing effects of various loops, or the nearest neighbour interactions in helices and loops - more sophistication is required and this may be provided by computational tools that have been developed specifically to predict the ability of a given RNA molecule to fold into a secondary structure.
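The extension of step c) can be sketched as follows. This is an illustration only: it assumes the genomic sequence is held in memory as a string with 0-based, end-exclusive coordinates, and the function and parameter names are hypothetical. Two windows are returned to reflect the "and vice versa" arrangement, since the putative miRNA may lie on either arm of the precursor.

```python
def candidate_precursors(genome, hit_start, hit_end, long_flank=50, short_flank=10):
    """Return two candidate precursor windows around genome[hit_start:hit_end].

    The first window has the long flank upstream of the hit, the second has the
    long flank downstream. Windows are clipped at the ends of the sequence.
    """
    windows = []
    for up, down in ((long_flank, short_flank), (short_flank, long_flank)):
        start = max(0, hit_start - up)
        end = min(len(genome), hit_end + down)
        windows.append(genome[start:end])
    return windows
```

With the default flanks and a 22 nt hit, each window is about 80 nucleotides long, in line with the approximate precursor length given above.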

A number of such tools may be used for secondary structure prediction in accordance with the method of the present invention. A preferred method is Mfold, a set of programs developed by M. Zuker and the laboratory of D.H. Turner that uses dynamic programming to predict RNA secondary structures by free energy minimization (see Zuker et al., Algorithms and thermodynamics for secondary structure prediction: a practical guide. In RNA Biochemistry and Biotechnology, 11-43, J. Barciszewski & B.F.C. Clark, eds., NATO ASI series, Kluwer Academic Publishers, 1999; Mathews et al., J. Mol. Biol. 288, 911-940, 1999). The Mfold server currently at (http://www.bioinfo.ipi.edu/applications/mfold/old/rna/form3.cgi) accepts submissions of query sequences of interest. This method uses the energy rules developed by Turner and colleagues to determine optimal and suboptimal secondary structures for an RNA molecule. Mfold calculates energy matrices that determine all optimal and suboptimal secondary structures for a given RNA molecule. The program writes these energy matrices to an output file. A companion program, PlotFold, reads this output file and displays a representative set of optimal and suboptimal secondary structures for the molecule within any increment of the computed minimum free energy chosen.

As part of the calculation performed in step d), the energy of folding (free energy ΔG) of said precursor sequence is calculated and compared to free energies for known miRNA molecules. The "energy of folding", measured as ΔG, is a measure of the preferred folded conformation for an RNA molecule. ΔG describes the free energy change for a process at constant temperature and pressure and is defined mathematically as ΔG = ΔH - TΔS where T is temperature, S is entropy and H is enthalpy. The more negative the energy of folding is for a molecular structure, the more favoured such a structure is thermodynamically. This step of the method thus selects for those molecules whose folded structures have a low ΔG when compared to the energy of folding of an unfolded RNA molecule. Preferably, ΔG is equal to or below -18 kJ/mol, more preferably, equal to -20 kJ/mol, -21 kJ/mol, -22 kJ/mol, -23 kJ/mol, -24 kJ/mol, -25 kJ/mol, -26 kJ/mol, -27 kJ/mol, -28 kJ/mol, -29 kJ/mol, -30 kJ/mol, -31 kJ/mol, -32 kJ/mol, -33 kJ/mol, -34 kJ/mol, -35 kJ/mol, or below. Additionally, the structural requirements of known miRNA and their precursors should be satisfied. For example, the match must not be in the main loop structure of the precursor RNA. Accordingly, the candidate miRNA molecule selected in step e) of the method is one whose precursor sequence generates a secondary structure with a low predicted energy of folding and which forms a stem loop structure, wherein the sequence of the miRNA molecule itself is situated on the stem in the precursor sequence and forms no part of the main loop connecting the arms of the hairpin. Preferably, the sequence of the miRNA molecule is fully paired with the other arm to create the stem of the hairpin. More preferably, the stem loop structure has a stem length of at least 21 nt. This is to accommodate the miRNA sequence in the stem part of the stem loop structure. Furthermore, it is preferred that there are no side-branches on the stem of the precursor. Again, this is to mirror the situation found in known miRNA precursor molecules.
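The selection criteria of steps d) and e) can be sketched as a simple filter. The sketch assumes that a folding program has already supplied the predicted ΔG and a secondary structure in dot-bracket notation ("(" and ")" for paired positions, "." for unpaired positions); that notation, the function name and its parameters are assumptions made for illustration and are not prescribed by the present description. The threshold is used in the units reported by the folding tool, and the stem-length preference is not checked here.

```python
def passes_precursor_filter(structure, delta_g, mir_start, mir_end, max_delta_g=-18.0):
    """Apply the energy and structural preferences to one folded precursor.

    structure  : dot-bracket string for the precursor (assumed input format)
    delta_g    : predicted folding energy of the precursor
    mir_start, mir_end : 0-based, end-exclusive span of the putative miRNA
    """
    # Energy criterion: the fold must be at least as stable as the threshold.
    if delta_g > max_delta_g:
        return False
    # No side branches: a single hairpin never opens a pair after the first close.
    first_close = structure.find(")")
    if first_close == -1 or "(" in structure[first_close:]:
        return False
    # The miRNA must be fully paired and lie on one arm only, so it cannot
    # include any part of the main loop.
    span = structure[mir_start:mir_end]
    return len(span) > 0 and (set(span) == {"("} or set(span) == {")"})
```

A candidate whose span includes any unpaired position, or which straddles both arms, is rejected under these assumptions; a looser variant could tolerate a small number of unpaired positions within the stem.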

In addition to those steps described above, the method may additionally comprise the further steps of screening for the presence of the precursor sequence predicted in step c), or a homolog thereof, in the genome of a closely related organism. This acts as a secondary filter, since precursor miRNAs are often conserved between closely related species such as Drosophila and Anopheles or human and mouse and/or Fugu. Homologs preferably exhibit a high degree of sequence identity with the precursor sequence, preferably at least 70%, more preferably 80%, 90%, 95%, 99% or more over the full length of the precursor sequence. Identity may be assessed using any suitable alignment technique known in the art (see Computational Molecular Biology, Lesk, A.M., ed., Oxford University Press, New York, 1988; Biocomputing. Informatics and Genome Projects, Smith, D.W., ed., Academic Press, New York, 1993; Computer Analysis of Sequence Data, Part 1, Griffin, A.M., and Griffin, H.G., eds., Humana Press, New Jersey, 1994; Sequence Analysis in Molecular Biology, von Heijne, G., Academic Press, 1987; and Sequence Analysis Primer, Gribskov, M. and Devereux, J., eds., M Stockton Press, New York, 1991). Furthermore, erroneous matches may be removed by excluding precursor sequences that fall within the coding sequences of a closely related organism. This reduces the number of false positives identified.

It will be readily apparent to a skilled person that the method of the first aspect of the invention may be adapted to identify other miRNA molecule types. By altering the sequence profile of step a), the method could be used for identifying miRNA molecules that have a uniformly high sequence conservation throughout their length rather than those miRNA molecules with a higher sequence conservation at the 5' and 3' termini of the miRNA.

Another problem currently suffered in this field is how to identify the target(s) of an miRNA molecule. The inventors examined the known targets of the C. elegans lin-4 and let-7 miRNAs (which were in fact identified genetically) and found that many miRNA targets have characteristics that are incompatible with identification by standard sequence based searches such as BLAST. Indeed, the BLAST-based method recently described for use in Arabidopsis was unable to identify targets of known miRNAs in metazoans (Rhoades et al., 2002). One obvious limitation of BLAST-based methods is that target sequences can be interrupted by mismatches and loops, which have severely detrimental effects on BLAST searches with short sequence queries. A second problem is that G:U base pairs are allowed in RNA heteroduplexes and have been observed in miRNA-target complexes. The inventors have now developed a computational method to screen the genome for possible targets for regulation by miRNAs. According to a first embodiment of this second aspect of the invention, there is provided a method for identifying the target molecule of an miRNA of interest, said method comprising the steps of: a) searching a database of nucleic acid sequences to identify a putative target sequence that comprises a homologous reverse complement sequence to the miRNA of interest; b) extending the putative target sequence from step a) to include a hairpin-forming linker sequence immediately downstream of the complement sequence; c) generating a hypothetical test sequence by extending the resulting sequence of step b) to include the sequence of the miRNA of interest immediately downstream of the hairpin-forming linker sequence comprising a canonical hairpin loop; d) assessing the ability of the hypothetical test sequence of step c) to fold into a secondary structure; e) selecting as the candidate target molecule, a putative target sequence for which the hypothetical test sequence generates a predicted stem loop structure with a low predicted ΔG and where the sequence of the miRNA of interest is paired to the putative target sequence in such a manner that neither the target sequence nor the miRNA forms the loop of the stem loop structure.
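Steps b) and c) of this embodiment amount to a concatenation, which can be sketched as below. The linker shown ("GAAA") is merely a placeholder chosen for illustration; the linker actually used, the function name and the returned coordinate convention are assumptions and are not specified here.

```python
def build_test_sequence(target_site, mirna, linker="GAAA"):
    """Concatenate target site, hairpin-forming linker and miRNA, in that order.

    Returns the hypothetical test sequence together with the 0-based,
    end-exclusive spans of the target site and of the miRNA, so that a later
    step can check that neither falls within the loop of the predicted fold.
    """
    sequence = target_site + linker + mirna
    target_span = (0, len(target_site))
    mirna_span = (len(target_site) + len(linker), len(sequence))
    return sequence, target_span, mirna_span
```

The resulting sequence would then be passed to a secondary structure prediction tool in step d); that call is deliberately left out, since it depends on the tool chosen.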

A second, alternative embodiment of this second aspect of the invention exploits the existence of improved methods for the calculation of the predicted energy of annealing of miRNA and target molecules, that do not require concatenation of the miRNA and target sequences. This eliminates the need for the addition of a hairpin-forming linker sequence.

This embodiment of the invention thus provides a method for identifying the target molecule of an miRNA of interest, said method comprising the steps of: a) searching a database of nucleic acid sequences to identify a putative target sequence that comprises a homologous reverse complement sequence to the miRNA of interest; b) predicting the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest; c) selecting as the candidate target molecule, a putative target sequence which is predicted to base pair with the miRNA of interest with a favourable predicted free energy ΔG.

These methods have been shown herein to allow the identification of known targets in complex genomes. The methods work on any DNA/RNA database, although they improve depending on the quality and size of the database. Best results are generated from a database of well-annotated or even experimentally verified 3'UTRs and accordingly, a database of this nature is preferred. An example of such a database would be a correctly annotated database consisting of the entire transcriptome of an organism, including alternate splice forms. Such databases are not yet available. Until such databases are available a pragmatic compromise consists of first searching an available transcriptome database and in a second step searching a hypothetical 3'UTR database, generated by taking at least 1500 bp, preferably 2000 bp or 2500 bp of DNA downstream from the translation stop codon of all annotated genes. Methods for the generation of such hypothetical databases are included as aspects of the present invention.
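The construction of a hypothetical 3'UTR database can be sketched for a single gene as follows. The sketch assumes that the chromosome sequence is available as a DNA string and that the annotation supplies, for each gene, its strand and the forward-strand coordinate at which the sequence adjacent to the stop codon begins; these inputs and all names are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]

def hypothetical_utr(chrom, stop_boundary, strand, length=2000):
    """Return up to `length` bp of sequence downstream of the stop codon.

    stop_boundary is a 0-based forward-strand coordinate:
      '+' strand: the index just past the last base of the stop codon;
      '-' strand: the index of the lowest-coordinate base of the stop codon.
    The result is always written 5'->3' in the sense of the transcript.
    """
    if strand == "+":
        return chrom[stop_boundary:stop_boundary + length]
    upstream_on_forward = chrom[max(0, stop_boundary - length):stop_boundary]
    return reverse_complement(upstream_on_forward)
```

Repeating this for all annotated genes, with `length` set to 1500, 2000 or 2500, would give one entry per gene in the hypothetical database.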

At present, gene prediction programs can identify coding sequence but do not predict UTRs, hence many annotated genes lack UTR data. A 3'UTR, as used in the present description, refers to the region of a transcript, that is 3' to the stop codon, and which is not translated. In order to reduce the total amount of genomic sequence that needs to be searched for a potential target, a conserved 3'UTR database may be created. Preferably, a conserved 3' UTR database may be created in a method comprising the following steps:

(a) taking known or predicted 3' UTRs of one organism and selecting those that are longer than a certain threshold length, for example, more than 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, more preferably, more than 50 nucleotides in length;

(b) identifying homologous UTR sequences in the genome of another organism to define UTR sequences conserved in evolution and hence likely to have a function; and

(c) selecting only those UTR sequences from the organism in (a) that are conserved, as identified in step (b), for inclusion in the 3'UTR database. A 3'UTR database such as that created in step (c) when known 3' UTRs are taken in step (a) is herein referred to as a "conserved 3'UTR database".

For genes in the first organism lacking experimentally validated 3' UTRs, predicted UTRs can be used in step (a) as described above to search for conserved UTR sequences in the predicted UTRs of the organism in step (b). The conserved UTR sequences from this organism can be used to construct a "conserved predicted 3' UTR database". This can be used either alone or together with the "conserved 3' UTR database". Such databases form aspects of the present invention, as do methods for generating such databases, as discussed in some detail herein. Preferably, there is an intermediate step (a2) included in the above method for generating a conserved 3'UTR database between steps (a) and (b), wherein identical duplicate UTRs from different splice variants of the same transcript of the organism of step (a) are removed. This reduces the number of target sequences taken into step (b).
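The intermediate step (a2) can be sketched as a simple de-duplication. The record layout (gene identifier, transcript identifier, UTR sequence) is an assumption made for illustration; any annotation that links splice variants to a common gene would serve.

```python
def drop_duplicate_utrs(records):
    """Keep one copy of each identical UTR per gene.

    records: iterable of (gene_id, transcript_id, utr_sequence) tuples.
    UTRs that are identical but belong to different genes are both kept.
    """
    seen = set()
    kept = []
    for gene_id, transcript_id, utr in records:
        key = (gene_id, utr)
        if key not in seen:
            seen.add(key)
            kept.append((gene_id, transcript_id, utr))
    return kept
```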

Preferably, homologous UTR sequences are identified in step (b) above by a method comprising the steps of:

(i) generating an amino acid sequence by translating the 3' nucleotides of the ORF of the transcript to which the 3'UTR of step (a) belongs;

(ii) using the amino acid sequence of step (i) in a homology search of the genome of the organism for which a target sequence is to be identified and selecting a region from the genome that encodes a polypeptide sequence that gives an E value below a significance threshold (for example, an E value of less than or equal to 10^-5, preferably, less than 10^-6, more preferably lower still) when compared with the amino acid sequence of step (i);

(iii) selecting only those regions from step (ii) that encode the C-terminal-most amino acid residues and have a sequence identity of >80% or E<10^-10 over a region spanning the C-terminal-most amino acids;

(iv) comparing the 3'UTR sequence from organism one of step (a) with a region of nucleotides downstream of the region of step (iii) from organism two and selecting those with an E value of equal to or less than a significance threshold. Non-conserved nucleotides or those outside the matched regions are replaced by "N"s in the 3' UTR database from organism one to produce the conserved 3' UTR database. Residues replaced by "N" are ignored by the sequence search tool. In step (i), the amino acid sequence is preferably above a threshold length, for example, above 40, 45, 50, 55 or 60 amino acid residues in length, more preferably, about 50 amino acid residues in length.
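The replacement of non-conserved nucleotides by "N" can be sketched as below. It assumes that the comparison of step (iv) has already yielded the matched regions as 0-based, end-exclusive coordinates on the UTR of organism one; the function name and that input format are hypothetical.

```python
def mask_nonconserved(utr, conserved_spans):
    """Return `utr` with every position outside the conserved spans set to 'N'.

    conserved_spans: iterable of (start, end) pairs on `utr`, 0-based and
    end-exclusive, with non-negative coordinates.
    """
    masked = ["N"] * len(utr)
    for start, end in conserved_spans:
        masked[start:end] = utr[start:end]
    return "".join(masked)
```

For example, `mask_nonconserved("ACGTACGT", [(2, 5)])` gives `"NNGTANNN"`, so that a subsequent search only sees the conserved block.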

In step (ii), the homology search may be performed according to any suitable method, as will be clear to those of skill in the art. A suitable method, for example, is the tblastn software available at http://www.ncbi.nlm.nih.gov/BLAST/. A suitable significance threshold may be an E value of less than or equal to 10^-5, preferably, less than 10^-6, more preferably lower still.

In step (iii), only regions from step (ii) that encode the C-terminal-most amino acid residues (for example, the C-terminal-most 20, 10 or 5 amino acid residues, preferably 10) are selected to ensure that the end of the ORF is defined and an internal exon is not mapped. Furthermore, only those regions that exhibit a certain degree of homology are selected. Preferably, only those regions that have a sequence identity of >80% (more preferably, >85%, >90%, >95%) or E<10^-10 (preferably E<10^-15, E≤10^-20, E<10^-50 or less) over a region spanning around the 50 C-terminal-most amino acids are selected. These E values apply to BLAST matches irrespective of how long they are. The 80% threshold cutoff is advantageous, because on a genome level, short sequences can have high (bad) E-values even if they are 90% identical and clearly orthologous.
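The homology criterion of step (iii) combines two alternative tests, which can be written as a one-line predicate. The sketch assumes that percent identity and E value have already been obtained from the homology search; the function and parameter names are hypothetical.

```python
def keeps_cterminal_match(percent_identity, e_value, min_identity=80.0, max_e=1e-10):
    """Accept a C-terminal match if it passes either the identity or the E value test."""
    return percent_identity > min_identity or e_value < max_e
```

Because the two tests are alternatives, a short match that is 90% identical is retained even when its E value is poor, which is the situation the identity cutoff is intended to cover.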

In step (iv), the 3'UTR sequence of step (a) is preferably compared with a region of about 3000 nucleotides downstream of the region of step (iii). Those with an E value of equal to or less than a significance threshold are selected; a suitable significance threshold might be around 10 000, assuming a database the size of the whole D. pseudoobscura genome when using BLASTN.

An example of a case in which such a conserved 3' UTR database has been generated by the inventors is when the organism of step (a) is Drosophila melanogaster and the organism of step (b) is Drosophila pseudoobscura. Example 13 herein demonstrates the successful application of this method for finding hid as a target for bantam. Using the genome of Drosophila pseudoobscura in this respect has been found to be advantageous over using the Anopheles genome because of the availability of a higher level of completeness of its gene predictions. Furthermore, it has been found advantageous to include additional genomes for comparison to derive a multi-genome conserved 3' UTR database. For example, the honeybee or mosquito genomes could be compared with Drosophila when they are annotated sufficiently well for the 3' UTRs to be compared by the method described above and used herein for Drosophila pseudoobscura.

Another example of relevance to the present invention is when the organism of step (a) is a mouse and the organism of step (b) is a human. Generating a conserved 3'UTR database for the human genome in this way reduces the amount of sequence that needs to be searched for a potential target site. This is particularly important given the large sizes of the mouse and human genomes. Other examples include fugu (pufferfish). Conservation over three (or four or five or more) related genomes is a more powerful filter than just two genomes. Additionally, mouse and human genes are quite similar even in the 3' UTRs, so the sequence conservation is high.

Including additional vertebrate genomes that are more distant to human than mouse would be helpful in making a useful mammalian 3' UTR database (for example, the annotation of genomes such as the zebrafish and medaka fish should soon be in a suitable state to allow such comparisons). This method is reliant on the accuracy and availability of the annotation of the human and mouse genome. The accuracy of this method is thus going to improve as the accuracy and availability of the annotation of the human and mouse genome improves.

In the meantime, given that the annotation of the human and mouse genome is not complete, the method can be adapted by making use of known or predicted Drosophila targets. In one embodiment of this aspect of the invention, therefore, a conserved 3'UTR database may be created for the mouse or human genomes by finding homologous sequences to predicted (preferably validated) targets of the Drosophila miRNAs and C. elegans miRNAs. Homologues identified according to this method are included as aspects of the present invention. If genetic evidence can be used to restrict the targets to a set of proteins or a specific region of the genome, the database size decreases and the sensitivity of the method thus increases. In addition, the profile and structural constraints will undoubtedly improve as more miRNA target sequences are identified and a more general picture of miRNA target complementarity emerges. Preferred organisms for study are eukaryotes, particularly mammals and, of course, the human.

In a preferred alternative to the generation of a conserved 3'UTR database, the methods described above may be applied separately (either sequentially or simultaneously) to more than one genome. For example, a database of UTRs from a first genome may initially be searched (for example, Drosophila melanogaster). If a promising candidate target site is found in a UTR, then the UTR from the corresponding gene is searched in a second genome (for example, Drosophila pseudoobscura). The first and second genomes can, of course, be any genomes. Examples include human and mouse, but other mammal and vertebrate genomes may be used, or indeed any genome whatsoever. This approach can be extended to include a third genome or any number of related genomes as desired. Increasing the number of genomes evaluated improves the filter for conservation during evolution and hence reduces false positives due to random matches. One preferred embodiment of the above-described methods of the second aspect of the invention thus involves performing the method iteratively for a second genome, and optionally for third, fourth, fifth, sixth, seventh, eighth, ninth, tenth or further genomes.

In step a) of the above-described method for identifying the target molecule of an miRNA of interest, a database of nucleic acid sequences is thus searched to identify a putative target sequence that comprises a homologous reverse complement sequence to the miRNA of interest. As with the method of the first aspect of the invention, a sequence profile is used for the search that is a characteristic statistical description of the consensus sequence that is representative of an miRNA target. The sequence profile may be a profile hidden Markov model (profile HMM). If profile HMMs are used, a range of profile HMMs are preferably used to search for sequences complementary to miRNAs to allow for a range of possible target configurations. Alternatively to profile HMMs, sequence strings can be searched using simple pattern recognition tools. For example, a simple string matching programme, written for example in PERL, would suffice and be less computationally intensive than HMMER and thus of greater practical utility. For example, five different models may be used (three are illustrated in Figures 8A-C). A model referred to herein as the "exact" model assumes perfect alignment between miRNA and target and imposes a penalty for mismatches or loops in either miRNA or its target. A second model, referred to herein as the insertion-deletion ("indel") model allows loops in either the miRNA or its target. A third model, referred to herein as the "loop" model allows loops only in the miRNA. By limiting the loops to one strand, the loop model allows a greater range of variation in the extent and number of loops than can be used with the indel model. Profiles incorporating gapped alignments should thus be generated containing mismatches of the test miRNA reverse complement sequence, using the miRNA sequence as input.

The exact model should contain a number of exact copies (for example, between 3 and 10, preferably, 5 exact copies) of the reverse complement. It is expected that certain target sequences of miRNA molecules will be fully paired to miRNA molecules, thus the requirement in this model that the target sequences be fully paired. "Fully paired", as used herein, includes G:U base-pairing. In one version of the "exact" model, a conserved 3'UTR database is searched assuming G:C base pairing only followed by a search giving equal "weight" to G:C and G:U pairing. The two lists of prospective targets are merged and the duplicates removed.
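The two-pass search described above (G:C pairing only, then a search giving equal weight to G:U pairing, followed by merging of the two lists and removal of duplicates) can be illustrated with simple string matching rather than profile HMMs. The following Python sketch is illustrative only: the function names, the use of regular expressions and the toy UTR sequences are assumptions made for this example and are not part of the described method.

```python
import re

# Bases of the target that may face each miRNA base (RNA alphabet).
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}
# Same table, but additionally allowing G:U wobble pairs.
WITH_WOBBLE = {"A": "U", "U": "[AG]", "G": "[CU]", "C": "G"}


def site_pattern(mirna, allow_gu):
    """Regex for a fully paired target site, read 5'->3' on the target strand."""
    table = WITH_WOBBLE if allow_gu else WATSON_CRICK
    body = "".join(table[base] for base in reversed(mirna))
    # A lookahead is used so that overlapping sites are also reported.
    return re.compile("(?=(" + body + "))")


def exact_model_hits(mirna, utrs):
    """Run both passes and merge them; a set removes duplicate hits."""
    mirna = mirna.upper().replace("T", "U")
    hits = set()
    for allow_gu in (False, True):
        pattern = site_pattern(mirna, allow_gu)
        for name, seq in utrs.items():
            seq = seq.upper().replace("T", "U")
            for match in pattern.finditer(seq):
                hits.add((name, match.start(1)))
    return sorted(hits)


# Toy data (hypothetical): a 4-nt "miRNA" and three short "UTRs".
toy_utrs = {"utr1": "AAUCACAA", "utr2": "AAUUGUAA", "utr3": "AAAAAA"}
toy_hits = exact_model_hits("GUGA", toy_utrs)
```

In this toy case, `utr1` carries a Watson-Crick site and is found by both passes but reported once, whereas `utr2` is found only when G:U pairs are allowed.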

Even using the exact alignment model can lead to high numbers of potential target sites, however. In one embodiment of this invention, therefore, to increase the probability of the identified potential target site being a target site in vivo, certain steps may be added.

These improvements incorporate a measurement of the significance of predicted target sites. Significance may be measured either using so-called "Z" scores, or using "E" values.

(a) by Z scores

The stability of a hypothetical test sequence may be measured by calculating the Z score and only selecting those hypothetical test sequences that have a Z score of more than or equal to 3, most preferably more than or equal to 4. The Z score is defined as Z = {ΔG(target site) - ΔG(mean of random sequences)} / (standard deviation of ΔG for random sequences), and is a measure of the likelihood that the predicted target site is significantly different to a random sequence.

A "random sequence" is herein defined as being a sequence that is not a natural target site for miRNA in vivo. "ΔG(target site)" is the folding energy of the hypothetical test sequence having the target site. "ΔG(mean of random sequences)" is the mean of the folding energy of the hypothetical test sequences having random sequences replacing the target site sequence. Preferably, the number of random sequences used in the calculation is more than 8000, more than 9000, more than 10000 or more. The random sequences are required to be the same length as the average predicted target site. For convenience, in the example used herein, the random sequences were chosen by taking the first N nt from the first 10 000 UTRs in the 3'UTR database, where N = average predicted target site length for each miRNA. However, it will be clear that any other means of selecting random sequences of the equivalent length would be equally applicable.

The higher the Z score, the more significant is the hit, that is, the more likely it is that the predicted target site is not a random sequence. For example, a score of Z=3 means that a test site has a predicted folding energy 3 standard deviations above the mean predicted folding energy for random sequences. For example, selecting sites of Z=3 or higher would eliminate 99.6% of random sequences.
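As a minimal sketch of the Z score calculation defined above, the following Python function applies the formula literally. The function name, the use of the population standard deviation and the toy numbers are assumptions made for illustration. In particular, the sketch assumes that energies are supplied in a convention in which a larger value means a more stable structure (for example, the negative of ΔG), so that a higher Z corresponds to a more significant site as stated in the text.

```python
from statistics import mean, pstdev


def z_score(site_energy, random_energies):
    """Z = (energy of site - mean of random energies) / s.d. of random energies.

    Assumption: larger energy values mean more stable structures (e.g. -dG).
    Assumption: the population standard deviation is used.
    """
    mu = mean(random_energies)
    sigma = pstdev(random_energies)
    if sigma == 0:
        raise ValueError("random energies have zero spread")
    return (site_energy - mu) / sigma


# Toy background (hypothetical values): mean 5.0, standard deviation 2.0.
background = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
example_z = z_score(11.0, background)
keep_site = example_z >= 3  # cutoff taken from the text
```

In practice the background would hold the energies of several thousand random sequences of the same length as the predicted site, as described above.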

It is possible that more than one predicted target site for a specific miRNA can fall within the same 3'UTR when using any of the methods of the present invention. In fact, multiple target sites falling within the same UTR increase the confidence that that particular UTR contains a natural target site in vivo rather than just a random sequence. Z scores can be used to take multiple sites into account. Adding together the Z scores for predicted target sites that fall within the same 3'UTR (ZUTR) gives a further indication of the likelihood that there is indeed a natural target site in vivo within the 3'UTR.

Adding up the ΔG alone does not work because most random matches have favourable energy values. Z scores therefore make it possible to take into account that a known target can contain multiple predicted miRNA binding sites in its UTR. Preferably, a cutoff value of Z=3 is chosen as giving a reasonable probability of the site being valid. Use of a higher Z value increases the likelihood that a prediction is correct, but increases the risk of missing the possible contribution of valid sites of lower folding energy. The lists of predicted targets have been evaluated according to the best single site in the UTR (Zmax) and by the sum of sites in each UTR with Z>3 (ZUTR).
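The two per-UTR rankings mentioned above (Zmax and ZUTR) can be sketched as follows. This Python example is illustrative only; the function name, the input format (pairs of UTR identifier and Z score) and the sample values are assumptions, and only sites with Z above the cutoff contribute to the sum, following the wording "sum of sites in each UTR with Z>3".

```python
from collections import defaultdict


def summarise_utrs(site_scores, cutoff=3.0):
    """Return {utr_id: {"z_max": best single site, "z_utr": sum of sites with Z > cutoff}}."""
    per_utr = defaultdict(list)
    for utr_id, z in site_scores:
        per_utr[utr_id].append(z)
    return {
        utr_id: {
            "z_max": max(zs),
            "z_utr": sum(z for z in zs if z > cutoff),
        }
        for utr_id, zs in per_utr.items()
    }


# Hypothetical predicted sites: (UTR identifier, Z score of the site).
sites = [("utrA", 3.5), ("utrA", 4.0), ("utrA", 2.0), ("utrB", 5.0)]
summary = summarise_utrs(sites)
```

Here `utrA` has a lower best single site than `utrB` but a higher summed score, which is the situation in which the two rankings differ.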

(b) by E values (expectation values)

The significance of the target site prediction may also be measured by using expectation (E) values. Similarly to the E-value that is used in BLAST analysis, E predicts the number of background matches that are equal to or better than the score for that particular target site prediction.

E values can be computed by fitting an exponential function to the cumulative background distributions for energies and extrapolating to give a value for any observed energy. To compute the probability of finding multiple sites in a single UTR, the E value should be calculated for each site assuming the database consists of only the single UTR (i.e. asking for the probability of multiple sites in that UTR as distinct from finding them in the whole database). Next, all the E values for the individual sites are combined by multiplication to get the E-value for having multiple sites within one UTR (E-values for single sites in single UTR sequences correspond to probabilities: E ≈ P if E ≪ 1). The resulting UTR E-value is finally multiplied by the real database size (total set of conserved UTRs) to get the final E-value for multiple sites in a single UTR within the whole database.
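The combination of per-site E values described above amounts to a product followed by a scaling step. The Python sketch below illustrates only that arithmetic; the function name and the numbers are assumptions, and the fitting of the exponential background distribution is not shown. The check that each single-UTR E value does not exceed 1 is also an assumption, added because the multiplication relies on E approximating a probability.

```python
from math import prod


def utr_e_value(single_utr_site_e_values, database_size):
    """Combine per-site E values (each computed as if the database were one UTR)
    by multiplication, then scale by the number of UTRs in the real database."""
    for e in single_utr_site_e_values:
        if not 0 <= e <= 1:
            # Assumption: values outside [0, 1] cannot be treated as probabilities.
            raise ValueError("single-UTR E values must lie between 0 and 1")
    return prod(single_utr_site_e_values) * database_size


# Hypothetical example: two sites in one UTR, database of 1000 conserved UTRs.
combined = utr_e_value([0.01, 0.02], 1000)
```

With these toy inputs the two sites together give a combined E value of about 0.2 for the whole database.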

As E-values correspond to the number of background matches that are expected to occur by chance, larger E-values (E-values scale from 0 to infinity) are less significant whereas E-values close to 0 (preferably less than 10^-1, 10^-2, 10^-3, 10^-5, 10^-10, 10^-15 or lower) are significant.

Accordingly, in a refined version of the exact model, the following considerations are taken into account:

(1) Any predicted target with a Z score of less than 3 is considered unlikely to be functional and in one application of the method may be discarded. In a second application of the method, such sites may be retained but as they rank low, they are unlikely to be considered as valid targets. In a less preferred application the cutoff threshold can be set lower to exclude sites with Z values of, for example, less than 2.5, less than 2 or less than 1. (This will increase the number of false positives.)

(2) Predicted targets that overlap the coding sequence of another gene are considered as less likely to be valid targets because the overlap with a coding sequence would suggest that the conservation of the sequence, and hence its inclusion in the conserved UTR database, is due to the function of the coding sequence rather than that of the 3'UTR.

(3) Using a conserved database wherein the 3'UTRs used in step (a) have been experimentally validated (see above). Normally, the UTR will be validated in the reference organism (for example, D. melanogaster) and then compared to predicted UTRs from a second organism (for example, D. pseudoobscura). In the case of mouse and human it is possible that the UTR could be validated in either one, or in another organism such as the fish, and the method could be adjusted to use validated UTR data from non-human species and search against predicted human UTR sequence for the cases where there are no validated human UTR data.

For the indel model the alignment should contain copies of the miRNA reverse complement with central nucleotides (preferably 0, 1, 2, and 3 nucleotides) deleted or inserted.

For the loop model, the alignment should contain copies of the miRNA reverse complement with the central nucleotides (preferably 3 to 6 nucleotides) deleted. It is known that miRNA molecules often have a bulge of unpaired nucleotides when bound to the target sequence and this model is designed to identify such targets.

Figures 8 (A-C) illustrate how three of the models described above penalize sequence mismatches and where they are more and less permissive for mismatches and gaps. Another model that can be utilised in the identification of target molecules for miRNA in step a) of the methods described above is the "gapped" model. This model is preferably HMM-based and is designed to favour alignment of the 5' end of the miRNA and allow more flexibility in the positioning of the 3' end alignment, thus reflecting the real-life interaction between miRNAs and their targets. The model can be described as follows:

(1) search a 3'UTR database (preferably a conserved 3' UTR database) for a complementary sequence to the 5' region of the miRNA;

(2) search a 3'UTR database (preferably a conserved 3' UTR database) for a complementary sequence to the 3' region of the miRNA; and

(3) select as a putative target sequence, a sequence that comprises the 5' region of step (1) upstream to the 3' region of step (2), separated by a maximum threshold distance (for example, 4, 5 or 6 nucleotides, preferably 5) so that the total length of the target sequence does not exceed the length of the miRNA plus this threshold distance.

Preferably, the length of the 5' region is more than or equal to 5, 6, 7 or 8 or more nucleotides in length. More preferably, the 5' region is about 8 nucleotides in length. The selection of these lengths is based on examination of known and predicted targets.

Preferably, the length of the 3' region is more than or equal to 2, 3, 4, or 5 nucleotides in length. More preferably, the 3' region is about 5 nucleotides in length. The selection of these lengths is based on examination of known and predicted targets. This method allows for some flexibility in the alignment without dramatically increasing the number of possible alternative 3' alignments for each 5' match.

In a preferred embodiment, the method for identifying a target molecule comprises using both the exact model and the gapped model and consolidating the results. A target site scoring highly for both models increases the probability that that particular target site is a valid target in vivo rather than simply random sequence.

In another preferred embodiment of the above-described methods, a model that can be utilised in the identification of target molecules for miRNA is one herein described as the 5' 8nt model. Based on examination of the alignment of known miRNA targets, the inventors have determined that the first residues in the target sequence are mismatched or interrupted much less often than in other locations. On this basis a model (the hmmer model) has been used to search for similarity to the initial few bases at the 5' end of the miRNA (both conventional and GU). The first 8 bases have been determined herein to be mismatched or interrupted much less often than in other locations, although the method will also work using fewer than 8 (for example, 7, 6 or 5) or more than 8 (for example, 9, 10, 11 or 12) residues for the search. There are no absolute limits but of course perfect matches of 6 nt occur every 4Kb or so, meaning that the problems of distinguishing true from false increases dramatically as the number of residues used decreases. Using fewer than 8 nucleotides works, but increases the number of false positives because it reduces the maximum sequence content of the match. More than 8 nucleotides also works, but may lead to spurious results being assigned better scores than valid targets if the gap or mismatch occurs in residues 9-12 as often happens (thus the 5' 8nt method performs better than the original "exact" model).

Matches to the first residues are then extended to a length of the miRNA sequence plus a number of additional residues (for example, 1, 2, 3, 4, 5, 6, 7 or more, preferably 5). Extending residues much longer than 5 increases the risk that the 'overhang' from the extended sequence might be able to form a secondary structure using programs such as mfold that will give a spuriously favourable folding energy, so there are diminishing returns to longer gap allowances. Extensions of the matched sequence to the length of miRNA plus less than 5 nucleotides can also work but limit the flexibility in the length of target site loops. N+5 thus appears to be a good compromise based on the inventors' trials with known targets. It is quite possible that N + other numbers might work better for some miRNAs and the use of such parameters is included as an aspect of the invention.

This 5' 8nt model has been found to work as well as the gapped method (see above) in allowing flexibility on size and position of loops in the target sequences, but is simpler.

Using the methods that are described above, profile-based sequence searches are performed of a database to generate lists of possible targets. The method works using any DNA/RNA database, but the value of the results will improve depending on the quality and size of the database. Preferably, the database searched is a database comprising 3'UTRs (e.g. a transcriptome database from which the 3'UTRs are identified), since some characterized miRNAs are known to bind to their target sequences in the 3'UTRs of genes. Best results will be generated from a database of well-annotated or experimentally verified 3'UTRs. If genetic evidence can be used to restrict the targets to a set of proteins or a specific region of the genome, the database size decreases and the sensitivity of the method thus increases. As with the method of the first aspect of the invention, a number of different methodologies that are capable of searching a database, for example, using a profile HMM, might be utilised in these aspects of the invention. One preferred methodology is that provided by the HMMER tool (Eddy, 1995, Proc. Third Int. Conf. Intelligent Systems for Molecular Biology, C. Rawlings et al., eds. AAAI Press, Menlo Park. pp. 114-120). For example, the software "hmmbuild" from the HMMer package may be used to build HMMer profiles from the alignments, using a null model that corrects for the expected sequence length of 25 nucleotides. The profiles may be calibrated with "hmmcalibrate" and used to search a database with "hmmsearch" (E-value threshold <100).

At this level, the ranking of putative target sequences generated from performing such a profile-based search is not statistically significant. This is partly due to the fact that tools such as HMMer do not take into account the possibility of G:U base pairing.

To address this problem, in the first embodiment of the second aspect of the invention, it has been found necessary additionally to evaluate the ability of a putative target sequence to form a secondary structure with the miRNA of interest. In order to do this, a number of techniques available for the prediction of RNA secondary structure may be implemented. As the miRNA and each predicted target are independent sequences it is necessary first to connect them pairwise into single sequence strings. Accordingly, in step b) of the method, the putative target sequence from step a) is extended to include a hairpin-forming linker sequence immediately downstream of the complement sequence. This may be done in a number of ways - for example, a PERL program may be used to extend the miRNA sequence. The hairpin-forming linker sequence used is preferably a canonical hairpin-loop such as the sequence GGGGAC (Mathews, J. Sabina, M. Zuker & D.H. Turner (1999), J. Mol. Biol. 288, 911-9). Of course, the hairpin-forming linker sequence of the method of the second aspect of the invention is not present in vivo, since in nature, the target sequence and the miRNA are not usually found in close proximity on the same RNA molecule, but rather, are brought together by means of complementary base pairing. The function of the hairpin-forming linker sequence in the present method is to enable the two structures to be folded in close proximity to each other and thus interact. "A hairpin-forming linker sequence", as used herein, refers to a sequence that is incapable of binding to itself by means of complementary base pairing. Also encompassed by this definition is a sequence that has its endmost 5' base and 3' base paired together. The hairpin-forming linker sequence is of such a length that allows any base pairing between the miRNA molecule and the target molecule to occur, but does not take part in the base pairing itself.
Shorter sequences are preferred to minimize their contribution to the calculated free energy of folding. Longer sequences have a higher probability of generating undesired secondary structures by pairing with the arms of the hairpin, and by so doing, affecting the overall free energy of folding. Even more preferably, the hairpin-forming linker sequence is GCGGGGACGC. The sequence of the hairpin-forming linker sequence is important in that it should not add to the overall stability of the folded structure. Any sequence that forms a hairpin can be used, as can any sequence that does not itself contribute to base pairing but that does not impede formation of a hairpin driven by base-pairing between the miRNA and its target (i.e. the arms of the hairpin).

In one optional implementation of this aspect of the invention, complementary nucleotides can be included at each end of the extended sequence in order to stabilize the extended molecule. More preferably, a pair of stabilizing nucleotides is included at each end. Yet more preferably, the complementary nucleotides are stabilizing GC pairs. Even more preferably, a hypothetical molecule with the following organization is generated (GC-predicted target-GGGGAC-miRNA-GC). In the preferred implementations of the exact model and the gapped model and in the 5' 8nt model this implementation has been found unnecessary and is preferably not used.
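The construction of the hypothetical test sequence (predicted target, hairpin-forming linker, miRNA, with optional stabilizing GC nucleotides at each end) is a simple concatenation. The Python sketch below is one possible illustration; the function name and parameter names are assumptions, and the default linker is the GCGGGGACGC sequence named in the text. The resulting string would then be passed to a secondary structure prediction program, which is not shown here.

```python
def build_test_sequence(target, mirna, linker="GCGGGGACGC", gc_clamp=False):
    """Join target and miRNA through a hairpin-forming linker.

    target   -- putative target sequence, written 5'->3'
    mirna    -- miRNA of interest, written 5'->3'
    linker   -- hairpin-forming linker placed immediately downstream of the target
    gc_clamp -- if True, add the optional stabilizing GC nucleotides at both ends
    """
    core = f"{target}{linker}{mirna}"
    return f"GC{core}GC" if gc_clamp else core


# Hypothetical toy inputs, chosen only to make the layout easy to read.
plain = build_test_sequence("AAAA", "UUUU")
clamped = build_test_sequence("AAAA", "UUUU", gc_clamp=True)
short_linker = build_test_sequence("AAAA", "UUUU", linker="GGGGAC")
```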

A preferred method for prediction of the secondary structure of hypothetical molecules generated in this manner is the program, mfold (Zuker et al, 1999; Mathews et al, 1999; http://www.bioinfo.rpi.edu/applications/mfold/old/rna form3.cgi). This server outputs structural description text files, which can be retrieved and evaluated on the basis of free energy (ΔG), the number of paired bases, the position of loops and mismatches to prepare lists of possible targets.

In the first embodiment of the second aspect of the invention, a sequence that is selected as a candidate target molecule is a putative target sequence for which the hypothetical test sequence generates a predicted stem loop structure with a low predicted ΔG and where the sequence of the miRNA of interest is paired to the putative target sequence and does not by itself form the loop of the stem loop structure. Similarly, the candidate target molecule should not form the loop of the stem loop structure. The term "paired" in this context includes a target that pairs perfectly, but includes G:U base pairs.

Preferably, at least one bulge or mismatch is required between the sequence of the miRNA of interest and the putative target sequence in the hypothetical test sequence. A "bulge" in a secondary structure as this term is used herein refers to a sequence of unpaired nucleotides wherein the sequences immediately upstream and downstream of the unpaired nucleotides are paired to a complementary sequence on an opposite strand and the bulge is formed because the sequence of unpaired nucleotides does not have a complementary sequence on the opposite strand. A "bulge" includes secondary structures generated by a single mismatch as well as bulges generated by at least 2 nucleotides. For stability reasons, the method of the invention requires that the secondary structure consists of fewer than four bulges or loops. As the number of bulges allowed increases, it becomes harder to discriminate signal from noise. However, as more inaccuracy is permitted, more possible sequences become 'valid' targets. The parameters of the present model allow for enough discrimination between signal and noise, yet aim to maximise the number of possible valid targets. In one embodiment, 2 or more of the endmost nucleotides of the hypothetical test sequence may be paired. The 2 endmost nucleotides may consist of Gs and/or Cs. The endmost pairings may increase the stability of the overall molecule. Again, however, this implementation is only optional - in the preferred implementations of the exact model and the gapped model and in the 5' 8nt model this implementation has been found unnecessary and is preferably not used.

Preferably, the hypothetical test sequence of step c) has a predicted ΔG of less than -10 kJ/mol, which is considered to be a stable complex. More preferably, the hypothetical test sequence of step c) has a predicted ΔG of less than -20 kJ/mol, even more preferably, of less than -25 kJ/mol, -30 kJ/mol or less than -35 kJ/mol. Even more preferably, the hypothetical test sequence of step c) comprises or consists of the formula target sequence-GCGGGGACGC-miRNA sequence (or {GC}-target sequence-GCGGGGACGC-miRNA sequence-{GC} if this implementation is being used). Using this formula in the described method, the target sequences of lin-14 and lin-28 were successfully identified when using the sequence of lin-4 as the miRNA sequence of interest (see examples section). In this example, the hypothetical test sequence used had the formula -target sequence-GCGGGGACGC-GUGAGAUCAUUUUGAAAGCUG-.

The method of the second aspect of the invention can also be used for testing the effect on binding when a known miRNA homolog or target sequence is altered or mutated. Such a method may be useful in drug design or in therapy in general. For example, a mutated miRNA homolog or target sequence that binds more efficaciously to target might be used to modulate the natural physiological operation of wild type miRNA sequences in an organism.

In the method of the second embodiment of the second aspect of the invention, it is not necessary to evaluate the ability of a putative target sequence to form a secondary structure with the miRNA of interest. In this embodiment, in step b), the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest is predicted; in step c), a candidate target molecule is selected which is predicted to base pair with the miRNA of interest with a favourable predicted free energy ΔG.

Preferably, step a) of the method of the second embodiment of the second aspect of the invention uses the 5' 8nt model, or a variation of this model, to search a database of nucleic acid sequences to identify a putative target sequence that comprises a homologous reverse complement sequence to the miRNA of interest. This model is described above. The HMMER search tool, or another profile-based search tool, may be used. Alternatively, a simpler search method may be used, in which a search is performed for sequences that are complementary to bases 2-7 of the miRNA of interest. This target sequence is extended and it is specified that base pairing is required in at least 7 of the first 8 positions (e.g. 1-7 or 2-8). Matches to the first bases are then extended to the length of the miRNA sequence plus a number of additional bases (for example, 1, 2, 3, 4, 5, 6, 7 or more, preferably 5) and evaluated for alignment to the entire miRNA. It is known that some valid target sites contain G:U base pairs. The stringency of the search can thus be adjusted by allowing G:U base pairs. A preferred method allows 1 G:U base pair in positions 2-7 (and thus a total of 3 if positions 1 and 8 are considered). An alternate version of the method allows more G:U base pairs in positions 2-7. The maximum number is defined by the possibility of forming G:U base pairs with the miRNA sequence.

In step b) of the method of the second embodiment of the second aspect of the invention, the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest is predicted. In this way, an evaluation of the relative quality of alignment is permitted. "Free energy of base-pairing", measured as ΔG, is a measure of the strength of binding between miRNA and target; the more negative the energy of folding is for a molecular structure, the more favoured such a structure is thermodynamically. This step of the method thus selects for those pairs of molecules whose complexes have a low ΔG and thus are predicted to have stable base pairing (when compared to the energy of the uncomplexed RNA molecules). This may be performed using any methodology that is capable of aligning sequences and predicting the free energy of folding between them. One preferred tool is the alignment software package generated by Marc Rehmsmeier (University of Bielefeld), termed RNAhybrid (see http://bibiserv.techfak.uni-bielefeld.de/rnahybrid/submission.html). One specific advantage of this methodology is that RNAhybrid does not require concatenation of the miRNA and target sequences. This eliminates the need for addition of a hairpin-forming linker sequence. It also allows for mispairing in position 1, which we observe in valid targets.

In step c) of the method of the second embodiment of the second aspect of the invention, a candidate target molecule selected is that which is predicted to base pair with the miRNA of interest with a favourable predicted free energy ΔG. By "favourable" predicted free energy ΔG is meant that the complex of miRNA and target has a predicted ΔG of less than -10 kJ/mol, which is considered to be representative of a stable complex. More preferably, the hypothetical test sequence of step c) has a predicted ΔG of less than -18 kJ/mol, more preferably, equal to -20 kJ/mol, -21 kJ/mol, -22 kJ/mol, -23kJ/mol, -24 kJ/mol, -25 kJ/mol, -26 kJ/mol, -27 kJ/mol, -28 kJ/mol, -29 kJ/mol, -30 kJ/mol, -31 kJ/mol, -32 kJ/mol, -33 kJ/mol, -34 kJ/mol, -35 kJ/mol, or below.

To increase the probability of the identified potential target site being a target site in vivo, certain steps may be added to the above-described method. These improvements incorporate a measurement of the significance of predicted target sites. Notably, significance may be measured either using "Z" scores, or using "E" values as described above. Use of "Z" scores is preferred.

A further preferred feature of the methods of the second aspect of the invention involves a comparison of the quality of the sequence conservation of the target sites in related genomes (for example, Drosophila melanogaster and Drosophila pseudoobscura; human and mouse). This approach can be extended to include a third genome or any number of related genomes as desired. Increasing the number of genomes evaluated improves the filter for conservation during evolution and hence reduces false positives due to random matches. The predicted sites in the two (or more) genomes are thus evaluated not only for their free energy of folding, but also for the degree of conservation of the sequences across the tested genomes (i.e. do the two target sites base pair similarly to the miRNA or are the folding energies generated by structurally different alignments). This gives a factor that scales the score for the free energy of folding. For example, if the base pairing across genomes gives a different structure, then the sequence is unlikely to be evolutionarily ancient and thus more likely to be a false positive candidate.

A particular method according to the second embodiment of the second aspect of the invention is preferred. This method for identifying the target molecule of an miRNA of interest comprises the steps of: a) searching a database of nucleic acid sequences to identify a putative target sequence that comprises a homologous reverse complement sequence to the miRNA of interest, wherein i) a search is performed for a target sequence that is complementary to bases 2-7 of the miRNA of interest; ii) a target sequence identified in step i) is extended and it is specified that base pairing between target and miRNA is required in at least 7 of the first 8 bases; iii) a target sequence identified in step ii) is extended to the length of the miRNA sequence plus a number of additional bases, preferably 5 bases, and evaluated for alignment to the entire miRNA; b) predicting the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest; c) selecting as the candidate target molecule, a putative target sequence which is predicted to base pair with the miRNA of interest with a favourable predicted free energy ΔG.
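Steps a)i) to a)iii) above can be sketched with plain string matching. The Python example below is a simplified illustration under stated assumptions: only Watson-Crick pairs are considered (G:U pairs are not handled), pairing is taken to be antiparallel so that the target bases facing miRNA positions 1 and 8 lie immediately downstream and upstream of the match to bases 2-7, and the extended window of length N+5 is taken upstream of the base facing position 1. The function names and the test sequences are hypothetical. Steps b) and c) (free energy prediction and selection) are not covered.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_candidate_sites(mirna, utr, extra=5):
    """Return (start, end, window) for each site passing steps a)i)-a)iii)."""
    seed_site = reverse_complement(mirna[1:7])  # step i): complement of bases 2-7
    window_len = len(mirna) + extra             # step iii): miRNA length + 5
    sites = []
    start = utr.find(seed_site)
    while start != -1:
        pos8 = start - 1  # UTR index facing miRNA base 8
        pos1 = start + 6  # UTR index facing miRNA base 1
        paired8 = pos8 >= 0 and COMPLEMENT[mirna[7]] == utr[pos8]
        paired1 = pos1 < len(utr) and COMPLEMENT[mirna[0]] == utr[pos1]
        # step ii): bases 2-7 are paired, so 7 of 8 needs base 1 or base 8 paired
        if paired1 or paired8:
            end = min(len(utr), pos1 + 1)
            begin = max(0, end - window_len)
            sites.append((begin, end, utr[begin:end]))
        start = utr.find(seed_site, start + 1)
    return sites


# Hypothetical 22-nt miRNA and a UTR carrying its perfect complement.
demo_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
demo_utr = "A" * 10 + reverse_complement(demo_mirna) + "C" * 10
demo_sites = find_candidate_sites(demo_mirna, demo_utr)
```

Each returned window would then be passed, together with the miRNA, to a tool that predicts the free energy of base pairing.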

Preferably, the search tool used in step a)i) is a string recognition tool. Preferably, the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest is predicted using the RNAhybrid tool.

Preferably, the method is performed iteratively for a second genome, and optionally for one or more further genomes to improve the filter for conservation during evolution and thus reduce false positives due to random matches. Preferably, a comparison is made of the quality of the sequence conservation of the candidate target sites in related genomes to give a factor that scales the relevance of the score for the free energy of folding.

According to a third aspect of the invention, there is provided an isolated miRNA molecule identifiable by the method of the first aspect of the invention. One hitherto unknown miRNA species that may be identified using such a method is the human homolog of the Drosophila miRNA bantam. Originally, the bantam locus in Drosophila was identified in a gain-of-function screen for genes that affect tissue growth without affecting pattern (Hipfner et al., 2002, Genetics, 161:1527-1537). Its product has now been identified as a 21-nucleotide miRNA. The bantam miRNA is not among those miRNAs previously described.

An investigation of the function of this molecule has revealed that it acts in vivo to promote tissue growth by simultaneously stimulating cell proliferation and preventing apoptosis. A bantam miRNA homolog has been identified in Anopheles and predicted in human and other mammals. It is predicted that the human bantam miRNA has a comparable function to that demonstrated in Drosophila. Accordingly, one embodiment of the third aspect of the invention provides an isolated miRNA molecule that functions to suppress apoptosis and stimulate cell proliferation. Such a molecule has great potential in the treatment of diseases in which these phenomena are dysfunctional, such as cancer. Although agents are known that possess one of these properties, the identification of a small molecule that possesses both functions is of great significance. This is the first time an isolated miRNA molecule has been shown to possess both these properties. The unique combination of both these properties is particularly useful in designing treatment of hyperproliferative disorders such as cancer, including but not limited to hamartomas, inducing cell proliferation for regeneration, in particular, tissue regeneration, driving stem cell proliferation, blocking apoptosis, for example, of neurons in response to spinal cord damage or preventing virally-induced apoptosis of T-cells in AIDS patients, or treating degenerative disorders, including but not limited to neurodegenerative diseases such as Alzheimer's disease.

By "miRNA molecule" is meant a short RNA molecule that acts to regulate expression of another gene, by a mechanism including but not limited to base pairing with the target RNA leading to RNA degradation (RNAi) or base pairing with the target RNA leading to translational control. Other mechanisms may involve base pairing with the target RNA leading to alteration of transcription, splicing, chromatin structure etc. Preferably, an miRNA molecule according to the invention is between 19 and 28, more preferably between 20 and 25, even more preferably between 21 and 23 nucleotides in length. Such molecules may be synthesised with a natural ribose phosphate backbone and natural bases, as normally found in RNA molecules, or alternatively, may be synthesised with non-natural backbones, for example, 2'-O-methyl RNA, to provide protection from ribonuclease degradation and may contain modified bases. miRNA molecules may be modified to increase intracellular stability and half-life. Possible modifications include, but are not limited to, the addition of flanking sequences at the 5' and/or 3' ends of the molecule or the use of phosphorothioate or 2'-O-methyl rather than phosphodiester linkages within the backbone of the molecule. This concept is inherent in the production of PNAs and can be extended in all of these molecules by the inclusion of non-traditional bases such as inosine, queuosine and wybutosine, as well as acetyl-, methyl-, thio- and similarly modified forms of adenine, cytidine, guanine, thymine and uridine which are not as easily recognised by endogenous endonucleases.

By "apoptosis" is meant the process of programmed cell death or cell suicide. Assays to measure suppression of apoptosis will be known to a person skilled in the art.
For example, suitable assays include: cytotoxicity assays (both radioactive and non-radioactive) that measure increases in plasma membrane permeability; colorimetric assays that measure reduction in the metabolic activity of mitochondria; assays of DNA fragmentation, in populations of cells or in individual cells, in which apoptotic DNA breaks into pieces of different lengths; measurement of alterations in membrane asymmetry; measurement of activation of apoptotic caspases; and measurement of the release of cytochrome C and AIF into the cytoplasm by mitochondria. Any assay of this nature may include co-expressing the miRNA of interest and a pro-apoptotic gene and comparing cell death with a similar system where the miRNA of interest is absent. If the system co-expressing the miRNA of interest exhibits a lesser degree of cell death than the system in which the miRNA is absent, then this would suggest that the miRNA suppresses apoptosis. A working example of an apoptosis assay is described below.

A variety of methods have been devised that measure the viability or proliferation of cells in vitro and in vivo. These can be subdivided into four groups: reproductive assays can be used to determine the number of cells in a culture that are capable of forming colonies in vitro; permeability assays involve staining damaged (leaky) cells with a dye and counting viable cells that exclude the dye, or quantifying the release of substances from cells when membrane integrity is lost, e.g. lactate dehydrogenase (LDH); metabolic activity can be measured by adding tetrazolium salts to cells; and direct proliferation assays use DNA synthesis as an indicator of cell growth. A cell proliferation assay may thus compare the degree of cell proliferation in a system expressing the miRNA with one lacking this species. A working example of a cell proliferation assay is described in the examples section of this application. In a preferred embodiment of the third aspect of the invention, the miRNA molecule a) comprises or consists of the nucleotide sequence GUGAGAUCAUUUUGAAAGCUG (SEQ ID NO:1); or b) is a fragment or functional equivalent thereof that functions to inhibit apoptosis and control cell proliferation. One example of such an RNA molecule is the nucleic acid sequence recited in SEQ ID NO:1 (Drosophila bantam). Examples of functional equivalents include the sequences UGAGAUCAUUUUGAAAGCUGA (SEQ ID NO:4), UGAGAUCAUUUUGAAAGCUGAU (SEQ ID NO:5), UGAGAUCAUUUUGAAAGCUGAUU (SEQ ID NO:6). Preferably, the nucleic acid molecule consists of or comprises a sequence that is identical or complementary to any part of SEQ ID NO:1, and functions as bantam miRNA to suppress apoptosis and stimulate cell proliferation.

Included as functional equivalents according to the third aspect of the invention are miRNA sequences that exhibit significant sequence identity to the Drosophila bantam miRNA whose sequence is recited in SEQ ID NO:1 and which function to inhibit apoptosis and control cell proliferation. For example, included as functional equivalents are miRNA molecules derived from species other than Drosophila, such as other eukaryotes, including C. elegans, mammals and particularly humans. The Anopheles and human predicted bantam miRNAs are recited herein and form embodiments of this aspect of the invention.

"Identity" indicates that at any particular position in the aligned sequences, the nucleotide is identical between the compared sequences. Degrees of identity can be readily calculated (Computational Molecular Biology, Lesk, A.M., ed., Oxford University Press, New York, 1988; Biocomputing. Informatics and Genome Projects, Smith, D.W., ed., Academic Press, New York, 1993; Computer Analysis of Sequence Data, Part 1, Griffin, A.M., and Griffin, H.G., eds., Humana Press, New Jersey, 1994; Sequence Analysis in Molecular Biology, von Heinje, G., Academic Press, 1987; and Sequence Analysis Primer, Gribskov, M. and Devereux, J., eds., M Stockton Press, New York, 1991). By significant sequence identity is meant that the functional equivalent exhibits at least 85% identity over its entire length to a nucleic acid molecule with the sequence recited in SEQ ID NO:l, preferably, at least 90%, more preferably at least 95%, even more preferably at least 99% or more identity, provided that the miRNA molecule functions to inhibit apoptosis and control cell proliferation. Identity in the first 8 residues of the miRNA is likely to be most important.

A functional equivalent according to this aspect of the invention may be in the form of RNA, or in the form of DNA, including, for instance cDNA, synthetic DNA or genomic DNA. Such nucleic acid molecules may be obtained by cloning, by chemical synthetic techniques (using techniques such as solid phase phosphoramidite chemical synthesis) or by a combination thereof. Also included as functional equivalents are compounds that possess the same conformation as a domain of the miRNA that is responsible for its physiological function, that is, it is able to bind specifically to target sequences of the miRNA. Accordingly, this term is meant to include any macromolecule or molecular entity that mimics the conformation of the miRNA or that possesses an equivalent shape to that possessed by the binding sites of the bantam miRNA whose sequence is identified in SEQ ID NO:1.

Included as fragments of miRNAs according to this aspect of the invention are fragments of the identified RNA molecule which include the portion of the miRNA responsible for recognition and binding to its target molecule. By "fragment" is meant any portion of the entire miRNA sequence that retains a physiological function of the wild type miRNA, such as for example, an ability to bind specifically to the target sequences of the miRNA. The ability of a miRNA sequence to bind to its target molecule is easily measured, for example using a Northern blot or other conventional binding assay, or a functional assay as described above which measures the ability of an miRNA species to inhibit apoptosis or control cell proliferation. An miRNA is considered to bind specifically to a target molecule if hybridisation is effected under high stringency conditions. The term "hybridization" as used here refers to the association of two nucleic acid molecules with one another by hydrogen bonding. Hybridization assays are known in the art (see, for example, Sambrook et al. [supra]). Conditions of "high stringency" refers to conditions in a hybridization reaction that favour the association of very similar molecules over association of molecules that differ. An example of high stringency hybridisation conditions would be hybridization in 7% SDS, 5x SSC (150 mM NaCl, 15 mM trisodium citrate), 20 mM phosphate buffer pH7.2, and 1 x Denhardt's solution overnight, followed by washing the filters in 5% SDS and 3% SSC at 50°C. However, such an in vitro assay may not reflect the conditions in vivo since miRNA interactions in vivo may involve RNP complexes. In that respect, an in vivo functional assay may be preferable to an in vitro hybridization assay.

Such an in vivo functional assay may include comparing the expression levels of a reporter gene in: a) a first cell that comprises a reporter gene and which encodes a target sequence for the miRNA of interest in the 3'UTR of the reporter gene with b) a second cell that is genetically identical to the first cell with the exception that the reporter gene contains no target sequence for the miRNA of interest. An miRNA is considered to bind to a target molecule if the level of expression of the reporter gene in the first cell is reduced significantly compared to the level of expression of the reporter gene in the second cell. The cell of this assay may be part of an organism, e.g. a fly, or may be part of a culture of cells (see examples). Accordingly, fragments containing single or multiple nucleotide insertions, deletions and substitutions from either terminus of the miRNA or from internal stretches of the miRNA are included in this aspect of the present invention. Fragments of functional equivalents, such as fragments of the human bantam miRNA, are also included within the terms of the present invention.

In a fourth aspect of the invention, there is provided an isolated nucleic acid molecule obtainable by the method of the second aspect of the invention, that is, the target molecule of a miRNA. Preferably, the target nucleic acid molecule is an RNA molecule, generally derived from the 3'UTR of a gene.

In one embodiment of the fourth aspect, the isolated RNA molecule is involved in apoptosis and/or cell proliferation. The target nucleic acid molecule of this aspect of the invention comprises or consists of a sequence that is complementary to an miRNA molecule of the third aspect of the invention, or a fragment or functional equivalent thereof. Examples of such target nucleic acid molecules are given herein. Definitions of the terms fragment and functional equivalent are provided above. For example, fragments of the target molecule include nucleic acid molecules which encode the portion of the target molecule that is recognised by its cognate miRNA. By "fragments" is thus meant any portion of the target nucleic acid molecule that retains a physiological function of the wild type target molecule, such as for example, an ability to bind specifically to its cognate miRNA. The ability to measure the binding of a target molecule is readily apparent on reading this application. Such binding assays may include the method described in the second aspect of the invention or conventional binding assays known to a skilled person in the art. Such conventional binding assays may include the technique of northern blotting or assays that assess the ability of the miRNA to control degradation or translation of a reporter gene containing the target sequence in cells or in animals. The functional assay described in the third aspect of the invention may also be used for measuring the binding ability of a target molecule to its cognate miRNA. A target molecule is considered to bind to its cognate miRNA if the level of expression of the reporter gene in the first cell is lower than the levels of expression of the reporter gene in the second cell. Functional equivalents include target nucleic acid molecules that possess significant sequence identity with the wild type target molecule in the region to which the miRNA binds. 
By significant sequence identity is meant that the functional equivalent exhibits at least 85% identity over its entire length to a nucleic acid molecule with the complement of the miRNA molecule, preferably at least 90%, more preferably at least 95%, even more preferably at least 99% or more identity, provided that the functional equivalent retains the ability to bind to its cognate miRNA. Again, identity in the first 8 residues is likely to be most important. It will be appreciated that individual or multiple nucleotide insertions, deletions and substitutions may also be made without departing from this aspect of the invention. Included in the invention as functional equivalents are invertebrate and vertebrate homologs of the target molecules. The term "functional equivalents" is also intended to include fragments or variants of the target molecule or closely related polynucleotide sequences exhibiting significant sequence homology. Modifications of nucleic acid target molecules, such as to avoid degradation by RNases, are also included within the terms of this aspect of the invention, as described above for the third aspect.

Of course, target nucleic acid molecules do not function to inhibit apoptosis or control cell proliferation, but are themselves acted on by miRNA species in order that such effects are elicited. Accordingly, target nucleic acid species may be overexpressed or expressed at lower levels than usual to effect changes in the degree of apoptosis and cell proliferation normally evident. Such target nucleic acid molecules may also be used in assays to measure the efficacy of miRNA molecules themselves. Using the method of the second aspect of the invention, a prototypic example of a target nucleic acid molecule has been identified. This target nucleic acid molecule is a target for Drosophila bantam miRNA and forms part of the hid gene, a gene that is known to encode a protein with pro-apoptotic properties (see examples section of the present application). This aspect of the invention thus provides a nucleic acid molecule, preferably an RNA molecule, that comprises the nucleotide sequence

UAGUUUUCACAAUGAUCUCGGGGGGACGUGAGAUCAUUUUGAAAGCUG (SEQ ID NO:2) or

GCCAUAUUCAAAUUGGUCUCACGGGGACGUGAGAUCAUUUUGAAAGCUGGC (SEQ ID NO:3) or a fragment or functional equivalent thereof that functions as a target molecule for bantam miRNA. More preferably, the RNA molecule consists of the nucleic acid sequence of SEQ ID NO:2 or SEQ ID NO: 3. Further lists of nucleic acid molecules according to the invention are provided in Tables 1-5. All these nucleic acid molecules, and their homologues and functional equivalents, are included as aspects of the present invention. In a further embodiment of the fourth aspect, there is provided a nucleic acid molecule that comprises a sequence that is identical or complementary to the RNA molecule of the fourth aspect of the invention. Such a nucleic acid molecule may comprise DNA or cDNA. Target nucleic acid molecules are of significant utility for a variety of reasons, as will be clear to those of skill in the art. Principal utilities include controlling areas of cell proliferation and pattern formation in animal development. Identification of target nucleic acid molecules regulated by miRNAs can also be used to identify new drug targets that are involved in the control of cell proliferation and/or apoptosis. This regulation is likely to be post-transcriptional so these targets would not be identified by conventional functional genomics methods, which mainly rely on RNA expression profiling.

For example, it is shown herein that bantam targets are involved in control of cell proliferation. Of the genes currently known in the literature that control cell proliferation, only Ex contains bantam target sequences; this ranks in position 18 (table 1) when using a conserved 3' UTR database such as that generated by the inventors using the methodology described above. However, the Ex mutant phenotype is distinct from what would be expected for bantam targets and experiments performed by the inventors have established that Ex is not regulated by bantam. Identification of these target sequences is thus useful in developing novel regulators of cell proliferation and/or apoptosis. Such novel regulators could act by inhibiting binding of the bantam miRNA to the target sequences, resulting in a decrease in cell proliferation and/or apoptosis. For example, such novel regulators might simply comprise or consist of further copies of the target sequence, since when inserted into a cell, they could quench the available miRNA hence preventing any interaction between the miRNA and the true target.

The invention also includes cloning vectors comprising the nucleic acid molecules of the third and fourth aspects of the invention. Such cloning vectors will incorporate the appropriate transcriptional and translational control sequences, for example, enhancer elements, promoter-operator regions, termination stop sequences and RNA stability sequences.

Vectors according to the invention include plasmids and viruses (including both bacteriophage and eukaryotic viruses). Many such vectors are well known and documented in the art. For further details see Sambrook et al., 2001, Molecular Cloning: a Laboratory Manual. Many known techniques and protocols for manipulation of nucleic acid, for example, in the preparation of nucleic acid constructs, mutagenesis, sequencing, introduction of DNA into cells and gene expression, and analysis of proteins, are described in detail in Short Protocols in Molecular Biology, Second Edition, Ausubel et al. eds., (John Wiley & Sons, 1992) or Protein Engineering: A practical approach (edited by A. R. Rees et al., IRL Press 1993).

A further aspect of the present invention provides a host cell containing a nucleic acid or vector comprising a nucleic acid molecule according to the third or fourth aspect of the invention. A still further aspect provides a method comprising introducing such nucleic acid into a host cell or organism. In one embodiment, a nucleic acid of the third or fourth aspect of the invention may be integrated into the genome (e.g. chromosome) of the host cell. Integration may be promoted by inclusion of sequences which promote recombination with the genome, in accordance with standard techniques. Transgenic animals transformed so as to express or overexpress in the germ line one or more nucleic acid molecules or functional equivalents as described herein form a still further aspect of the invention, along with methods for their production. Many techniques now exist to introduce transgenes into the embryo or germ line of an organism, such as for example, illustrated in Watson et al., (1994) Recombinant DNA (2nd edition), Scientific American Books. According to a yet further aspect, the present invention provides a method of treatment of hyperproliferative disease, including, but not limited to, cancers and hamartomas, or conditions involving regeneration of tissues or cells, including but not limited to neurodegenerative disorders such as Alzheimer's disease, in a patient, comprising administering to the patient a nucleic acid molecule of the third or fourth aspect of the invention, or a vector or host cell as described above, in a therapeutically-effective amount. Such a method may incorporate a method of gene therapy of a pathological condition caused by a gene mutation in a patient, comprising administering to the patient a nucleic acid of the present invention in a therapeutically-effective amount.

Preferably, the present invention provides a method of treatment of hyperproliferative disease, including, but not limited to, cancers and hamartomas, in a patient, comprising administering to the patient a nucleic acid molecule of the fourth aspect of the invention, a compound that blocks bantam function, or a vector or host cell as described above, in a therapeutically effective amount.

In another embodiment, the present invention provides a method of treatment of diseases resulting from hypoproliferation of cells, including, but not limited to neurodegenerative diseases in a patient comprising administering to a patient a nucleic acid of the third aspect of the invention, or a vector or host cell as described above in a therapeutically effective amount.

In yet another embodiment, the present invention provides a method of promoting growth of stem cells comprising incorporating a nucleic acid of the third aspect of the invention, or a vector as described above. The nucleic acid may be introduced into a patient by any suitable means, as will be clear to those of skill in the art. Effective methods of introduction include the use of adenovirus, adeno-associated virus, herpes virus, alpha virus, pox virus and other virus vectors that serve as delivery vehicles for expression of the gene. See generally, Jolly (1994) Cancer Gene Therapy 1: 51-64; Kimura (1994) Human Gene Therapy 5: 845-852; Connelly (1995) Human Gene Therapy 6: 185-193; and Kaplitt (1994) Nature Genetics 6: 148-153. Retroviral vectors may also be used (see Tumor Viruses, Second Edition, Cold Spring Harbor Laboratory, 1985). Preferred retroviruses for the construction of retroviral gene therapy vectors include Avian Leukosis Virus, Bovine Leukaemia Virus, Murine Leukaemia Virus, Mink-Cell Focus-Inducing Virus, Murine Sarcoma Virus, Reticuloendotheliosis Virus and Rous Sarcoma Virus.

The term "therapeutically effective amount" as used herein refers to an amount of a therapeutic agent to treat, ameliorate, or prevent the disease or condition, or to exhibit a detectable therapeutic or preventative effect. The precise effective amount for a subject for a given situation can be determined by routine experimentation and is within the judgement of the clinician. An effective dose will typically be from about 0.01 mg/kg to 50 mg/kg or 0.05 mg/kg to about 10 mg/kg of nucleic acid construct.

Non-viral strategies for gene therapy also exist that utilise agents capable of condensing nucleic acid molecules, delivering these molecules to cells and protecting them from degradation inside the cell. Vehicles for delivery of gene therapy constructs may be administered either locally or systemically.

Such strategies include, for example, nucleic acid expression vectors, polycationic condensed DNA linked or unlinked to killed adenovirus alone (see Curiel (1992) Hum Gene Ther 3: 147-154) and ligand linked DNA (see Wu (1989) J Biol Chem 264: 16985-16987). Naked DNA may also be employed, optionally using biodegradable latex beads to increase uptake. Other methods will be known to those of skill in the art.

Liposomes can act as gene delivery vehicles encapsulating nucleic acid comprising a gene cloned under the control of a variety of tissue-specific or ubiquitously-active promoters. Mechanical delivery systems such as the approach described in Woffendin et al (1994) Proc. Natl. Acad Sci. USA 91 (24): 11581-11585 may also be used.

Direct delivery of gene therapy compositions will generally be accomplished, in either a single dose or multiple dose regime, by injection, either subcutaneously, intraperitoneally, intravenously or intramuscularly or delivered to the interstitial space of a tissue.

Other modes of administration include oral and pulmonary administration, using suppositories, and transdermal applications, needles, and gene guns or hyposprays.

According to a further aspect of the invention there is provided a pharmaceutical composition comprising a nucleic acid molecule of the third or fourth aspect of the invention or functional equivalent, in conjunction with a pharmaceutically-acceptable excipient. A thorough discussion of pharmaceutically acceptable excipients is available in Remington's Pharmaceutical Sciences (Mack Pub. Co., N. J. 1991).

According to a yet further aspect, the present invention provides for the use of a nucleic acid molecule, vector, host cell, or pharmaceutical composition as described above in therapy. According to a still further aspect of the invention there is provided the use of a nucleic acid molecule or functional equivalent according to the invention in conjunction with a pharmaceutically-acceptable carrier in the manufacture of a medicament for the treatment or prevention of a hyperproliferative or hypoproliferative disease in a human or an animal.

One aspect of the invention includes the use of the RNA molecules of the present invention in assays. In particular, the RNA molecules of the present invention can be used to study the spatial regulation of miRNA during development and/or the levels of miRNA present in an organism.

Expression of miRNAs can be assessed by Northern blots but limited spatial and temporal resolution is possible. A method has been developed by the present inventors that reveals miRNA expression in vivo, and is based on the ability of miRNAs to inactivate genes by RNAi (Hutvagner and Zamore 2002; Martinez et al., 2002; Zeng et al, 2002).

According to a fifth aspect of the invention, there is thus provided an assay to measure and visualise miRNA expression comprising comparing the expression levels of a reporter gene in: a) a first cell that comprises a reporter gene and which encodes a target sequence for the miRNA of interest in the 3'UTR of the reporter gene; b) a second cell that is genetically identical to the first cell with the exception that the reporter gene contains no target sequence for the miRNA of interest.

Where the miRNA of interest is present in the cell, the miRNA acts to reduce expression of reporter gene encoding the target sequence, by directing RNAi to cleave the target sequence. This reduces expression of the reporter gene in the cell that contains the target sequence relative to the expression of the reporter gene in the cell whose reporter gene contains no target sequence for the miRNA of interest. Alternatively, the reporter gene could also contain a target sequence that would be regulated by translational control.
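By way of illustration only, the comparison between the first cell (reporter with target sequence) and the second cell (reporter without target sequence) may be sketched as a ratio of mean reporter signals. The function names, the example readings and the cut-off ratio of 0.5 are hypothetical values chosen for this sketch; the specification does not prescribe a particular numerical threshold or statistical test.

```python
from statistics import mean


def relative_reporter_expression(with_target, without_target):
    """Mean reporter signal with the target sequence divided by the control mean."""
    control = mean(without_target)
    if control <= 0:
        raise ValueError("control reporter signal must be positive")
    return mean(with_target) / control


def mirna_activity_detected(with_target, without_target, max_ratio=0.5):
    """True if the target-bearing reporter is reduced to max_ratio or below.

    max_ratio is a hypothetical cut-off, not a value from the specification.
    """
    return relative_reporter_expression(with_target, without_target) <= max_ratio


# Hypothetical fluorescence readings (arbitrary units) from replicate samples.
first_cell = [120.0, 100.0, 110.0]    # reporter with miRNA target in 3'UTR
second_cell = [400.0, 440.0, 480.0]   # reporter without target sequence

ratio = relative_reporter_expression(first_cell, second_cell)
print(ratio, mirna_activity_detected(first_cell, second_cell))
```

A real analysis would normally add a significance test across replicates; the ratio above merely makes explicit which two measurements the assay compares.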

The assay of this embodiment of the invention is generally applicable in testing and selecting compounds that modulate the activity of an miRNA with respect to its target. The assay may be directly testing the efficacy of a miRNA moiety or may be used to test a target sequence for an miRNA of interest. Furthermore, by comparing the assay system in the presence and absence of a candidate drug compound, such compounds can be tested for their ability to modify miRNA activity or the interaction between miRNA and its target sequence.

Preferably, this method is performed in vivo and thus allows in vivo miRNA expression to be evaluated. The cell may form part of an organism, particularly an insect or vertebrate organism such as a fish or a mammal. In this manner, the visualisation of reporter gene expression may be facilitated. The cell may form part of a culture of cells. Thus, in a further embodiment of the present invention, there is provided a transgenic animal or plant expressing the reporter gene recited in the assay of the fifth aspect of the invention under the control of a promoter, wherein said animal is not a human. Preferably, the animal is a vertebrate or invertebrate.

The reporter gene should be expressed under the control of a promoter, suitable examples of which will be apparent to those of skill in the art. The promoter may be an inducible or constitutive promoter. Preferably, the promoter is a constitutive promoter. In one embodiment, the promoter is the ubiquitous tubulin promoter. The reporter gene is preferably selected from the group consisting of luciferase, green fluorescent protein (and variants thereof), and horseradish peroxidase. These molecules are well characterized and their use as reporter molecules is well documented. In a preferred embodiment, the reporter molecule is enhanced green fluorescent protein (EGFP). In one embodiment of the invention, there is more than one copy of the target sequence for the miRNA of interest. Increasing the number of copies of the target sequence may increase the sensitivity of the assay since the likelihood of a cleavage event within a UTR increases with the number of copies of target sequence present. It is the cleavage of the UTR that leads to decreased expression of the reporter gene and thus the measurable phenotype. Alternatively, if the target sequence is not fully complementary to the miRNA of interest, the mechanism of regulation could be translational control.
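The statement that the likelihood of a cleavage event rises with the number of target copies can be illustrated with a simple probability sketch. It assumes, purely for illustration, that each copy of the target sequence is cleaved independently with the same probability; neither this independence assumption nor the example probability of 0.2 comes from the specification.

```python
def prob_at_least_one_cleavage(p_single: float, copies: int) -> float:
    """Probability that at least one of `copies` target sites is cleaved.

    Assumes each site is cleaved independently with probability p_single
    (an assumption of this sketch, not a statement from the text).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie between 0 and 1")
    if copies < 0:
        raise ValueError("copies must be non-negative")
    return 1.0 - (1.0 - p_single) ** copies


# Hypothetical per-site cleavage probability of 0.2, for one to four copies.
for n in range(1, 5):
    print(n, round(prob_at_least_one_cleavage(0.2, n), 4))
```

Under these assumptions the probability increases with every added copy but with diminishing returns, which is consistent with using a small number of tandem target sites in the reporter 3'UTR.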

In a particularly preferred embodiment of this aspect, the target sequence is complementary to a bantam miRNA, such as the Drosophila or human bantam miRNA. The inventors have shown that bantam miRNA can be successfully used in such an assay (see below).

In another preferred embodiment of this aspect of the invention, the assay to measure and visualise miRNA expression comprises comparing the expression levels of a reporter gene in a system wherein miRNA expression is under the control of an inducible promoter and the levels of the expression of the reporter gene are compared for when the system is in an induced or uninduced state.

In the system of the fifth aspect of the invention, a cell may further comprise a heterologous sequence encoding the miRNA of interest, that may be under the control of a constitutive or an inducible promoter system. This generates an isolated assay system with utility in testing the efficacy of drugs and the like. According to a further aspect of the present invention there is provided a drug identified by a screen according to the fifth aspect of the invention.

According to a yet further aspect of the present invention, there is provided a kit for screening hyperproliferative or hypoproliferative disorders comprising the nucleic acid molecule of the third or fourth aspect of the invention. Suitable disorders to be screened may include cell survival defective disorders, including, but not limited to Alzheimer's disease, diseases of increased cell apoptosis, including but not limited to T cells in AIDS or cancer. In a preferred embodiment, the kit measures patient bantam homolog levels in patient biopsy material, including, but not limited to carrying out a reporter gene assay as described herein on transfected cells derived from patient biopsy material. According to another aspect of the invention, there is provided a computer apparatus adapted to perform a method according to any one of the first or second aspects of the invention.

In a preferred embodiment of this aspect of the invention, said computer apparatus may comprise a processor means incorporating a memory means adapted for storing data relating to nucleotide sequences; means for inputting data relating to a plurality of nucleic acid sequences; and computer software means stored in said computer memory that is adapted such that upon receiving a request to identify an miRNA molecule or a target of an miRNA molecule, it performs a method according to any one of the first or second aspects of the invention.

The invention also provides a computer-based system for identifying novel miRNA sequences and/or novel miRNA targets, comprising means for inputting data relating to a profile of an miRNA sequence; means adapted to perform a method according to any one of the first or second aspects of the invention; and means for outputting a list of candidate miRNA molecules or candidate miRNA targets.

The system of this aspect of the invention may comprise a central processing unit; an input device for inputting requests; an output device; a memory; and at least one bus connecting the central processing unit, the memory, the input device and the output device. The memory should store a module that is configured so that upon receiving a request to identify a miRNA or miRNA target, it performs the steps listed in any one of the methods of the invention described above.

In the apparatus and systems of these embodiments of the invention, data may be input by downloading the sequence data from a local site such as a memory or disk drive, or alternatively from a remote site accessed over a network such as the internet. The sequences may be input by keyboard, if required.

The generated list of candidate miRNAs or candidate miRNA targets may be output in any convenient format, for example, to a printer, a word processing program, a graphics viewing program or to a screen display device. Other convenient formats will be apparent to the skilled reader. The means adapted to identify candidate miRNAs or candidate miRNA targets will preferably comprise computer software means, such as the computer software discussed in more detail below. As the skilled reader will appreciate, once the novel and inventive teaching of the invention is appreciated, any number of different computer software means may be designed to implement this teaching.

According to a still further aspect of the invention, there is provided a computer program product for use in conjunction with a computer, said computer program comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising a module that is configured so that upon receiving a request to identify candidate miRNAs or miRNA targets, it performs the steps listed in any one of the methods of the invention described above.

All documents mentioned in the text are incorporated herein by reference. Various aspects and embodiments of the present invention will now be described in more detail by way of example, with particular reference to bantam miRNA. It will be appreciated that modification of detail may be made without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1: map of the bantam locus

(A) EP(3)3622 is inserted in a region of 41 Kb lacking predicted genes (Hipfner D.R., Weigmann, K., & Cohen, S.M. (2002) The Bantam gene product regulates Drosophila growth. Genetics 161: 1527-1537.). The extent of the 21 Kb bantam^Δ1 deletion is indicated by the box, and shown at larger scale below. Positions of other P-element insertions are indicated. The shaded triangles indicate hypomorphic mutants for bantam. For EP-elements, the arrow indicates the orientation of GAL4-dependent transcription. DNA contained in transgenes that rescued the bantam^Δ1 deletion to viability is indicated below. The overlap of the 6.7 Kb BamHI and 9.6 Kb SpeI fragments defines the maximal extent of the bantam locus. RE64518: arrow indicates the position and size of this cDNA clone. The light-grey arrow: position of a conserved hairpin sequence. (B) ClustalW alignment of a short sequence conserved between Anopheles gambiae (Ag) and Drosophila (Dm), indicated by the arrow in (A). Shading indicates the region of highest sequence identity. (C) Secondary structures for the conserved hairpin sequences. Shading corresponds to (B).

Figure 2: bantam encodes a 21 nucleotide miRNA

(A) Northern blot comparing bantam miRNA levels. Lanes 1-4: third instar larvae. WT: wild-type; EP: Actin-Gal4 EP(3)3622; A: Actin-Gal4 UAS-A (the 6.7 Kb BamHI genomic rescue construct in the UAS vector); C: engrailed-Gal4 UAS-C (the 584 nt HpaI-SpeI fragment containing the hairpin in the 3'UTR of UAS-EGFP). Constructs are illustrated in Figure 3. Lane 5: bantam^Δ1 mutant larvae. Lane 6: S2 cells. Arrow: 21 nt bantam miRNA. P: precursor. The blot was probed with a 31 nt 5'-end labelled oligonucleotide complementary to the shaded side of the stem in Figure 1C. (B) S1 nuclease-protection mapping of the 5' and 3' ends of bantam miRNA. Total RNA from S2 cells or tRNA was annealed with the 5'-end labelled 25mer 5'CAGCTTTCAAAATGATCTCACTTGT or the 3'-end labelled 27mer 5' GACCAAAATCAGCTTTCAAAATGATCTC. Heteroduplexes were digested with S1 nuclease and resolved on 15% denaturing acrylamide gels. Lanes labelled + and ++ denote different amounts of S1 nuclease. Lanes labelled P show end-labelled probes not treated with S1 nuclease. A 21 nt fragment of the 5'-end labelled probe was protected. A 19 nt fragment of the 3'-end labelled probe was protected.

Figure 3: bantam miRNA promotes tissue growth and cell proliferation

(A) Schematic representation of the UAS transgenes used in the rescue and overgrowth assays. The arrow indicates the predicted hairpin. Rescue assays were performed by crossing the UAS construct into homozygous bantam deletion mutant flies, in the absence of a GAL4 driver. Overgrowth was assayed using tubulin-GAL4 and engrailed-Gal4. UAS-A is the 6.7 Kb BamHI genomic fragment in antisense orientation relative to the orientation of the promoter in the pUAST vector. Note that the same fragment in sense orientation also rescued the mutant, but produced a lethal phenotype when overexpressed with GAL4 drivers. UAS-B is the BamHI fragment in the sense orientation, lacking 81 nt containing the predicted hairpin. UAS-C is a 584 nt HpaI-SpeI fragment cloned into the 3'UTR of tubulin-EGFP. (B) Quantitation of the overgrowth of the posterior compartment caused by engrailed-GAL4 driven expression of the transgenes, expressed as the ratio of P:A area. P = the area bounded by vein 4 and the posterior of the wing. A = the area anterior to vein 3, as described in Hipfner D.R., Weigmann, K., & Cohen, S.M. (2002) The Bantam gene product regulates Drosophila growth. Genetics 161: 1527-1537. EP: EP(3)3622; +: no UAS transgene. A, B, C refer to the constructs depicted in panel (A). (C) Examples of wings from the experiments in (B). The P compartment is larger in the engrailed-GAL4 UAS-C wing. The wings were aligned along veins 3 and 4.

Figure 4: regulation of bantam miRNA expression during development

(A) Northern blot showing bantam miRNA at different stages of development, probed as in Fig 2A. Embryo: 3-12 hour and 12-24 hour old embryos. Larval stages: first, second, early and mid third. Pupal stages: early and mid. Adults: M=male, F=female. bantam miRNA was low in early embryos, increased in the second half of embryonic development and in first instar larvae, and then decreased through later larval and early pupal stages. The level increased again in late pupae and adults, but was similar in males and females, suggesting that there is little maternal deposition of the miRNA in the embryo. (B, C) Wing imaginal discs expressing the tubulin-EGFP reporter gene with the SV40 3'UTR. (C) The bantam miRNA sensor construct contains two copies of a 31 nt sequence perfectly complementary to the conserved sequence highlighted in Fig 1C. (D, E) Wing discs carrying the bantam miRNA sensor containing clones of cells homozygous for (D) the bantam^Δ1 deletion mutant or (E) the bantam hypomorphic allele EP(3)3622. Mutant clones showed cell autonomous elevation of bantam sensor levels (white arrows). (E) Asterisks indicate reduced expression of the sensor in the twin-spot clone homozygous wild-type for the bantam locus. (F) Detail of a wing disc showing reduced bantam sensor levels in clones of cells overexpressing bantam miRNA using EP(3)3622 (asterisk). The clones are marked by expression of lacZ (white, right panel).

Figure 5: bantam autonomously controls cell proliferation

(A) Area measurements (pixels) of 42 pairs of homozygous bantam^Δ1 mutant and wild-type twin clones. The two groups are shown separately. (B) Left panel: Wing imaginal disc showing homozygous bantam^Δ1 deletion mutant clones and homozygous wild-type twin clones. The wild-type and mutant cells are produced in the same cell division, so differences in size reflect differences in growth or cell survival after clone induction. Homozygous bantam^Δ1 mutant cells lack the βGal marker protein and are unlabelled. Homozygous wild-type cells have two copies of the marker and appear brighter than heterozygous bantam^Δ1/+ cells. Right panel: DAPI labelled nuclei of the same disc.

Figure 6: Wingless regulates bantam expression in the ZNC

(A) Wing disc expressing the bantam miRNA sensor labelled by BrdU incorporation during late third instar. (B) bantam sensor levels are low in proliferating cells of the brain hemisphere, and higher in non-proliferating cells. (C) Wing disc expressing the bantam miRNA and EGFP under ptc-Gal4 control (left) labelled by BrdU incorporation (right). Arrow: cells in the ZNC that underwent DNA synthesis due to bantam expression.

Figure 7: bantam inhibits apoptosis

(A, B) Wing imaginal discs labelled with antibody to activated caspase 3 (white). The images are projections of several optical sections. (A) ptc-GAL4 directed expression of HID using EP(3)30060. (B) As in (A) plus bantam expressed by EP(3)3622. (C) Cuticle preparations of adult wings, from top: wild-type; ptc-GAL4 + EP(3)30060; ptc-GAL4 + EP(3)3622; ptc-GAL4 + EP(3)30060 + EP(3)3622. The area bounded by veins 3, 4 and the anterior cross-vein is indicated as a percent of wild-type. Measurements are the average of 5 discs ± standard deviation. All differences were highly significant, with P values well below e-05 using a T-test. (D, E) Adult heads showing the abnormal eyes induced by GMR-Gal4 directed expression of EP(3)30060 (D) or EP(3)30060 + EP(3)3622 (E).

Figure 8: Models for searching target sequences of miRNA

PERL-generated models (8A-C) were used for searching target sequences of miRNA. Figures 8A-C illustrate how the models penalize sequence mismatches between bantam miRNA and its predicted targets, and where each model is more and less permissive for mismatches and gaps. The exact model (8A) was generated using 5 exact copies of the reverse complement. The indel model (8B) was generated using copies of the miRNA reverse complement with 0, 1, 2, and 3 central nucleotides deleted or inserted. This mimics formation of a loop of 1-3 nucleotides in the miRNA or in its target. For the loop model (8C), the alignment contained copies of the miRNA reverse complement with 3 to 6 of the central nucleotides deleted. Figure 8D shows one of the target sequences (HID protein UTR - SEQ ID NO:2) identified using Drosophila bantam miRNA to screen the Drosophila 3'UTR database using the exact model. Figure 8E shows a second possible target site (SEQ ID NO:3), found in the 3'UTR of HID using the indel model.

Figure 9: bantam regulation of HID expression in Drosophila

Figure 9A shows HID protein expressed under patched-GAL4 control using an EP insertion at the HID locus. Figure 7A shows expression of HID leading to apoptosis, visualized by antibody to the activated form of Caspase 3. Figure 9B shows coexpression of HID with bantam miRNA. Hid protein levels are much reduced, indicating a function of bantam in regulating HID expression. Figure 7B shows the effects of co-expression of HID and bantam on apoptosis. bantam miRNA reduced HID protein levels and thus reduced HID-induced apoptosis.

Figure 10: Comparison of Mfold predicted free energy between random and predicted matches

Figure 10 shows the distribution of folding energies for the bantam miRNA, comparing 10,000 randomly selected sequences and the predicted target sequences. The mean and standard deviation of ΔG was determined for the random sequences and was used to evaluate the likelihood that a predicted target site is different from random matches. Folding energies of more than 3 standard deviations above the mean are expected to occur for 0.3% of random matches. A considerable number of predicted bantam targets ranked above Z=3. Y-axis: number of sites. X-axis: ΔG calculated for each site by Mfold.

Figure 11: a) Alignment of target sites in genes of the E(spl) and Brd complexes. Light grey indicates identity; dark grey shows a mismatch; black bars show positions of bulges in the target sequence. b) Left panel shows mir-7, right panel shows the same disc indicating GFP which is reduced in the miR-7 expressing cells. c) Left panel shows mir-7, right panel shows the same disc indicating GFP which is reduced in the miR-7 expressing cells. d) The predicted miR-7 binding site is conserved across 5 genomes, and shows striking conservation of alignment at the 5' and 3' ends of the predicted miRNA binding site.

Figure 12: a) reaper, grim and the third pro-apoptotic gene sickle are clustered in the genome and show blocks of high conservation in their 3' UTRs, which include the miR-2a sites. b) Alignment of the miR-2a sites shows a very similar pattern of predicted miRNA binding for reaper and grim. c) GFP expression detected in immunoblots of cells transfected with the reaper 3' UTR construct.

Figure 13: Validation of predicted Bantam targets by improved 5' 8nt method

1-8 indicates position in miRNA. Black = mispairing; grey = G:U base pair; white = conventional base pair.

EXAMPLES

Experimental procedures

Strains

EP(3)30060 directs expression of HID and was identified by Mata et al (Mata, J., Curado, S., Ephrussi, A., and Rorth, P. 2000. Tribbles coordinates mitosis and morphogenesis in Drosophila by regulating string/CDC25 proteolysis, Cell 101, 511-22). GMR-Gal4, ptc-Gal4, engrailed-Gal4, tubulin-GAL4 and actin-Gal4 are described in flybase (http://fly.bio.indiana.edu/gal4.htm). UAS-Dfz2GPI is described in Cadigan, K. M., Fish, M. P., Rulifson, E. J., and Nusse, R. (1998). Wingless repression of Drosophila frizzled 2 expression shapes the Wingless morphogen gradient in the wing, Cell 93, 767-777.

Transgenes:

Genomic rescue constructs: 9.6 Kb SpeI and 6.7 Kb BamHI fragments of BAC AC011907 were cloned into pUAST digested with XbaI or BglII. The ability of the transgenes to rescue was assayed in homozygous bantam deletion mutant flies lacking a GAL4 driver. Hairpin deletion rescue construct: residues 14192-14689 and 14770-15097 of AE003469 were PCR amplified with a NotI site added following residue 14689 and preceding 14770. Ligation at the NotI site deleted 81 nt containing the hairpin. This fragment was inserted to replace the HpaI-SpeI fragment in pUAST-BamHI.

Heterologous hairpin expression construct: a 584 nt HpaI-SpeI fragment was cloned into the 3'UTR of Tub-EGFP-SV40 3'UTR digested with NotI (end repaired) and XbaI. bantam sensor: two copies of the 31 nt conserved sequence in the hairpin were cloned into the 3'UTR of Tub-EGFP-SV40 EGFP.

At least two independent transgenic strains were assayed for each construct.

bantam clonal analysis

Mitotic recombination clones were induced 48±1.5 h after egg laying (AEL) in staged larvae by heat shock at 37°C for 30 min. Larval genotypes: HS-FLP1; armLacZ FRT80B/bantam^Δ1 FRT80B (or armLacZ FRT80B/EP(3)3622 FRT80B). Both genotypes were examined with and without the bantam sensor on chromosome 2. Discs were dissected at 110±1.5 h AEL, fixed in 4% formaldehyde and stained with anti-β-galactosidase antibody to mark the clones and DAPI to mark the nuclei. Clones were analysed by confocal microscopy. Clone areas were measured using Adobe Photoshop.

Northern blots: Total RNA was resolved on 15% denaturing acrylamide gels and probed with 5'-end labelled oligonucleotides as indicated in the text. A tRNA probe was used as a loading control.

S1 nuclease mapping was performed as described by Hahn (http://www.fhcrc.org/labs/hahn/methods/mol_bio meth/sl_oligo_probe.html). For 5' end mapping the 25-mer 5' CAGCTTTCAAAATCATCTCACTTGT was 5' end labelled. For 3' end mapping the 26-mer 5' GCCAAAATCAGCTTTCAAAATGATCT was annealed to a second oligo 5' GTGAGATCATTTTGGAAAGCTGA and extended by addition of dCTP. Labelled primers were annealed with RNA from S2 cells at 20°C.

Example 1 - Bantam encodes a miRNA

The bantam locus was identified by several EP-element insertions clustered in a region of ~41 Kb that lacks predicted genes (Fig. 1A). EP-elements are transposable elements designed to allow inducible expression of sequences flanking the insertion site under control of the yeast transcription factor Gal4 (Rorth, P. (1996). A modular misexpression screen in Drosophila detecting tissue specific phenotypes, Proc Natl Acad Sci USA 93, 12418-12422.). Gal4-dependent expression of the EP elements inserted at the bantam locus caused tissue overgrowth. Flies homozygous for the bantam^Δ1 deletion, which removes ~21 Kb flanking the insertion site of EP(3)3622, died as early pupae. Flies heterozygous for the bantam^Δ1 deletion and three of the P-element inserts survived and were morphologically normal but smaller than normal flies. These observations led to the conclusion that the bantam locus is involved in growth control during development (Hipfner, D. R., Weigmann, K., and Cohen, S. M. (2002). The bantam Gene Regulates Drosophila Growth, Genetics 161, 1527-37.). In an effort to molecularly define the bantam locus we produced transgenic flies carrying fragments of genomic DNA overlapping the region where P-element inserts clustered. Two fragments rescued to viability flies homozygous for the bantam^Δ1 deletion (Fig 1A). Thus the 3.85 Kb overlap of these transgenes defines the maximal extent of the bantam locus. This region contains an EST, RE64518, providing evidence for an endogenous transcript. However, expression of RE64518 under Gal4 control failed to reproduce the overgrowth phenotype caused by the EP elements (not shown). Thus RE64518 does not encode bantam function.

The bantam region does not appear to have the capacity to encode a protein with significant sequence similarity to proteins in other genomes examined. A BLAST search of the Anopheles gambiae genome with the bantam region identified a sequence with 30/31 identical residues located adjacent to RE64518 (light-grey arrow, Fig 1A). Alignment of the two genomic regions containing these sequences with ClustalW identified a block of ~90 residues with considerable similarity (Fig. 1B). The Drosophila and Anopheles sequences were each predicted to fold into stable hairpin structures using the mfold server (www.bioinfo.rpi.edu/applications/mfold/old/rna/; Fig. 1C). The region of highest similarity between these sequences was found on the same arm of the hairpin (shown by shading). These observations raised the possibility that the predicted hairpins might be precursors in the production of a miRNA.

A small RNA of ~22 nucleotides (nt) was detected in a Northern blot of total RNA from third instar larvae, using an end-labelled probe complementary to the conserved 31 nt sequence (Fig 2A, arrow). The other arm of the hairpin did not produce a miRNA product. bantam miRNA levels were elevated in total RNA from actin-Gal4>EP(3)3622 larvae (lane 2) and by Gal4-directed expression of the 6.7 Kb BamHI genomic rescue fragment (UAS-A; lane 3). A larger product was also detected, which may represent the hairpin precursor. To define the sequences necessary to produce the bantam miRNA more precisely, we cloned a 584 nt fragment containing the predicted hairpin into the 3'UTR of a heterologous transcript (UAS-C, Fig 3A). Expression of UAS-C under engrailed-Gal4 control also led to overproduction of bantam miRNA (Fig 2A, lane 4). bantam miRNA was absent from larvae homozygous for the bantam^Δ1 deletion (lane 5). Both products were detected in Schneider S2 cells (lane 6). S1 nuclease mapping was used to identify the 5' and 3' ends of the miRNA (Fig. 2B). The deduced product is the 21 nt miRNA 5' GUGAGAUCAUUUUGAAAGCUG. To verify that the miRNA produced by the predicted hairpin is the functional product of the bantam locus, a transgene was prepared consisting of the 6.7 Kb BamHI fragment that rescued the mutant, but lacking 81 nt containing the hairpin (UAS-B; Fig 3A). This construct was unable to rescue the mutant phenotype, indicating that the deleted residues are essential for bantam function. Next, the activities of the constructs in overexpression assays were compared. Expression of the two wild-type constructs, UAS-A and the 584 nt fragment in UAS-C, under engrailed-Gal4 control caused overgrowth of the posterior compartment of the wing, comparable to that obtained with EP(3)3622 (Fig 3B, C). In contrast, expression of the hairpin deletion construct, UAS-B, did not produce overgrowth. Together these observations assign bantam function to the region containing the hairpin and indicate that the 21 nt miRNA is the bantam gene product.

Example 2 - An in vivo assay for measuring levels of bantam miRNA

bantam miRNA was expressed at all developmental stages, though at varying levels (Fig 4A). To ask whether bantam miRNA expression is spatially regulated during development, an assay was developed based on the ability of miRNAs to inactivate genes by RNAi (Hutvagner, G., and Zamore, P. D. (2002). A microRNA in a multiple-turnover RNAi enzyme complex, Science 297, 2056-60; Zeng, Y., Wagner, E. J., and Cullen, B. R. (2002). Both natural and designed micro RNAs can inhibit the expression of cognate mRNAs when expressed in human cells, Mol Cell 9, 1327-33.). A transgene expressing EGFP ubiquitously under control of the tubulin promoter was prepared, with two copies of a perfect bantam target sequence placed in the 3'UTR. A comparable construct without the bantam target sequences in the 3'UTR was used as a control. Where present, bantam miRNA should reduce expression of the transgene containing the target sequences by RNAi, providing an in vivo sensor for bantam levels. The control transgene showed limited spatial modulation in the third instar wing disc (Fig 4B). In comparison, the level of the bantam sensor transgene was higher in cells near the antero-posterior and dorso-ventral (DV) boundaries and in patches in the dorsal thorax (Fig 4C).

To validate the use of the bantam sensor transgene, we asked whether its expression depended on the level of bantam miRNA. Complete removal of bantam miRNA in clones of cells homozygous for the bantam deletion increased expression of the bantam sensor to a level considerably higher than the maximal endogenous level, at the DV boundary (Fig 4D). The EP-element insertion EP(3)3622 is located 2.7 Kb from the hairpin and has previously been identified as a hypomorphic allele of bantam based on phenotypic criteria. Clones of cells homozygous mutant for EP(3)3622 also showed upregulation of the bantam sensor (Fig 4E), demonstrating that this insertion reduces bantam miRNA levels. In this case the maximal level of sensor expression was similar to the level at the DV boundary. It was noted that the level of sensor expression was lower in the twin-spots, which express two copies of the endogenous bantam gene, than in the surrounding cells, which have one copy (Fig 4D, E). This suggested that elevated bantam levels would reduce sensor expression. Indeed, clones overexpressing bantam reduced EGFP levels (Fig 4F). Taken together, these observations indicate, first, that the sensor is capable of reflecting both increases and decreases in bantam miRNA levels in vivo. In all cases the effects on the sensor were cell autonomous. Second, they indicate that bantam miRNA is expressed in the wing disc. This method provides a generally applicable tool to visualise miRNA expression in vivo and can be applied to any transgenic animal.

Example 3 - bantam controls proliferation cell-autonomously

In light of the observation that bantam acts cell-autonomously to regulate sensor expression, we asked whether bantam also acts autonomously to control cell proliferation. FLP-induced mitotic recombination results in the generation of two daughter cells, one homozygous for the bantam^Δ1 deletion and a homozygous wild-type "twin" clone. The mutant and wild-type daughter cells are differently marked, allowing their progeny to be identified after a period of growth. Growth rates were directly compared by measuring the areas of pairs of mutant and wild-type twin clones (Fig 5). Clones were generated at the end of second instar, and analysed late in third instar. Mutant clones were on average 33% the size of the wild-type twins (n=42 pairs, Fig 5A; see also Fig 4D for clones in a disc expressing the bantam sensor). Although a few relatively large bantam mutant clones were observed, the mutant clones were typically very small. DAPI labelling did not reveal an observable difference in the apparent size or spacing of nuclei in mutant and wild-type tissue. Because the average size of the mutant clones was smaller than that of their wild-type twins, it was concluded that bantam acts cell-autonomously to control cell proliferation.

Example 4 - bantam can direct cell proliferation

The secreted signalling protein Wingless is expressed at the DV boundary of the wing disc and directs nearby cells to exit proliferation during the mid third instar stage. The proliferation differential can be visualised using BrdU incorporation to label cells undergoing DNA synthesis. Comparison of the bantam sensor with BrdU labelling showed that the region of reduced bantam miRNA (elevated sensor levels) corresponds to the zone of non-proliferating cells (Fig 6A). A second zone of reduced proliferation that has begun to appear along the anterior-posterior boundary is also reflected in upregulation of the bantam sensor. There is also a striking correlation between bantam expression and cell proliferation in other tissues, for example in the developing larval brain (Fig 6B). Restoring bantam expression was sufficient to direct cells in the non-proliferating zone to enter S phase (arrow, Fig 6C). In our previous study, cell cycle profiles of bantam overexpressing cells did not differ from wild-type cells (Hipfner, D. R., Weigmann, K., and Cohen, S. M. (2002). The bantam Gene Regulates Drosophila Growth, Genetics 161, 1527-37.).

Together these findings suggest that bantam can regulate cellular growth, G1/S and G2/M progression in a balanced manner and thereby control the rate of cell proliferation.

Example 5 - An apoptosis assay

Studies on the Myc and E2F oncogenes have shown that strong growth stimuli can simultaneously induce apoptosis (eg Dyson, N. (1998). The regulation of E2F by pRB-family proteins, Genes Dev 12, 2245-62; Pelengaris, S., Khan, M., and Evan, G. I. (2002). Suppression of Myc-induced apoptosis in beta cells exposes multiple oncogenic properties of Myc and triggers carcinogenic progression, Cell 109, 321-34.). Similarly, overexpression of E2F with its cofactor DP caused apoptosis in the Drosophila wing disc, and net cell proliferation resulted only when apoptosis was prevented (Neufeld, T. P., de la Cruz, A. F., Johnston, L. A., and Edgar, B. A. (1998). Coordination of growth and cell division in the Drosophila wing, Cell 93, 1183-1193.). In contrast, stimulation of growth by bantam overexpression was not associated with an increase in apoptosis (not shown). This raised the possibility that bantam might stimulate cell proliferation and simultaneously suppress apoptosis. To test this directly we expressed the pro-apoptotic gene hid (Grether, M. E., Abrams, J. M., Agapite, J., White, K., and Steller, H. (1995). The head involution defective gene of Drosophila melanogaster functions in programmed cell death, Genes Dev 9, 1694-708.) in the wing disc under ptc-Gal4 control. HID-induced apoptosis was visualised by antibody to activated caspase 3 (Fig 7A). In the adult wing, ptc-Gal4 directed hid expression led to a decrease in the area bounded by veins 3 and 4 (Fig 7C; 86±2% of wild-type; P<<0.001 using T-test). Coexpression of bantam suppressed HID-induced apoptosis in the wing disc and restored the size of the area bounded by veins 3 and 4 in the adult wing (Fig 7B, C; bantam + hid 106±1% of wild-type, P<0.001; bantam 118±2%, P<0.001). When expressed in post-mitotic cells of the eye imaginal disc using GMR-Gal4, HID-induced cell death caused a very small, rough eye phenotype (Fig 7D; Bergmann, A., Agapite, J., McCall, K., and Steller, H. (1998). The Drosophila gene hid is a direct molecular target of Ras-dependent survival signaling, Cell 95, 331-41). This phenotype was strongly, though not completely, suppressed by coexpression of bantam (Fig 7E). These observations indicate that bantam miRNA can suppress apoptosis in both proliferating and post-mitotic cells.

Example 6 - Bantam homologue search in the human genome

Drosophila and Anopheles bantam sequences (sense and reverse complement) were aligned. HMMer profiles were built based on the alignment using hmmbuild (25 nucleotide null model). The profiles were then calibrated using hmmcalibrate (HMMer package, Eddy). The human genome was scanned with the profile using hmmsearch (domain bitscore threshold minimum of 8). For each match the genomic DNA from 50 nucleotides upstream to 10 nucleotides downstream of the match was excised. These sequences were submitted to the mfold server (supra) and the resulting CT text files retrieved. The results were then filtered by selecting those molecules that met an energy cut-off of dG < -20 kJ/mol and several structural criteria: the stem length had to be at least 60 nucleotides; the hairpin loop could not lie within the putative miRNA; and the putative miRNA had to be paired to continuous sequence (no breaks). As a further check, the resulting putative miRNAs were then compared to the mouse and Fugu genomes since the new homolog would be expected to be highly similar to human orthologues. The results of the bantam human homolog search are shown in Table 1.
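The excision and filtering steps of this example can be sketched in a few lines of Python. This is an illustration only, not the software used in the example: the function names, the 0-based end-exclusive coordinate convention and the assumption that the match lies on the forward strand are all hypothetical, and the folding energy and stem length are taken as already parsed from the mfold output.

```python
def excise_flank(genome: str, start: int, end: int,
                 upstream: int = 50, downstream: int = 10) -> str:
    """Return the match plus 50 nt upstream and 10 nt downstream.

    start/end are assumed to be 0-based, end-exclusive coordinates of a
    forward-strand match; the window is clipped at the sequence ends.
    """
    left = max(0, start - upstream)
    right = min(len(genome), end + downstream)
    return genome[left:right]


def passes_filter(delta_g: float, stem_length: int,
                  max_delta_g: float = -20.0, min_stem: int = 60) -> bool:
    """Apply the energy cut-off (dG < -20) and the minimum stem length (60 nt).

    The loop-position and continuous-pairing checks described in the text
    would need the parsed structure and are not modelled here.
    """
    return delta_g < max_delta_g and stem_length >= min_stem


# Toy sequence: 100 nt of A, an 8 nt "match", then 100 nt of C.
toy_genome = "A" * 100 + "GTGAGATC" + "C" * 100
window = excise_flank(toy_genome, 100, 108)
```

For a match on the reverse strand the upstream and downstream sides would swap; that case is left out of the sketch for brevity.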

Example 7 - Identifying the target sequence of an miRNA

The aim is to identify target sequences for a known or putative miRNA. The HMMer (a Hidden Markov Model tool) program was used to search for sequences complementary to the miRNA of interest. Three different HMMer models were used to allow for a range of possible target configurations (illustrated in Fig 8A-C). The "exact" model assumes perfect alignment and imposes a penalty for mismatches or loops in either miRNA or its target. The insertion-deletion ("indel") model allows loops in either the miRNA or its target. The "loop" model allows loops only in the miRNA. By limiting the loops to one strand, this model allowed a greater range of variation in the extent and number of loops than could be used with the indel model. In a subsequent development two additional models were developed (gapped and 5' 8nt; see below).

A program was written in PERL to generate gapped alignments containing mismatches of the test miRNA reverse complement sequence using the miRNA sequence as input. The exact model contained 5 exact copies of the reverse complement. For the indel model the alignment contained copies of the miRNA reverse complement with 0, 1, 2, and 3 central nucleotides deleted or inserted. For the loop model, the alignment contained copies of the miRNA reverse complement with 3 to 6 of the central nucleotides deleted. Figures 8(A-C) illustrate how these models penalize sequence mismatches and where they are more and less permissive for mismatches and gaps.
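The sequence sets that seed the three models can be illustrated with a short sketch. It is written in Python rather than PERL and is not the program referred to above; all function names are hypothetical. The text does not say which nucleotides are inserted in the indel model, so the sketch uses a placeholder "N", and it produces unaligned variants only (writing them out as a gapped alignment for hmmbuild is omitted).

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


def delete_central(seq: str, n: int) -> str:
    """Remove n nucleotides centred on the middle of seq."""
    start = (len(seq) - n) // 2
    return seq[:start] + seq[start + n:]


def insert_central(seq: str, n: int, filler: str = "N") -> str:
    """Insert n filler characters at the midpoint (the filler is an assumption)."""
    mid = len(seq) // 2
    return seq[:mid] + filler * n + seq[mid:]


def exact_model(mirna: str, copies: int = 5) -> list[str]:
    """Five exact copies of the reverse complement."""
    return [reverse_complement(mirna)] * copies


def indel_model(mirna: str, max_indel: int = 3) -> list[str]:
    """Copies with 0-3 central nucleotides deleted, then 1-3 inserted."""
    rc = reverse_complement(mirna)
    deletions = [delete_central(rc, n) for n in range(0, max_indel + 1)]
    insertions = [insert_central(rc, n) for n in range(1, max_indel + 1)]
    return deletions + insertions


def loop_model(mirna: str, min_loop: int = 3, max_loop: int = 6) -> list[str]:
    """Copies with 3 to 6 central nucleotides deleted."""
    rc = reverse_complement(mirna)
    return [delete_central(rc, n) for n in range(min_loop, max_loop + 1)]


BANTAM = "GUGAGAUCAUUUUGAAAGCUG"  # 21 nt bantam miRNA from Example 1
```

Because the 3'UTR database is genomic DNA, a practical version would probably emit T in place of U before profile building; that conversion is omitted here.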

The three models were used in profile based sequence searches to generate lists of possible targets. The program hmmbuild from the HMMer package (Eddy) was used with a null model that corrected for the expected sequence length of 25 nucleotides to build HMMer profiles from the alignments. The profiles were calibrated with hmmcalibrate and a database consisting of 3'UTRs of known and predicted Drosophila genes was searched with hmmsearch (E-value threshold <100). At this level, the rankings in the HMMer lists were found not to be statistically significant (e>3). This is partly due to the fact that HMMer does not take into account the possibility of G:U base pairing. To address this problem, use was made of the RNA secondary structure prediction program, mfold (M. Zuker, D.H. Mathews & D.H. Turner (1999) Algorithms and Thermodynamics for RNA Secondary Structure Prediction: A Practical Guide In RNA Biochemistry and Biotechnology, 11-43, J. Barciszewski & B.F.C. Clark, eds., NATO ASI Series, Kluwer Academic Publishers; D.H. Mathews, J. Sabina, M. Zuker & D.H. Turner (1999) Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure J. Mol. Biol. 288, 911-940).

As the miRNA and each predicted target are independent sequences it was necessary to first connect them pairwise into single sequence strings. A PERL program was used to extend the miRNA with a canonical hairpin-loop (GGGGAC), the putative target sequences and stabilizing GC pairs at each end to produce a hypothetical molecule with the following organization (GC-predicted target-GGGGAC-miRNA-GC). The same exercise has also been performed without using stabilising GC pairs, using the hypothetical molecule (predicted target-GCGGGGACGC-miRNA). These sequences were submitted to the mfold server (http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form3.cgi), which predicts an RNA secondary structure. The resulting structural description text files were retrieved and evaluated on the basis of free energy (ΔG), the number of paired bases, the position of loops and mismatches to prepare lists of possible targets.
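The construction of the single-strand input for folding amounts to string concatenation, as the following Python sketch shows. It is an illustration rather than the PERL program itself; the function name and the `clamp` parameter are hypothetical, while the linker sequences are those given in the text.

```python
HAIRPIN_LOOP = "GGGGAC"          # canonical hairpin-loop from the text
UNCLAMPED_LINKER = "GCGGGGACGC"  # linker used when no end clamps are added


def build_fold_input(target: str, mirna: str, clamp: bool = True) -> str:
    """Join a predicted target and a miRNA into one foldable molecule.

    clamp=True  -> GC-predicted target-GGGGAC-miRNA-GC
    clamp=False -> predicted target-GCGGGGACGC-miRNA
    """
    if clamp:
        return "GC" + target + HAIRPIN_LOOP + mirna + "GC"
    return target + UNCLAMPED_LINKER + mirna


# Example with the bantam miRNA and its perfect reverse complement as target.
bantam = "GUGAGAUCAUUUUGAAAGCUG"
perfect_site = "CAGCUUUCAAAAUGAUCUCAC"
clamped = build_fold_input(perfect_site, bantam)
unclamped = build_fold_input(perfect_site, bantam, clamp=False)
```

The resulting string is what would be submitted to the folding program; parsing the returned structure file is a separate step not shown here.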

The combination of the HMMer profile search and the RNA folding program solves two problems needed to predict possible targets with reasonable confidence. The first is that the characteristics of target sequences, being relatively short and interrupted by mismatches and loops, have severely detrimental effects on BLAST-based searches. The second is that BLAST-based programs impose severe penalties on G:U base pairs which are allowed in RNA heteroduplexes and have been observed in miRNA-target complexes.

Both these problems are overcome by using a combination of HMMer profile searching and an RNA folding program. The combination of the two programs is useful in a way that neither is alone.

The target sequences identified by this method may then be validated experimentally by means of assays known to a skilled person. An example of such an assay is presented in Example 10.

Example 8 - Construction of a simplified 3'UTR database

A Drosophila melanogaster 3'UTR database was constructed by extracting the 2000 nucleotide genomic sequences downstream from each of the annotated translation features in the genome. The annotations (complete annotations file, *.GFF) and the genomic sequence (chromosome arm genomic sequence, *.FASTA) were obtained from the Berkeley Drosophila Genome Project (www.fruitfly.org). The resulting non-redundant database (unique identifiers and sequences) comprised 1447 UTRs.
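The extraction of downstream sequence can be sketched as follows. This is a simplified illustration rather than the procedure actually used: the `Feature` record, the function names and the handling of the minus strand are assumptions, and parsing of the GFF and FASTA files is omitted.

```python
from typing import Dict, Iterable, NamedTuple


class Feature(NamedTuple):
    """Hypothetical record for one annotated translated feature."""
    gene_id: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def downstream_region(chrom_seq: str, feat: Feature, length: int = 2000) -> str:
    """Return up to `length` nucleotides immediately 3' of the feature."""
    if feat.strand == "+":
        # Slicing clips automatically at the chromosome end.
        return chrom_seq[feat.end:feat.end + length]
    # Minus strand: downstream sequence lies to the left of the feature start.
    left = max(0, feat.start - 1 - length)
    return reverse_complement(chrom_seq[left:feat.start - 1])


def build_utr_database(chrom_seq: str, features: Iterable[Feature],
                       length: int = 2000) -> Dict[str, str]:
    """Collect downstream regions, keeping unique identifiers and sequences."""
    database: Dict[str, str] = {}
    seen = set()
    for feat in features:
        utr = downstream_region(chrom_seq, feat, length)
        if utr and utr not in seen and feat.gene_id not in database:
            seen.add(utr)
            database[feat.gene_id] = utr
    return database
```

Treating redundancy as identical sequence or identical identifier is one possible reading of "non-redundant"; other definitions would need a different filter.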

Example 9 - Testing the target search method

The method described in Example 7 was tested with the nucleotide sequence of C. elegans lin-4 as the miRNA. The 3' UTRs of the C. elegans lin-4 targets lin-14 and lin-28 were incorporated into the Drosophila 3'UTR database of the previous example. Using the lin-4 miRNA sequence, the database was searched for potential targets. Both the lin-14 and lin-28 sequences were retrieved and ranked among the top 20 hits (15th and 20th respectively).

Example 10 - Identifying a new target sequence for bantam

The method of Example 7 was applied using the Drosophila bantam miRNA to build the HMMer profiles and screened against the Drosophila 3'UTR database of Example 8. Using the exact model a list of possible target mRNAs was identified. Among these was the apoptosis inducing protein HID, which has a very good target site in its 3'UTR (Fig 8D - SEQ ID NO:2). A second possible target site was found in the 3'UTR of HID using the indel model (Fig 8E - SEQ ID NO:3). On the basis of these two sites, HID was considered a likely target for regulation by bantam miRNA.

The ability of bantam to regulate HID expression was assessed in transgenic flies using the GAL4/UAS system. HID protein was expressed under patched-GAL4 control using an EP insertion at the HID locus (Fig 9A). Expression of HID led to apoptosis, visualized by antibody to the activated form of Caspase 3 (Fig 9C). When HID was coexpressed with the bantam miRNA, HID protein levels were reduced indicating regulation of HID expression (Fig 9B). Consequently apoptosis was blocked (Fig 7B). Thus HID is a target for regulation by the bantam miRNA in vivo.

In animal cells miRNAs serve as negative regulators of gene expression by repressing translation of target messenger RNAs to which they bind. Target recognition is based on formation of an RNA duplex between the miRNA and its target mRNA, so identification of target genes is in principle amenable to computational analysis. We have developed a computational method to predict possible targets of miRNAs. There are hundreds of genes encoding miRNAs in animal genomes (estimated 255 in human). A method to predict miRNA targets will advance our understanding of the regulatory capacity of this part of the genome.

Example 11 - Constructing a conserved 3'UTR database

RNA structure prediction programs can evaluate the quality of predicted heteroduplexes such as the miRNA-target complexes of Example 7. However, the complexity of the RNA-folding problem means that it is not easily feasible to apply them to large databases (Eddy, 2002). In order to cut down the size of the 3'UTR databases that are used in the present invention, it is assumed that valid targets will be located in conserved 3' UTR sequences of protein coding genes. This assumption is based on the five validated miRNA targets known to date: the lin-4 targets lin-14 and lin-28 (Olsen, 1999; Seggerson, 2002), the let-7 targets lin-41 and lin-57 (Reinhart, 2000; Abrahante, 2003; Lin, 2003) and the bantam target hid (Brennecke, 2003). In each case the target sites are located in the 3' UTRs of the mRNA and are conserved in the 3' UTRs of the homologous genes from closely related species (Wightman, 1993; Moss, 1997; Brennecke, 2003).

A database of conserved 3' UTR sequences was generated by comparison of the D. melanogaster and D. pseudoobscura genomes. 3' UTRs cannot be predicted. Experimental evidence is available indicating 3' UTRs of >50bp for ~10000 D. melanogaster genes. Homologous UTR sequences were found for ~2/3 of these in D. pseudoobscura. For the remaining ~1/3 of predicted D. melanogaster genes, a 3' UTR of 2 Kb was assumed and searched for conserved sequences adjacent to the corresponding D. pseudoobscura gene. The conserved 3' UTR database is 22% the size of the full-length UTR database and so reduces the number of predicted target sites by ~5 fold. The conserved validated and predicted UTR databases can be considered separately or combined together.

Example 12 - Measuring the significance of target site prediction

Z scores

The length and GC content of each miRNA influence the folding energy for all its predicted targets. To normalize for sequence length and GC content and to permit evaluation of how predicted target sites compare to random sequences, folding energies were converted into Z scores. For each miRNA, 10000 randomly selected sequences of the same length as the average predicted target site were evaluated. The mean and standard deviation of the MFOLD free energy were determined for these sequences and used to calculate the Z score:

Z = {ΔG (target site) - ΔG (mean of random sequences)} / (standard deviation of ΔG for random sequences)

This provides a means to evaluate the likelihood that a predicted target site is significantly different from random matches. Random matches show a normal distribution of ΔG values, with 0.3% of random matches expected to have folding energies more than 3 SD above the mean. This figure drops to 0.01% of random matches expected at Z>4 (see Figure 10).

Expectation (E) values

The statistical significance of predicted targets was also assessed using the background distributions of RNA-RNA duplex energies. We did this with expectation (E) values similar to those used in sequence comparison (e.g. BLAST). For a particular score (ΔG value), E predicts the number of background matches that are equal or better. E-values greater than 1 are not significant, while those close to 0 are very significant. E-values are not restricted to normal distributions (unlike Z scores) and readily scale with database size, meaning that different searches can be compared. To compute E, an exponential function is fitted to the cumulative background distribution of energies and extrapolated to give a value for any observed energy and database size. The best scoring single sites were found to have folding energies between -30 and -40 kcal/mol and E-values close to 1 (at the border of significance). In such cases, experimental validation would be important.
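The two significance measures of this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the results reported here: the function names, the `tail_fraction` parameter and the choice of a least-squares fit to the logarithm of the cumulative fraction are all assumptions.

```python
import math
import statistics
from typing import Sequence, Tuple


def z_score(dg_site: float, background: Sequence[float]) -> float:
    """(dG of site - mean background dG) / SD of background dG, as written
    in the text. With this sign, a site more stable than background gives a
    negative value; the Z>3 thresholds in the text presumably use the
    opposite sign (an assumption)."""
    return (dg_site - statistics.mean(background)) / statistics.stdev(background)


def fit_exponential_tail(background: Sequence[float],
                         tail_fraction: float = 0.1) -> Tuple[float, float]:
    """Fit ln(fraction of background with dG <= x) = a + b * x.

    Only the best-scoring (most negative) part of the background is fitted;
    tail_fraction is a hypothetical tuning parameter.
    """
    ordered = sorted(background)              # most negative energy first
    n = len(ordered)
    k = max(2, int(n * tail_fraction))
    xs = ordered[:k]
    ys = [math.log(i / n) for i in range(1, k + 1)]  # cumulative fraction
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("background energies in the tail are all identical")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return a, b


def e_value(dg_site: float, a: float, b: float, n_candidates: int) -> float:
    """Expected number of background matches with equal or better energy,
    scaled by the number of candidate sites in the searched database."""
    return n_candidates * math.exp(a + b * dg_site)
```

Because the fitted function is scaled by the number of candidates, the same fit can be reused to compare searches against databases of different sizes, which is the property the text attributes to E-values.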

Multiple sites

Multiple sites within a single UTR can greatly increase the statistical significance of the prediction and may provide a better guide to prediction of valid target RNAs. For example, the hid 3' UTR had the second-best scoring single site on both lists (see below), but its E value of 7.6 indicates that there are many false positives of equal quality. Multiple sites can improve confidence that the predictions are valid. For bantam the exact model predicts 2 sites. The 5'8 model predicts 4 sites with a highly significant E-value of 3x10^-10. Mutation of the two sites with best folding energy reduced the sensitivity of the UTR to regulation by bantam, but did not eliminate it. This indicates that multiple sites can contribute to regulation of a real 3' UTR. The statistical argument suggests that the presence of multiple bantam sites might be a better predictor of function than the best single sites. Following the same statistical argument, the sum of Z scores of predicted targets that fall within one UTR may be a better predictor of function than the Z score from single sites. The way that multiple hits within a single UTR are treated is described above.
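Ranking UTRs by the sum of Z scores of their sites can be sketched as below. The function name `rank_utrs` and the numbers in the usage example are illustrative assumptions, not values from the tables.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_utrs(sites: Iterable[Tuple[str, float]],
              z_cutoff: float = 3.0) -> List[Tuple[str, int, float, float]]:
    """Group predicted sites by UTR and rank UTRs by summed Z score.

    sites: (utr_id, z) pairs, one per predicted site.
    Returns (utr_id, number of sites, best single Z, sum of Z) tuples,
    sorted by the sum in descending order. Sites at or below z_cutoff
    are ignored.
    """
    per_utr = defaultdict(list)
    for utr_id, z in sites:
        if z > z_cutoff:
            per_utr[utr_id].append(z)
    rows = [(utr_id, len(zs), max(zs), sum(zs)) for utr_id, zs in per_utr.items()]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Illustrative numbers only: utrB has the best single site,
# but utrA ranks first once its two sites are summed.
example_sites = [("utrA", 5.8), ("utrA", 4.1), ("utrA", 2.0),
                 ("utrB", 6.0), ("utrC", 2.5)]
ranking = rank_utrs(example_sites)
```

In this toy input a UTR with two moderate sites outranks a UTR with a single stronger site, which is the situation described for hid.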

Example 13 - Refining the Exact model

Inspection of the few known miRNA-target duplexes suggests that structural features of the duplex might be important for function. Possible features include the apparently greater complementarity at the 5' end of the miRNA, a C-bulge in lin-4 targets, and the preference for loops in the middle (Banerjee, 2002; Lai, 2002). However, with so few validated examples, it is not possible to distinguish general features from those that might be specific to certain miRNAs or even random features that are tolerated as opposed to being required for function. For this reason, in a first approach, no assumptions were made about the structural features of the RNA duplexes. HMMer profiles were prepared using two "exact" alignment models. The first model assumes G:C base pairing. The second model gives equal 'weight' to G:C and G:U base pairs. The 3'UTR database was searched with both models. The two lists of prospective targets were merged and duplicates removed.

The resulting lists of predicted target sites are very long, typically >10000 entries. In order to reduce the number of sequences that have to be examined in detail, three filters were applied. (1) Lower energy sites occur more frequently, so they are more likely to occur by chance. Predicted targets with folding energies scoring Z<3 were discarded. (2) Predicted targets that overlapped the coding sequence of another gene were discarded. This is based on the assumption that valid targets must be conserved in related genomes. If there is an overlap with coding sequence, we cannot evaluate the basis for the sequence conservation and assume that it is more likely to be due to the function of the coding sequence than of the 3'UTR. These two filters reduced the lists to hundreds rather than thousands of entries. (3) The third filter is for known 3' UTRs. Although predicted 3' UTRs may also prove to be valid, in order to limit the number of predictions to a level that can reasonably be tested by experiment, the evaluation of sites in predicted UTRs was deferred until more information is gained from examining sites in known 3' UTRs. The reason for using these filters is to reduce the number of false positives.

bantam, lin-4 and let-7

Example 10 demonstrates that using an earlier version of the exact model, hid was identified as a target for bantam. Example 10 did not require conservation of predicted target sequences in the D. pseudoobscura genome. Using the refined exact alignment model with the conserved UTR database, more single sites were identified within the hid UTR (with folding energies that ranked Z>4) compared to the number of single sites identified using the method of Example 10. The significant difference between these two searches is database size. In the complete UTR database there are 5 times as many possible matches for any given folding energy, so background sites scored higher than some real target sites. Using a Z score cutoff of 3, real sites were lost. The hid UTR ranks in position 2 of the list of target genes in terms of the predicted folding energy for the best single site (Table 2). If the sum of folding energies of all sites (Z>3) in each UTR is considered, hid ranks first. This example shows that the refined exact alignment model can detect valid targets with increased confidence.

Example 14 - The gapped model: sequential alignment of 5' and 3' ends

Detection of known targets for the C. elegans lin-4 and let-7 miRNAs when their 3' UTRs are included in the Drosophila 3' UTR database

These targets were selected because some had been experimentally validated and because they contained features that may have proved difficult to detect with the exact model (a "C-bulge" near the 5' end of the duplex and looping out of the miRNA).

This prompted a more detailed evaluation of the bantam sites in hid. We found that the exact model had accurately predicted the top-scoring two sites but that it did not align the 3' ends of the latter 3 sites optimally because of the large gaps needed to find the optimal alignment at the 3' end of the miRNA (Figure 2). We found that the hid UTR showed bantam-dependent regulation even when the two best-scoring sites were mutated, indicating that these gapped sites are functional (Brennecke, 2003).

A second approach was designed to favour alignment at the 5' end of the miRNA and to give more flexibility in positioning the 3' end alignment. The conserved 3'UTR database was searched separately for sequences complementary to the 5' and 3' ends of the miRNA, allowing for G:U base pairs in the HMMer alignments. Based on examination of known and predicted targets, we selected 8 bp for 5' alignment and 5 bp for 3' alignment. Allowing a gap of up to 5 bp between the two partial alignments proved to allow some flexibility in alignment without dramatically increasing the number of possible alternative 3' alignments for each 5' match. Using this method, the known targets of the C. elegans lin-4 and let-7 miRNAs were found in the conserved 3' UTR database (to which the UTRs of lin-14, lin-28, lin-41 and lin-57 were added). Using lin-4 miRNA, all seven previously identified sites were predicted in the lin-14 UTR and 4 new sites were predicted. Two sites were predicted in both lin-28 and lin-41 UTRs. When the target list was sorted according to the best scoring single site in each UTR, lin-28 ranked 3rd, lin-14 ranked 15th and lin-41 ranked 27th. Comparable results were obtained for the let-7 miRNA: lin-14 ranked 2nd, lin-57 ranked 4th and lin-41 ranked 8th. The Drosophila homologue of lin-41, dpld, ranked 44th (Z=7.3). These observations permit two conclusions:

(1) The gapped model can find valid targets sites missed by the exact model.

(2) All of the experimentally validated targets for the C. elegans and Drosophila miRNAs rank high on these lists.

Table 2 provides an assessment of predictions for known and predicted miRNA targets made according to the method described above.

Table 2

miRNA/target pair | ΔG | Z_Max | Rank Z_Max | # sites Z>3 | Rank Z_UTR

Confirmed pairs
lin-4 / lin-14 | -29.9 | 4.3 | 20 | 3 | 1
lin-4 / lin-28 | -30.9 | 4.6 | 8 | 1 | 15
let-7 / lin-41 | -32.3 | 6.4 | 3 | 2 | 20
let-7 / lin-57 (hbl-1) | -33.4 | 6.8 | 2 | 14 | 1
bantam / hid | -33.0 | 5.8 | 3 | 4 | 1

Predicted pairs
lin-4 / lin-41 | -28.9 | 4.0 | 32 | 1 | 36
lin-4 / lin-57 | -21.6 | 1.7 | 361 | 0 | -
let-7 / lin-14 | -35.1 | 7.2 | 1 | 13 | 2
let-7 / lin-28 | -20.6 | 2.8 | 861 | 0 | -
miR-13a / hb | - | - | - | 0 | -
miR-4 / hb | - | - | - | 0 | -
miR-3 / hb | - | - | - | 0 | -
miR-11 / HLHm8 | -29.4 | 4.7 | 27 | 1 | 46 (predicted UTR)
miR-4 / HLHm4 | -21.5 | 2.1 | 272 | 0 | -
miR-7 / HLHm3 | -37.3 | 7.4 | 2 | 1 | 53
miR-7 / Tom | -34.5 | 6.6 | 5 | 2 | 6
miR-14 / Drice | - | - | - | 0 | (site not conserved)

"Confirmed pairs" indicates experimentally validated target 3' UTRs. ΔG, Z_Max and Z_UTR are as defined above. "Predicted pairs" indicates examples that are predicted in the literature for which there is no experimental validation. The let-7/lin-14 pair ranks very high on the list of let-7 predictions and is likely to be a functional target. The lin-4/lin-41 pair requires experimental validation. The other C. elegans predictions cannot be distinguished from random matches. The 5' end of the K box shows sequence complementarity to the miR-2/miR-13 family and to miR-6 and miR-11 (Lai, E. C. 2002). The prediction of HLHm8 as a target for miR-11 seems plausible (using the predicted UTR), as do the two miR-7 GY box-based predictions. None of the other K or Brd box predictions showed convincing folding energies when examined by Mfold. None of the conserved sites predicted for Drosophila hb were on our lists because of interrupted 5' alignments (Abrahante, J. E. et al. 2003), although hb did place on the miR-7 list (Z=4.0). The site predicted for miR-14 in the Drice 3'UTR is not conserved in related genomes, and is therefore unlikely to be functional (Xu, P., Vernooy, S. Y., Guo, M. & Hay, B. A. The Drosophila MicroRNA Mir-14 Suppresses Cell Death and Is Required for Normal Fat Metabolism. Curr Biol 13, 790-5 (2003)).
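The gapped search of Example 14 can be illustrated with a simplified Python sketch. It replaces the HMMer profile searches with direct base-pair checks, which is a stand-in and not the method used here. All names are hypothetical. The interpretation of the 5 bp gap as an allowed deviation of the target spacing from the length of the miRNA middle region is an assumption.

```python
from typing import List, Tuple

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq: str) -> str:
    return seq.translate(_RNA_COMPLEMENT)[::-1]


def can_pair(m_base: str, t_base: str, allow_gu: bool = True) -> bool:
    pair = (m_base, t_base)
    return pair in WATSON_CRICK or (allow_gu and pair in WOBBLE)


def segment_matches(mirna_seg: str, utr: str, start: int,
                    allow_gu: bool = True) -> bool:
    """True if utr[start:start+len] pairs antiparallel with mirna_seg."""
    n = len(mirna_seg)
    window = utr[start:start + n]
    if len(window) < n:
        return False
    # miRNA base i pairs with target base n-1-i of the window.
    return all(can_pair(mirna_seg[i], window[n - 1 - i], allow_gu)
               for i in range(n))


def gapped_sites(mirna: str, utr: str, n5: int = 8, n3: int = 5,
                 max_gap: int = 5, allow_gu: bool = True) -> List[Tuple[int, int]]:
    """Return (start, end) UTR coordinates of candidate gapped sites.

    The 5' end of the miRNA (n5 bases) and its 3' end (n3 bases) are matched
    separately; the spacing between the two matches on the UTR may differ
    from the miRNA middle region by at most max_gap (assumed reading).
    """
    middle = len(mirna) - n5 - n3
    seg5, seg3 = mirna[:n5], mirna[-n3:]
    hits = []
    for pos5 in range(len(utr) - n5 + 1):
        if not segment_matches(seg5, utr, pos5, allow_gu):
            continue
        for spacing in range(max(0, middle - max_gap), middle + max_gap + 1):
            pos3 = pos5 - spacing - n3  # 3' match lies upstream on the UTR
            if pos3 < 0:
                break
            if segment_matches(seg3, utr, pos3, allow_gu):
                hits.append((pos3, pos5 + n5))
    return hits
```

Candidate sites found this way would still need to be scored by a folding program; the sketch only covers the sequential 5' and 3' matching.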

Example 15 - Comparison of exact and gapped models for bantam targets

A preliminary comparison of the exact and gapped models for bantam targets is shown in Table 3, which shows the top 20 predicted single sites for bantam using the two models. The lists differ by 8/20 loci (shaded), indicating that they do select for different features. It is to be expected that the gapped model will find many of the high-scoring sites found by the exact model, though the reverse need not be true. The top two predictions are the same, and both models do tend to find the same best site for the genes that are on both lists. The difference between the lists increases for lower scoring sites. A site that scores highly in both models is therefore likely to be a valid target in vivo. Experimental tests could then be used to confirm the ability of these UTRs to mediate bantam-dependent repression.

Example 16 - systematic validation of target predictions

Lists of predicted targets have been prepared using the exact and gapped models for the known miRNAs of Drosophila.

miR-7

miR-7 was selected for analysis on the basis of target predictions by Eric Lai (Lai, 2002). Lai previously defined regulatory elements known as K boxes and GY boxes in the 3' UTRs of Notch pathway target genes of the HLH transcription factor family. He had shown that these sites were functional as repressors of translation and in control of RNA level (Lai, 1998; Lai, 1997). He reported that the 5' end of the GY box showed sequence complementarity to the 5' end of miR-7. By visual examination of the 3' UTRs of transcripts known to contain GY boxes, Lai predicted one target site for miR-7 in the HLHmgamma and Tom genes. The exact model found these and additional high-scoring sites in transcripts for other HLH transcription factors (Table 3). HLHm3, hairy, Tom and HLHmgamma rank in the top 10. HLHm3, hairy and Tom were also found among the top 10 by the gapped model (Table 4).

Table 3: Z scores and ΔG are shown for the best single site at each locus, along with the number of sites of Z>3 for the exact model and Z>4 for the gapped model and the sum of those scores. The names of known genes are shown. Many of the predicted targets lie in annotated genes about which nothing is known. Those are shown by a dash (-) in the gene name column.

Table 4: top 10 miR-7 targets (exact model and gapped model)

This example is interesting for two reasons. First, it illustrates that with specific prior knowledge of a connection between the miRNA and possible target gene, one can predict target sites 'by eye'. The difficulty lies in doing so on a genome-wide basis; making this possible is one of the main advantages of the present invention.

Second, the clustering of top-scoring sites in one gene family is highly suggestive. A number of the top targets are in bHLH genes. This is to be expected using the prior knowledge approach, but is significant when it arises from an unbiased whole genome analysis. Preliminary data confirm the finding for the bHLH genes in that overexpression of miR-7 (as a UAS-dsRed transgene) was found to cause phenotypes consistent with repression of the bHLH proteins, including repression of Cut expression at the wing margin and blocking sense organ development (data not shown). Several of the top 10 genes on the miR-7 list were from the E(spl) and Brd complexes and encoded HLH or Brd family proteins (see Table 4). The HLH protein Hairy was also in the top 10. Clustering of top-scoring sites in a group of related genes is significant when it arises from an unbiased genome-wide analysis. This prompted us to examine all the genes in E(spl) and Brd complexes for miR-7 sites, including predicted UTRs. We found possible target sites in most genes of the E(spl) and Brd complexes. Alignment of these sites showed a pattern of 5' end conservation in some of the genes (see Figure 11a).

To assess the validity of predicted miR-7 targets, 3'UTR sensor transgenes were prepared (as described by Brennecke et al 2003). The 3' UTRs of the predicted targets HLHm3, HLHm4 and hairy were cloned into a tubulin promoter-EGFP reporter plasmid and used to produce transgenic flies. A specific miR-7 sensor transgene was produced by cloning two copies of a perfect complement of the miR-7 miRNA sequence into the 3' UTR of this construct. To allow GAL4-dependent expression of miR-7, a genomic fragment containing the miR-7 hairpin was cloned into the 3' UTR of a UAS-DSRed2 plasmid and used to produce transgenic flies. The miR-7 GFP sensor transgene was expressed uniformly in the wing imaginal disc. Gal4-dependent expression of miR-7 miRNA reduced expression of the miR-7 GFP sensor transgene (see Figure 11b). Gal4-dependent expression of miR-7 also caused down-regulation of the HLHm4 3' UTR sensor transgene (see Figure 11c). The hairy 3' UTR sensor transgene also showed clear down-regulation by miR-7 (not shown). The hairy gene has been cloned and cDNAs sequenced from three insect genomes: the flour beetle Tribolium castaneum, the mosquito Anopheles gambiae and D. simulans. The predicted miR-7 binding site is conserved in all 5 genomes, and shows striking conservation of alignment at the 5' and 3' ends of the predicted miRNA binding site (see Figure 11d). The HLHm3 3' UTR sensor also showed regulation by miR-7 (not shown).

These observations validate the utility of the method of the invention in predicting new miRNA targets.

miR-2a

The list of predicted targets for miR-2a (Table 5) suggests that this miRNA might function in the control of apoptosis.

The pro-apoptotic genes reaper and grim rank among the top 10 on both lists ranked by the best Z score for a single site in each UTR. reaper and grim have also been found to rank high on the list of predicted targets for miR-2b, miR-13a and miR-13b (data not shown), indicating that these miRNAs may all be involved in regulation of the pro-apoptotic genes and thus that they might function to control cell death in vivo. Using an in vitro assay, regulation of reporter gene expression via the reaper 3' UTR was validated. The reaper UTR and a mutant version of the reaper UTR lacking the predicted miR-2a/2b/13a/13b binding site were compared. The UTR lacking the binding site showed higher reporter gene expression, indicating that in the intact UTR this site reduces reporter gene expression.

Table 5: list of predicted targets for miR-2a (exact model and gapped model)

The list of predicted miR-2a targets also showed an intriguing cluster of genes involved in apoptosis (see Table 5). The pro-apoptotic genes reaper and grim were among the top predictions for miR-2a and also for miR-2b, miR-13a and miR-13b, which have related miRNA sequences. reaper, grim and the third pro-apoptotic gene sickle are clustered in the genome and show blocks of high conservation in their 3' UTRs, which include the miR-2a sites (underlined, Figure 12a). Alignment of the miR-2a sites shows a very similar pattern of predicted miRNA binding for reaper and grim (see Figure 12b). sickle is more divergent for miR-2a, but is predicted to bind better to miR-2b (Z=3.6). To validate the predicted site, a reaper 3' UTR sensor transgene was transfected into Drosophila Schneider S2 cells (which express miR-2a, miR-2b, miR-13a and miR-13b). S2 cells were transfected with the reaper 3' UTR construct or with a version of the construct in which the miR-2a binding site was mutated (the residues shown in Figure 12b were replaced by a NotI site). A low level of GFP expression was detected in immunoblots of cells transfected with the reaper 3' UTR construct (see Figure 12c, lane 2). The level of GFP expression was much higher in cells transfected with the mutated UTR construct. Thus the endogenous miR-2a family miRNAs repress expression of a reporter construct via the reaper 3'UTR. This experiment validates the prediction of novel sites for miR-2a.

Example 17: Excluding false positive hits for miRNA targets

An improved method is here described, based on methods described in Brennecke et al 2003 (Cell. 2003 Apr 4; 113(1): 25-36) and Stark et al 2003 (PLoS Biol. 2003 Dec; 1(3): E60. Epub 2003 Oct 13). Specific improvements include the following.

We now search for sequences complementary to bases 2-7 of the miRNA, extend the sequence and require base pairing in at least 7 of the first 8 positions (e.g. 1-7 or 2-8). Sequences that match this requirement are extended to miRNA length + 5 and evaluated for alignment to the entire miRNA. A string recognition tool written in PERL was adequate to perform this search.

It is known that some valid target sites contain G:U base pairs. The stringency of the search can be adjusted by allowing G:U base pairs. The preferred method allows 1 G:U base pair in positions 2-7 (and thus a total of 3 if positions 1 and 8 are considered). An alternate version of the method allows more G:U base pairs in positions 2-7. The maximum number is defined by the possibility of forming G:U base pairs with the miRNA sequence.
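The seed search and G:U allowance described above can be sketched in Python (the original tool was written in PERL and is not reproduced here). The function name, the return format and the decision to let a G:U pair count as pairing at positions 1 and 8 are assumptions made for illustration.

```python
from typing import List, Tuple

WC = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
GU = {("G", "U"), ("U", "G")}


def seed_candidates(mirna: str, utr: str,
                    max_gu_core: int = 1) -> List[Tuple[int, int, str]]:
    """Find UTR windows that pair with the 5' end of the miRNA.

    Requirements (as read from the text):
      - miRNA positions 2-7 all pair, with at most max_gu_core G:U pairs;
      - position 1 or position 8 also pairs, giving at least 7 of 8.
    Each hit is returned as (start, end, sequence), where the sequence is
    the site extended upstream to miRNA length + 5.
    """
    hits = []
    for start in range(len(utr) - 8 + 1):
        window = utr[start:start + 8]
        # miRNA position p (1-based) pairs with window[8 - p] (antiparallel).
        pairs = [(mirna[p - 1], window[8 - p]) for p in range(1, 9)]
        core = pairs[1:7]  # miRNA positions 2-7
        if not all(pair in WC or pair in GU for pair in core):
            continue
        if sum(pair in GU for pair in core) > max_gu_core:
            continue
        if not any(pair in WC or pair in GU for pair in (pairs[0], pairs[7])):
            continue
        site_end = start + 8
        site_start = max(0, site_end - (len(mirna) + 5))
        hits.append((site_start, site_end, utr[site_start:site_end]))
    return hits
```

Raising `max_gu_core` corresponds to the alternate, less stringent version of the method; setting it to 0 restricts positions 2-7 to Watson-Crick pairs.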

The free energy of base-pairing between putative target sequences and miRNA is then calculated. We use an alignment software package provided to us by Marc Rehmsmeier (University of Bielefeld), termed RNAhybrid. Sequences for alignment can be submitted to http://bibiserv.techfak.uni-bielefeld.de/rnahybrid/submission.html. One advantage of this tool is that RNAhybrid does not require concatenation of the miRNA and target sequences. This eliminates the need for addition of a hairpin-forming linker sequence. It also allows for mispairing in position 1, which we observe in valid targets and which was penalized too strongly in the original method. Fundamentally, the alignment method generates a predicted free energy of folding. In this sense Mfold and RNAhybrid do the same thing: both permit evaluation of the relative quality of an alignment, each in a slightly different way.

The method is performed iteratively for two genomes to improve the filter for conservation during evolution and thus reduce false positives due to random matches. This is an alternative to the use of a conserved UTR database. First, a database of UTRs from genome 1 (Drosophila melanogaster) was searched. If there is a good site in a UTR, we then search the UTR from the corresponding gene from genome 2 (D. pseudoobscura). Genomes 1 and 2 can be any genomes: human and mouse or any other mammal or vertebrate genome, or any genome. The method for identifying orthologues and predicting UTRs in genome 2 is unchanged. This approach can be extended to include a third genome or any number of related genomes as desired. Having more genomes improves the filter for conservation during evolution and hence reduces false positives due to random matches.
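The iterative two-genome filter can be sketched as follows. The function name, the dictionary-based inputs and the `has_good_site` callback are hypothetical; the callback stands in for whatever site search and scoring is applied to each UTR.

```python
from typing import Callable, Dict, List


def conserved_predictions(utrs_genome1: Dict[str, str],
                          utrs_genome2: Dict[str, str],
                          orthologues: Dict[str, str],
                          has_good_site: Callable[[str], bool]) -> List[str]:
    """Keep genome 1 genes whose UTR and orthologous UTR both contain a site.

    utrs_genome1 / utrs_genome2: gene identifier -> UTR sequence.
    orthologues: genome 1 gene identifier -> genome 2 gene identifier.
    has_good_site: hypothetical predicate applied to a UTR sequence.
    """
    kept = []
    for gene, utr in utrs_genome1.items():
        if not has_good_site(utr):
            continue  # genome 2 is only searched after a hit in genome 1
        partner = orthologues.get(gene)
        if partner is None or partner not in utrs_genome2:
            continue
        if has_good_site(utrs_genome2[partner]):
            kept.append(gene)
    return kept
```

Extending the filter to a third genome would amount to repeating the orthologue lookup and site check for each additional genome before a gene is kept.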

A new feature of the method is comparison of the quality of the sequence conservation of the target sites in related genomes. This was built in previously in the database, by requiring the site to be in a block of conserved sequence. We now examine the predicted sites in the 2 (or more) genomes not only for their free energy of folding, but also for how conserved the sequences are across genomes (i.e. do the two sites base pair similarly to the miRNA, or are the folding energies generated by structurally different alignments). In this way, a comparison is made of the quality of the sequence conservation of the candidate target sites in related genomes to give a factor that scales the relevance of the score for the free energy of folding.

This is a significant improvement over the 5' conservation filter used in the previous implementation of the method.

Figure 13 shows how the method described in Brennecke et al 2003 (Cell. 2003 Apr 4; 113(1): 25-36) and Stark et al 2003 (PLoS Biol. 2003 Dec; 1(3): E60. Epub 2003 Oct 13) generates 45 predictions. Only 17 of these have been experimentally validated, making a 38% success rate. Use of the flags filters out several sites that are unlikely to be real and so eliminates 10 out of 28 false positives to give a success rate of 49% true positives.

Using the improved method, initially omitting the final step of comparing the quality of the sequence conservation of the target sites in related genomes, eliminates 18 of the false positives to give a success rate of 17/27 predictions or 63% correct predictions. If this final step is included, further false positives may be eliminated.

REFERENCES

Abrahante, J. E., Daul, A. L., Li, M., Volk, M. L., Tennessen, J. M., Miller, E. A., and Rougvie, A. E. (2003). The Caenorhabditis elegans hunchback-like Gene lin-57/hbl-1 Controls Developmental Time and Is Regulated by MicroRNAs, Dev Cell 4, 625-37.

Banerjee, D., and Slack, F. (2002). Control of developmental timing by small temporal RNAs: a paradigm for RNA-mediated regulation of gene expression, Bioessays 24, 119-129.

Brennecke, J., Hipfner, D. R., Stark, A., Russell, R. B., and Cohen, S. M. (2003). bantam encodes a developmentally regulated microRNA that controls cell proliferation and regulates the pro-apoptotic gene hid in Drosophila, Cell 113, 25-36.

Eddy, S. R. (2002). Computational genomics of noncoding RNA genes, Cell 109, 137-40.

Lai, E. C. (2002). Micro RNAs are complementary to 3' UTR sequence motifs that mediate negative post-transcriptional regulation, Nat Genet 30, 363-4.

Lai, E. C., Burks, C., and Posakony, J. W. (1998). The K box, a conserved 3' UTR sequence motif, negatively regulates accumulation of enhancer of split complex transcripts, Development 125, 4077-88.

Lai, E. C., and Posakony, J. W. (1997). The Bearded box, a novel 3' UTR sequence motif, mediates negative post-transcriptional regulation of Bearded and Enhancer of split Complex gene expression, Development 124, 4847-56.

Lin, S. Y., Johnson, S. M., Abraham, M., Vella, M. C., Pasquinelli, A., Gamberi, C., Gottlieb, E., and Slack, F. J. (2003). The C. elegans hunchback Homolog, hbl-1, Controls Temporal Patterning and Is a Probable MicroRNA Target, Dev Cell 4, 639-50.

Moss, E. G., Lee, R. C., and Ambros, V. (1997). The cold shock domain protein LIN-28 controls developmental timing in C. elegans and is regulated by the lin-4 RNA, Cell 88, 637-46.

Olsen, P. H., and Ambros, V. (1999). The lin-4 regulatory RNA controls developmental timing in Caenorhabditis elegans by blocking LIN-14 protein synthesis after the initiation of translation, Dev Biol 216, 671-80.

Reinhart, B. J., Slack, F. J., Basson, M., Pasquinelli, A. E., Bettinger, J. C., Rougvie, A. E., Horvitz, H. R., and Ruvkun, G. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans, Nature 403, 901-6.

Seggerson, K., Tang, L., and Moss, E. G. (2002). Two genetic circuits repress the Caenorhabditis elegans heterochronic gene lin-28 after translation initiation, Dev Biol 243, 215-25.

Wightman, B., Ha, I., and Ruvkun, G. (1993). Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans, Cell 75, 855-862.

List of sequences

SEQ ID NO: 1

GUGAGAUCAUUUUGAAAGCUG

SEQ ID NO:2

UAGUUUUCACAAUGAUCUCGGGGGGACGUGAGAUCAUUUUGAAAGCUG

SEQ ID NO:3

GCCAUAUUCAAAUUGGUCUCACGGGGACGUGAGAUCAUUUUGAAAGCUGGC

SEQ ID NO:4

UGAGAUCAUUUUGAAAGCUGA

SEQ ID NO:5

UGAGAUCAUUUUGAAAGCUGAU

SEQ ID NO:6

UGAGAUCAUUUUGAAAGCUGAUU

Table 1


Claims

1. A computer-implemented method for identifying an miRNA molecule, comprising the steps of:
a) generating a sequence profile for the miRNA molecule, wherein said sequence profile defines a continuous nucleotide sequence that is 20-30 nt in length, that specifies higher sequence conservation at the 5' and 3' termini of the miRNA than the sequence conservation that is specified in the middle region of the miRNA molecule;
b) using the profile as a query sequence to search a database of nucleic acid sequences to identify a putative miRNA sequence that satisfies the sequence profile;
c) extending the putative miRNA sequence of step b) to include a region of contiguous nucleotides of genomic sequence immediately upstream and a region of contiguous nucleotides immediately downstream of the putative miRNA sequence, to generate the predicted precursor of the miRNA molecule;
d) assessing the ability of said precursor sequence to fold into a secondary structure;
e) selecting as the candidate miRNA molecule, one whose precursor sequence generates a secondary structure with a low predicted energy of folding and which forms a stem loop structure, wherein the sequence of the miRNA molecule itself is fully paired with the other arm of the stem in the precursor sequence and forms no part of the loop.

2. A method according to claim 1, wherein in step a), homologs of the miRNA sequence of interest are aligned together to generate a sequence profile that is a characteristic statistical description of the consensus sequence that is representative of the miRNA molecule.

3. A method according to claim 1 or claim 2, wherein the sequence profile generated is a hidden Markov model (profile HMM).

4. A method according to any one of the preceding claims, wherein the sequence profile is scored such that a higher degree of sequence conservation is required at the 5' and 3' termini of the miRNA sequence of interest, than is required in the middle region.

5. A method according to claim 4, wherein the middle region comprises the central 3-6 nucleotides of the miRNA molecule.

6. A method according to any one of the preceding claims, wherein the continuous nucleotide sequence defined in the sequence profile is between 21 and 23 nucleotides in length.

7. A method according to any one of the preceding claims, wherein in step b) of the method, the profile is used as a query sequence to search a database of nucleic acid sequences using the HMMER tool.

8. A method according to any one of the preceding claims, wherein the database that is searched is a genomic DNA database.

9. A method according to any one of the preceding claims, wherein in step c) of the method, around 80 nucleotides are excised around the putative miRNA sequence, including around 50 nucleotides upstream and around 10 nucleotides downstream and vice versa.

10. A method according to any one of the preceding claims, wherein in step d) of the method, the ability of said precursor sequence to fold into a secondary structure is assessed through the use of simple energy rules or energy minimization criteria.

11. A method according to claim 10, wherein the Mfold set of programs is used to assess the ability of said precursor sequence to fold into a secondary structure.

12. A method according to claim 11, wherein the energy of folding (free energy ΔG) of said precursor sequence is equal to or below -18 kJ/mol.

13. A method according to any one of the preceding claims, wherein in step e), the stem loop structure has a stem length of at least 21 nucleotides.

14. A method according to any one of the preceding claims, additionally comprising the step of screening for the presence of the precursor sequence predicted in step c), or a homolog thereof, in the genome of a closely related organism.

15. A method according to any one of the preceding claims, wherein erroneous matches may be removed by excluding precursor sequences that fall within the coding sequences of a closely related organism to that from which the miRNA of interest is derived.
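The extension of step c) of claim 1, with the flank sizes named in claim 9, can be sketched as follows. This is a minimal illustration only, not the implementation used by the inventors: the function name `precursor_windows`, its parameters and the clipping at the ends of the genomic sequence are assumptions made for the example. Folding of the resulting windows (step d) would be left to an external tool such as Mfold and is not shown.

```python
def precursor_windows(genome, hit_start, hit_length, long_flank=50, short_flank=10):
    """Return two candidate precursor windows around a putative miRNA hit.

    genome      -- genomic sequence as a string
    hit_start   -- 0-based offset of the putative miRNA in `genome`
    hit_length  -- length of the putative miRNA (e.g. 21-23 nt)
    long_flank, short_flank -- hypothetical defaults taken from "around 50"
                               and "around 10" nucleotides in claim 9

    The first window takes the long flank upstream and the short flank
    downstream; the second window is the reverse arrangement ("vice versa").
    Windows are clipped at the ends of the sequence (an assumption).
    """
    hit_end = hit_start + hit_length
    windows = []
    for upstream, downstream in ((long_flank, short_flank), (short_flank, long_flank)):
        begin = max(0, hit_start - upstream)
        end = min(len(genome), hit_end + downstream)
        windows.append(genome[begin:end])
    return windows
```

For a 21-nucleotide hit, each window is 81 nucleotides long when it is not clipped, which is consistent with the "around 80 nucleotides" of claim 9.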

16. A computer-implemented method for identifying the target molecule of an miRNA of interest, said method comprising the steps of: a) searching a database of nucleic acid sequences to identify a putative target sequence that comprises a homologous reverse complement sequence to the miRNA of interest, wherein i) a search is performed for a target sequence that is complementary to bases 2-7 of the miRNA of interest; ii) a target sequence identified in step i) is extended and it is specified that base pairing between target and miRNA is required in at least 7 of the first 8 bases; iii) a target sequence identified in step ii) is extended to the length of the miRNA sequence plus a number of additional bases, preferably 5 bases, and evaluated for alignment to the entire miRNA; b) predicting the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest; c) selecting as the candidate target molecule, a putative target sequence which is predicted to base pair with the miRNA of interest with a favourable predicted free energy ΔG.

17. A method according to claim 16, wherein the search tool used in step a)i) is a string recognition tool.

18. A method according to claim 16 or claim 17, wherein the stringency of the search is adjusted to allow G:U base pairs.

19. A method according to claim 18, wherein 1 G:U base pair is permitted in positions 2-7.

20. A method according to claim 18, wherein the maximum number of G:U base pairs permitted in positions 2-7 is defined by the possibility of forming G:U base pairs with the miRNA sequence.

21. A method according to any one of claims 16-20, wherein the free energy of base-pairing between the putative target sequence identified in step a) and the miRNA of interest is predicted using the RNAhybrid tool.

22. A method according to any one of claims 16-21, wherein the method is performed iteratively for a second genome, and optionally for one or more further genomes to improve the filter for conservation during evolution and thus reduce false positives due to random matches.

23. A method according to any one of claims 16-22, wherein a comparison is made of the quality of the sequence conservation of the candidate target sites predicted by the method in related genomes to give a factor that scales the relevance of the score for the free energy of folding.
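Steps a)i) and a)ii) of claim 16, together with the G:U allowance of claims 18 and 19, can be sketched as a simple string search. The sketch below is illustrative only and rests on stated assumptions: the names `pair_type` and `find_seed_sites` are hypothetical, both sequences are read 5' to 3', and a G:U pair at base 1 or base 8 is counted as paired only when G:U pairs are permitted at all. The free energy prediction of step b) (for example with RNAhybrid) is not shown.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(mirna_base, target_base):
    """Classify a base pair as 'wc' (Watson-Crick), 'gu' (wobble) or None."""
    pair = (mirna_base, target_base)
    if pair in WATSON_CRICK:
        return "wc"
    if pair in WOBBLE:
        return "gu"
    return None


def find_seed_sites(mirna, target, max_gu=0):
    """Return 0-based start offsets of 8-nt target windows that satisfy
    steps a)i) and a)ii): miRNA bases 2-7 all paired (with at most
    `max_gu` G:U pairs) and at least 7 of miRNA bases 1-8 paired."""
    mirna = mirna.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    sites = []
    for start in range(len(target) - 7):
        window = target[start:start + 8]
        # Antiparallel pairing: window[k] faces miRNA base 8-k (1-based).
        types = [pair_type(mirna[7 - k], window[k]) for k in range(8)]
        seed = types[1:7]  # miRNA bases 7 down to 2
        if None in seed or seed.count("gu") > max_gu:
            continue
        # Bases 8 and 1 sit at the two ends of the window.
        ends_paired = sum(
            1 for t in (types[0], types[7])
            if t == "wc" or (t == "gu" and max_gu > 0)
        )
        if 6 + ends_paired >= 7:
            sites.append(start)
    return sites
```

Called with SEQ ID NO:1 as the miRNA and max_gu=0, the search accepts only Watson-Crick pairing in bases 2-7; setting max_gu=1 corresponds to claim 19.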

24. An isolated miRNA molecule or a target molecule of an miRNA of interest identified or identifiable by a method according to any one of the preceding claims.

25. An isolated miRNA molecule according to claim 24 that functions to suppress apoptosis and stimulate cell proliferation.

26. An isolated miRNA molecule according to claim 25, that comprises or consists of a) the nucleotide sequence GUGAGAUCAUUUUGAAAGCUG (SEQ ID NO:1); or b) is a fragment or functional equivalent thereof that functions to inhibit apoptosis and control cell proliferation.

27. An isolated miRNA molecule according to claim 26, selected from the group listed in Table 1.

28. A target nucleic acid molecule according to claim 24 that is a target for a bantam miRNA or a homologue thereof, and which forms part of the hid gene or a functional equivalent thereof.

29. A target nucleic acid molecule according to claim 28 that comprises or consists of one of the nucleotide sequences selected from the group of UAGUUUUCACAAUGAUCUCGGGGGGACGUGAGAUCAUUUUGAAAGCUG (SEQ ID NO:2), GCCAUAUUCAAAUUGGUCUCACGGGGACGUGAGAUCAUUUUGAAAGCUGGC (SEQ ID NO:3), or is a fragment or functional equivalent thereof that functions as a target molecule for bantam miRNA.

30. An assay to measure and visualise miRNA expression comprising comparing the expression levels of a reporter gene in: a) a first cell that comprises a reporter gene and which encodes a target sequence for the miRNA of interest in the 3'UTR of the reporter gene; b) a second cell that is genetically identical to the first cell with the exception that the reporter gene contains no target sequence for the miRNA of interest.

31. A transgenic animal expressing the reporter gene recited in claim 30 under the control of a promoter wherein said animal is not a human.

32. The transgenic animal of claim 31, wherein the animal is a vertebrate or invertebrate.

33. A transgenic plant expressing the reporter gene recited in claim 30 under the control of a promoter.

34. A method of creating a conserved 3' UTR database for a candidate organism comprising the following steps: a) taking known 3' UTRs of a first organism and selecting those that are longer than a certain threshold length; b) identifying homologous UTR sequences in the genome of another organism to define UTR sequences conserved in evolution and hence likely to have a function; and c) selecting only those UTR sequences from the organism in (a) that are conserved, as identified in step (b), for inclusion in the 3'UTR database.

35. A method according to claim 34, wherein the threshold length of the 3' UTRs taken from the first organism is more than 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, more preferably, more than 50 nucleotides in length.

36. A method according to claim 34 or claim 35, which additionally comprises the step (a2) between steps (a) and (b), wherein duplicate UTRs from different splice variants of the same transcript of the first organism of step (a) are removed, so reducing the number of target sequences taken into step (b).

37. A method according to any one of claims 34 to 36, wherein homologous UTR sequences are identified in step (b) by a method comprising the steps of:

(i) generating an amino acid sequence by translating the 3' nucleotides of the ORF of the transcriptome to which the 3'UTR of step (a) belongs;

(ii) using the amino acid sequence of step (i) in a homology search of the genome of the candidate organism for which a target sequence is to be identified and selecting a region from the genome that encodes a polypeptide sequence that gives an E value below a significance threshold when compared with the amino acid sequence of step (i);

(iii) selecting only those regions from step (ii) that encode the C-terminal-most amino acid residues and have a sequence identity above a significance threshold over a region spanning the C-terminal-most amino acids;

(iv) comparing the 3'UTR sequence from organism one of step (a) with a region of nucleotides downstream of the region of step (iii) from organism two and selecting those with an E value of equal to or less than a significance threshold.

38. A method according to any one of claims 34-37, wherein in step a), known or predicted 3' UTRs are taken from more than one organism.
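Steps (a) and (a2) of claims 34 to 36 can be sketched as a filter over annotated 3' UTR records. This is an illustration under stated assumptions rather than the method as implemented: the function name `select_candidate_utrs`, the record layout (gene identifier, transcript identifier, UTR sequence) and the rule that two UTRs are duplicates when they belong to the same gene and have identical sequence are all hypothetical. The homology search of step (b) is not shown.

```python
def select_candidate_utrs(records, min_length=50):
    """Filter 3' UTR records before the conservation search of step (b).

    records    -- iterable of (gene_id, transcript_id, utr_sequence) tuples
                  (hypothetical layout chosen for this example)
    min_length -- threshold length; only UTRs longer than this are kept,
                  following "more than 50 nucleotides" in claim 35
    """
    seen = set()
    kept = []
    for gene_id, transcript_id, utr in records:
        utr = utr.upper()
        # Step (a): keep only UTRs longer than the threshold length.
        if len(utr) <= min_length:
            continue
        # Step (a2): drop duplicate UTRs from splice variants of one gene
        # (assumed here to mean identical sequence under the same gene_id).
        key = (gene_id, utr)
        if key in seen:
            continue
        seen.add(key)
        kept.append((gene_id, transcript_id, utr))
    return kept
```

The first record seen for each duplicated UTR is the one retained, so the order of the input determines which transcript identifier survives.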