US20060263798A1 - System and method for identification of MicroRNA precursor sequences and corresponding mature MicroRNA sequences from genomic sequences - Google Patents
System and method for identification of MicroRNA precursor sequences and corresponding mature MicroRNA sequences from genomic sequences
- Publication number
- US20060263798A1 (application No. US 11/351,951)
- Authority
- US
- United States
- Prior art keywords
- patterns
- attributes
- nucleotide sequence
- microrna
- sequences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/11—DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
- C12N15/111—General methods applicable to biologically active non-coding nucleic acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2310/00—Structure or type of the nucleic acid
- C12N2310/10—Type of nucleic acid
- C12N2310/14—Type of nucleic acid interfering N.A.
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2320/00—Applications; Uses
- C12N2320/10—Applications; Uses in screening processes
- C12N2320/11—Applications; Uses in screening processes for the determination of target sites, i.e. of active nucleic acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2330/00—Production
- C12N2330/10—Production naturally occurring
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
Definitions
- the present invention relates to genes and, more particularly, to ribonucleic acid interference molecules and their role in gene expression.
- RNA molecules can act as potent gene expression regulators either by inducing messenger-RNA (mRNA) degradation or by inhibiting translation. This activity is summarily referred to as post-transcriptional gene silencing, or PTGS for short.
- mRNA messenger-RNA
- PTGS post-transcriptional gene silencing
- An alternative name by which it is also known is RNA interference, or RNAi.
- PTGS/RNAi has been found to function as a mediator of resistance to endogenous and exogenous pathogenic nucleic acids, and also as a regulator of the expression of genes inside cells.
- gene expression refers generally to the transcription of messenger-RNA (mRNA) from a gene, and, e.g., its subsequent translation into a functional protein.
- one class of RNA molecules involved in gene expression regulation comprises microRNAs, which are endogenously encoded and regulate gene expression by either disrupting the translation processes or by degrading mRNA transcripts, e.g., inducing post-transcriptional repression of one or more target sequences.
- RNAi/post-transcriptional gene silencing mechanism allows an organism to employ short RNA sequences to either degrade or disrupt translation of complementary mRNA transcripts.
- Early studies suggested only a limited role for RNAi, that of a defense mechanism against foreign born pathogens.
- the subsequent discovery of many endogenously-encoded microRNAs pointed towards the possibility of this being a control mechanism that is more general in nature. Recent evidence has led the community to hypothesize that a wider spectrum of biological processes is affected by RNAi, thus extending the range of this presumed control layer.
- the methods begin by predicting the RNA secondary structure of candidate sequences using any of the available prediction programs (e.g. “RNAfold” or “mfold”).
- the methods focus on only those sequences that are predicted to fold into the familiar hairpin-like structure of microRNA precursors, subselecting those that satisfy additional sequence or other properties (Lai E C, Tomancak P, Williams R W, Rubin G M. (2003) Computational identification of Drosophila microRNA genes. Genome Biol 4(7): R42; Lim L P, Glasner M E, Yekta S, Burge C B, Bartel D P (2003b) Vertebrate microRNA genes.
- the second type of approach uses the observation that the two arms of the hairpin of a precursor exhibit a much higher degree of sequence conservation than the regions outside the precursor and also the region in the loop of the precursor. This observation was combined with additional, known properties of microRNAs and led to the successful discovery of many novel mature microRNA and microRNA precursors (Berezikov, E., Guryev, V., van de Belt, J., Wienholds, E., Plasterk, R. H. A., Cuppen, E. (2005) Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120: 21-24).
- the inventive approach that we present in the discussion below represents a departure from the above two schools of thought. Even though the inventive approach exploits sequence conservation to discover microRNA precursors, it does so locally, i.e. the approach seeks to leverage the existence of locally conserved sequence fragments that are shared by known precursors that could potentially be distant from a phylogenetic standpoint.
- a method for determining whether a nucleotide sequence contains a microRNA precursor comprises the following steps.
- One or more patterns are generated by processing a collection of known microRNA precursor sequences.
- One or more attributes are assigned to the one or more generated patterns. Only the one or more patterns whose one or more attributes satisfy at least one criterion are subselected, and then the one or more subselected patterns are used to analyze the nucleotide sequence.
- a method for identifying a mature microRNA sequence in a microRNA precursor sequence comprises the following steps.
- One or more patterns are generated by processing a collection of known mature microRNA sequences.
- the one or more patterns are filtered, and then used to locate instances of the one or more filtered patterns in one or more candidate precursor sequences.
- FIG. 1A is a flow diagram illustrating a method for identifying a microRNA precursor sequence, according to one embodiment of the invention
- FIG. 1B is a flow diagram illustrating a method for identifying a mature microRNA sequence in a microRNA precursor sequence, according to one embodiment of the invention
- FIG. 2A is a graph illustrating a genomic sequence hit with a microRNA-precursor-pattern-set, the graph further illustrating the number of pattern hits with instances in a particular genomic neighborhood as a function of position;
- FIG. 2B is a graph illustrating detail of the region shown in FIG. 2A ;
- FIG. 2C is a graph illustrating detail of the region shown in FIG. 2B ;
- FIG. 2D is an illustration of the predicted secondary structure of cel-mir-273 as determined by RNAfold
- FIG. 3A is a graph illustrating the distribution of pattern-hit-scores for all C. elegans microRNAs within RFAM (solid line) versus generic hairpins (dashed line).
- FIG. 3B is a graph illustrating the distribution of predicted folding energies for all C. elegans microRNAs (solid line) and generic hairpins (dashed line).
- FIG. 3C is an X-Y scatter plot illustrating pattern hits versus folding energy for C. elegans microRNAs (light-grey-colored dots) and generic hairpins (dark-grey-colored dots);
- FIG. 4 is a table summarizing the microRNA-precursor predictions for the genomes of C. elegans, D. melanogaster, M. musculus and H. sapiens;
- FIG. 5 is a block diagram illustrating a system for determining whether a nucleotide sequence contains a microRNA precursor, in accordance with one embodiment of the invention.
- RNA molecules relate to ribonucleic acid (RNA) molecules and their role in gene expression regulation.
- RNA ribonucleic acid
- the inventive approach obviates the need for cross-species sequence conservation, and is thus readily applicable to any genomic sequence independent of whether it has orthologues in other species.
- the inventive approach is demonstrated herein by first showing that it correctly identifies many of the currently known microRNA precursors and mature microRNAs. We describe an implemented prototype system and use the system to analyze computationally the C. elegans, D. melanogaster, M. musculus and H. sapiens genomes.
- microRNA target sites directly from genomic sequences.
- a method for identifying microRNA target sites is described in detail in the above-mentioned related U.S. patent application (YOR920060077US1), the disclosure of which is incorporated herein.
- FIG. 1A is a flow diagram illustrating a method for identifying a microRNA precursor sequence, according to one embodiment of the invention.
- a pattern-based methodology which discovers variable-length sequence fragments (‘patterns’) that recur in an input database a user-specified, minimum number of times.
- the number of discovered patterns, the exact locations of each instance of the discovered pattern, the actual extent of each pattern, and finally the number of instances that a pattern has in the input database are, of course, not known ahead of time.
- the pattern discovery problem is a much ‘harder’ problem than database searching, a task with which most biologists are familiar and which has been in mainstream use for more than 20 years.
- pattern discovery is an NP-hard problem whereas database searching can be solved in polynomial time.
- Step 110 is the generation of patterns.
- the generation of patterns (step 110 ) is comprised of steps 112 and 114 , as shown in FIG. 1A .
- Step 112 is the step of processing known microRNA precursors to discover intra- and inter-species patterns of conserved sequence.
- the recurrent instances of conserved sequence segments can be represented with the help of regular expressions each with a differing degree of descriptive power.
- the expressions used in this disclosure are composed of literals (solid characters from the alphabet of permitted symbols), wildcards (each denoted by ‘.’ and representing any character), and sets of equivalent literals (each set being a small number of symbols, any one of which can occupy the corresponding position).
- the distance between two consecutive occupied positions is assumed to be unchanged across all instances of the pattern (i.e., ‘rigid patterns’).
- the pattern [LIV].[LIV].D.ND[NH].P is an example from the domain of amino acid sequences and describes the calcium binding motif of cadherin proteins.
- the motif in question comprises exactly one of the amino acids {leucine, isoleucine, valine}, followed by any amino acid, followed again by exactly one of the amino acids {leucine, isoleucine, valine}, followed by any amino acid, followed by the negatively charged aspartate, etc.
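The pattern syntax described above (literals, '.' wildcards, bracketed sets, fixed spacing) happens to be a subset of ordinary regular-expression syntax, so locating instances can be sketched with Python's standard `re` module. This is only an illustration: the function name `find_instances` and the toy sequence are made up here, and the patent does not state how its own matching is implemented.

```python
import re

# The cadherin pattern quoted in the text; its syntax is also a valid regex.
PATTERN = "[LIV].[LIV].D.ND[NH].P"

def find_instances(pattern, sequence):
    """Return (start, matched_text) for every instance, overlapping ones included."""
    # A lookahead with a capture group lets finditer report overlapping matches.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical toy sequence, invented for illustration only.
toy = "MKLAVQDSNDNAPQQ"
print(find_instances(PATTERN, toy))
```

Because the patterns are rigid, every instance has the same length as the pattern itself (11 positions here).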
- the presence of a statistically significant pattern in an unannotated amino acid sequence is taken as a sufficient condition to suggest the presence of the feature captured by the pattern.
- the symbol set that is used comprises the four nucleotides {A,C,G,T} found in a deoxyribonucleic acid (DNA) sequence.
- the input set which we processed in order to discover patterns is Release 3.0 of the RFAM database, from January 2004 (Griffiths-Jones, S. et al. Rfam: an RNA family database. Nucleic Acids Res., 31 439-441 (2003)).
- the use of a more-than-18-month-old release of the database as our training set was intentional. We wanted to gauge how well our method would perform if presented only with the knowledge that was available in the literature in January 2004. The analysis has since been repeated using subsequent releases of the RFAM database.
- the present invention makes use of the sequence information from all the microRNAs which are contained in the RFAM release, and independent of the organism in which they originate.
- the release in question contains microRNAs from the human, mouse, rat, worm, fly and several plant genomes.
- the simultaneous processing of microRNA sequences from distinct organisms permits the discovery of conserved sequences both within and across species and makes the method suitable for the analysis of more than one organism. Release 3.0 of RFAM (January 2004), which was used as our input, contained 719 microRNA precursor sequences.
- This non-redundant input was then processed using the Teiresias algorithm (Rigoutsos, I. and Floratos, A. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14 55-67 (1998)) in order to discover intra- and inter-species patterns of sequence conservation.
- the combinatorial nature of the algorithm and the guaranteed discovery of all patterns contained in the processed input makes Teiresias a good choice for addressing this task.
- the nature of the patterns that can be discovered is controlled by three parameters: L, the minimum number of symbols participating in a pattern; W, the maximum permitted span of any L consecutive (but not necessarily contiguous) symbols in a pattern; and K, the minimum number of instances required of a pattern before it can be reported.
- the Teiresias algorithm requires that the three parameters L, W and K be set.
- the parameter L controls the minimum possible size of the discovered patterns.
- the parameter W satisfies the inequality W ≥ L and controls the ‘degree of conservation’ across the various instances of the reported patterns. Setting W to smaller (respectively larger) values permits fewer (respectively more) mismatches across the instances of each of the discovered patterns. Finally, the parameter K controls the minimum number of instances that a pattern must have before it can be reported.
- for a given choice of L, W and K, Teiresias guarantees that it will report all patterns that have K or more appearances in the processed input and are such that any L consecutive (but not necessarily contiguous) positions span at most W positions. It is important to stress that even though no pattern can have fewer than L literals, the patterns' maximum length is unconstrained and limited only by the size of the database.
- Setting L to small values permits the identification of shorter conserved motifs that may be present in the processed input. As mentioned above, even if L is set to small values, patterns that are longer than L will be discovered and reported. Generally speaking, in order for a short motif to be considered statistically significant it will need to have a large number of copies in the processed input. Setting L to large values will generally permit the identification of statistically significant motifs even if these motifs repeat only a small number of times. This increase in specificity will happen at the expense of a potentially significant decrease in sensitivity.
- K = 2. This is a natural consequence of the fact that we generate conserved sequence motifs through an unsupervised pattern discovery scheme. The value of 2 is the smallest possible one (a pattern or motif, by definition, must appear at least two times in the processed input) and guarantees that all patterns will be discovered.
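The role of L and W can be made concrete with a small check of the density constraint described above: every run of L consecutive occupied positions must span at most W positions. The sketch below is not the Teiresias algorithm itself (it discovers nothing); it only tests one given pattern. The function names are hypothetical, and treating a bracketed set as a single occupied position is an assumption.

```python
def literal_positions(pattern):
    """Offsets of occupied (non-wildcard) positions in a rigid pattern.

    Assumption: a bracketed set such as [CT] counts as one occupied position.
    """
    positions, offset, i = [], 0, 0
    while i < len(pattern):
        if pattern[i] == "[":
            i = pattern.index("]", i)  # jump to the closing bracket of the set
            positions.append(offset)
        elif pattern[i] != ".":
            positions.append(offset)
        i += 1
        offset += 1
    return positions

def satisfies_density(pattern, L, W):
    """True if the pattern has at least L occupied positions and every
    window of L consecutive occupied positions spans at most W positions."""
    pos = literal_positions(pattern)
    if len(pos) < L:
        return False
    return all(pos[i + L - 1] - pos[i] + 1 <= W
               for i in range(len(pos) - L + 1))

print(satisfies_density("A.CG..T", L=3, W=4))  # C, G, T span 5 positions
print(satisfies_density("A.CG..T", L=3, W=5))
```

Raising W while holding L fixed admits patterns with more wildcards between occupied positions, which is the sense in which W controls the tolerated number of mismatches.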
- Step 114 is the step of statistically filtering the patterns that were generated in step 112 .
- the step of filtering is done by estimating the log-probability of each pattern with the help of a Markov-chain.
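One way such a log-probability could be estimated is sketched below. The text does not give the order of the Markov chain, its training data, or the significance cutoff, so everything here is an assumption: a first-order chain over {A,C,G,T}, add-one pseudocounts, and a forward recursion that sums over the nucleotides allowed at wildcard and bracketed positions. All function names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def train_first_order(sequences, pseudocount=1.0):
    """Estimate nucleotide and transition probabilities from background sequences."""
    init = {a: pseudocount for a in ALPHABET}
    trans = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for a in seq:
            init[a] += 1
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    total = sum(init.values())
    init = {a: c / total for a, c in init.items()}
    for a in ALPHABET:
        row = sum(trans[a].values())
        trans[a] = {b: c / row for b, c in trans[a].items()}
    return init, trans

def tokenize(pattern):
    """Split a pattern into one set of allowed nucleotides per position."""
    tokens, i = [], 0
    while i < len(pattern):
        if pattern[i] == "[":
            j = pattern.index("]", i)
            tokens.append(set(pattern[i + 1:j]))
            i = j + 1
        else:
            tokens.append(set(ALPHABET) if pattern[i] == "." else {pattern[i]})
            i += 1
    return tokens

def pattern_log_probability(pattern, init, trans):
    """Log-probability that a random sequence from the chain matches the pattern."""
    tokens = tokenize(pattern)
    # forward[a]: probability of matching the prefix so far and ending in a
    forward = {a: (init[a] if a in tokens[0] else 0.0) for a in ALPHABET}
    for allowed in tokens[1:]:
        forward = {
            b: (sum(forward[a] * trans[a][b] for a in ALPHABET)
                if b in allowed else 0.0)
            for b in ALPHABET
        }
    return math.log(sum(forward.values()))

# With no training data the pseudocounts give a uniform model.
init, trans = train_first_order([])
print(pattern_log_probability("A.C", init, trans))  # log(1/16)
```

A filter would then keep only patterns whose log-probability falls below some cutoff, i.e. patterns unlikely to arise by chance; the cutoff itself is not stated in this section.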
- Step 120 is comprised of step 122 and step 124 , as shown in FIG. 1A .
- Step 122 is the step of locating instances of patterns in the genomic sequences of interest.
- Step 124 is the step of identifying regions in the genomic sequences of minimum length and supported by a minimum number of pattern hits.
- An instance of the microRNA precursor pattern generates a “pattern hit” which covers as many nucleotides as the span of the corresponding pattern; this is repeated for all patterns.
- Each pattern contributes a support of +1 to all of the genomic sequence locations spanned by its instance.
- a given nucleotide position may be hit by more than one pattern.
- Segments of contiguous sequence locations that received more than 60 patterns and spanned at least 60 positions were excised together with a 30-nucleotide-long flanking sequence at each end.
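The voting and excision just described can be sketched as follows. The thresholds (more than 60 patterns, at least 60 positions, 30-nucleotide flanks) are the ones stated in the text; the function names, the `(start, span)` hit representation and the half-open interval convention are assumptions made for the example.

```python
def support_profile(length, hits):
    """Per-position support: each (start, span) hit adds +1 to every position it covers."""
    support = [0] * length
    for start, span in hits:
        for pos in range(start, min(start + span, length)):
            support[pos] += 1
    return support

def candidate_regions(support, threshold=60, min_length=60, flank=30):
    """Return half-open (start, end) intervals, flanks included, for every
    contiguous run with support > threshold that is at least min_length long."""
    regions, run_start = [], None
    # A trailing 0 acts as a sentinel that closes a run reaching the sequence end.
    for pos, count in enumerate(list(support) + [0]):
        if count > threshold:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            if pos - run_start >= min_length:
                regions.append((max(0, run_start - flank),
                                min(len(support), pos + flank)))
            run_start = None
    return regions

# Toy numbers, far smaller than the real thresholds, for illustration only.
profile = [0, 0, 3, 3, 3, 3, 0, 0, 3, 0]
print(candidate_regions(profile, threshold=2, min_length=3, flank=1))
```

Flanks are clipped at the sequence boundaries so that a region near either end is still reported.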
- Step 130 is the step of subselecting among candidate regions and reporting the subselected regions.
- Step 130 is comprised of step 132 , step 134 , step 136 and step 138 , as shown in FIG. 1A .
- Step 132 is the step of predicting the RNA secondary structure of the candidate sequences.
- Vienna package software Hofacker, I. L. et al. Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. 125 167-188 (1994)
- we could have used the ‘mfold’ algorithm to predict the hybrid's secondary RNA structure (Matthews, D. H., Sabina, J., Zuker, M. and Turner, D. H. Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure. J. Mol. Biol. 288, 911-940 (1999)).
- Step 134 is the step of filtering candidate sequences based on the energy of the structure. Only those sequences whose predicted Gibbs free energy was −18 Kcal/mol or lower were kept and reported.
- Step 136 is the step of further filtering candidate sequences based on number of bulges.
- Step 138 is the step of reporting candidate sequences as microRNA precursors.
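A minimal sketch of the structure-based subselection follows. It assumes that the dot-bracket structure and folding energy of each candidate have already been computed by an external folding program such as RNAfold; no folding is performed here. The energy cutoff of −18 Kcal/mol comes from the text. The bulge-count criterion of step 136 is not specified in this section, so the sketch substitutes a simpler, hypothetical check that the structure is a single hairpin; that check is an assumption, not the patent's filter.

```python
def is_single_hairpin(structure):
    """True if every '(' precedes every ')' in a dot-bracket string,
    i.e. the structure is one stem closing one loop (assumed criterion)."""
    return "(" in structure and structure.rfind("(") < structure.find(")")

def filter_candidates(candidates, max_energy=-18.0):
    """candidates: iterable of (name, dot_bracket, energy_kcal_per_mol) tuples.

    Keeps candidates whose energy is at or below max_energy and whose
    structure passes the single-hairpin check.
    """
    return [c for c in candidates
            if c[2] <= max_energy and is_single_hairpin(c[1])]

# Hypothetical candidates; structures and energies are invented for illustration.
candidates = [
    ("cand1", "((((....))))", -21.3),   # kept
    ("cand2", "((((....))))", -12.0),   # energy too high
    ("cand3", "((..))((..))", -25.0),   # two hairpins, rejected
]
print([name for name, _, _ in filter_candidates(candidates)])
```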
- the results (e.g., predictions) of the above processes can be optionally evaluated through experiments.
- FIG. 1B is a flow diagram illustrating a method for identifying a mature microRNA sequence in a microRNA precursor sequence, according to one embodiment of the invention.
- Step 140 is the step of generating patterns.
- Step 140 is comprised of step 142 and step 144 , as shown in FIG. 1B .
- Step 142 is the step of processing known microRNAs to discover intra- and inter-species patterns of conserved sequence. Similar to step 112 , we downloaded 644 mature microRNAs from the RFAM, Release 3.0 (January, 2004). Subsequent implementations of our method described herein have used more recent versions of the RFAM database.
- Step 144 is the step of filtering discovered patterns, keeping only statistically significant patterns.
- the final set comprised 354 sequences of mature microRNAs such that no two remaining sequences agreed on more than 90% of their positions.
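The text does not say how the 90% agreement between two sequences was measured (for example, with or without alignment gaps) or in what order sequences were considered. Under the simplifying assumptions of an ungapped, left-aligned comparison and a greedy first-come selection, a redundancy filter of this kind might look like the sketch below; the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of positions at which two left-aligned, ungapped sequences agree,
    relative to the longer sequence. Assumes both sequences are non-empty."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def remove_redundancy(sequences, max_identity=0.9):
    """Greedily keep a sequence only if it agrees with every previously
    kept sequence on at most max_identity of its positions."""
    kept = []
    for seq in sequences:
        if all(identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept

# Toy input, invented for illustration.
seqs = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTTT", "TTTTTTTTTT"]
print(remove_redundancy(seqs))
```

A greedy pass depends on input order; a different ordering can retain a different, equally valid non-redundant subset.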
- Step 150 is the step of identifying mature regions.
- Step 150 is comprised of step 152 , step 154 and step 156 , as shown in FIG. 1B .
- Step 152 is the step of locating instances of patterns in the candidate precursor sequences.
- using the 233,554 mature microRNA patterns that we derived from the processed mature microRNA sequences, we sought the instances of these patterns in the sequences of microRNA precursors that were identified above. Similar methods as described above in step 122 are incorporated herein.
- Step 154 is the step of identifying regions in the candidate precursor sequences of a minimum length and supported by a minimum number of pattern hits.
- a pattern's instance contributes a vote of “+1” to all the UTR locations that the instance spans. All regions that did not overlap with the putative loop of the precursor and comprised contiguous blocks of locations that were hit by at least 60 patterns and were at least 18 nucleotides long were reported as the mature microRNAs corresponding to this precursor. Similar methods as described above in step 124 are incorporated herein.
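The mature-region rule can be sketched as a scan over a per-position support profile. The sketch assumes the support threshold is meant as "at least 60", that the putative loop is supplied as a half-open `(start, end)` interval taken from the predicted secondary structure, and that the support counts have already been accumulated; the function name is hypothetical.

```python
from itertools import groupby

def mature_regions(support, loop, min_support=60, min_length=18):
    """Return half-open (start, end) blocks whose support is >= min_support,
    whose length is >= min_length, and which do not overlap the loop."""
    regions, pos = [], 0
    # groupby splits the profile into maximal runs of supported/unsupported positions.
    for supported, run in groupby(support, key=lambda c: c >= min_support):
        length = len(list(run))
        start, end = pos, pos + length
        pos = end
        overlaps_loop = start < loop[1] and loop[0] < end
        if supported and length >= min_length and not overlaps_loop:
            regions.append((start, end))
    return regions

# Toy profile with scaled-down thresholds, for illustration only.
profile = [5] * 4 + [0] * 3 + [5] * 4 + [1] + [5] * 2
print(mature_regions(profile, loop=(5, 6), min_support=5, min_length=3))
```

Under this rule a precursor can yield a candidate mature region on either arm of the hairpin, since blocks on both sides of the loop are eligible.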
- Step 156 is the step of reporting regions as mature microRNAs.
- the results (e.g., predictions) of the above processes can be optionally evaluated through experiments.
- FIGS. 2 A-D illustrate how, for the genomic sequence under consideration, the microRNA-precursor-patterns accumulate in the region of the precursor whereas the microRNA-precursor-patterns are absent in the other areas.
- approximately 500 patterns end up contributing to genomic location 14,946,975.
- the contiguous genomic locations that receive support from the microRNA-precursor-patterns correspond to the known span of cel-miR-273, which is indicated by the light-grey rectangle in FIG. 2B .
- the region that received the substantial non-zero precursor support was examined for instances of the mature-microRNA-pattern-set.
- in FIG. 2C we show how well the inventive approach localized the mature microRNA section within the cel-miR-273 precursor.
- the actual span of the known mature microRNA is indicated by the light-grey background.
- FIG. 3A is a graph illustrating the distribution of pattern-hit-scores for all C. elegans microRNAs within RFAM (solid line) versus generic hairpins (dashed line).
- FIG. 3B is a graph illustrating the distribution of predicted folding energies for all C. elegans microRNAs (solid line) and generic hairpins (dashed line).
- FIG. 3C is an X-Y scatter plot illustrating pattern hits versus folding energy for C. elegans microRNAs (light-grey-colored dots) and generic hairpins (dark-grey-colored dots).
- This hairpin set was designed so as to comprise sequences whose geometric features were characteristic of all known microRNA precursors, namely, a hairpin-shaped secondary structure and lengths in the interval [60,120] nucleotides.
- the dashed-line curve in FIG. 3A shows the probability density function for the percentage of the generic hairpins that contained a certain number of pattern instances.
- Setting the support threshold to 60 pattern-instances captures 104 of the 114 known C. elegans microRNAs or 91%.
- less than 1% of the members of the generic hairpin set exceed this threshold.
- the estimates we generated for the false-positive ratio when predicting microRNA precursors in the other three genomes ranged from ~1% (for hairpins with Gibbs energies of −25 Kcal/mol or less) to ~2% (for hairpins with Gibbs energies of −18 Kcal/mol or less). Given that the four genomes span a very wide evolutionary spectrum, it is reasonable to assume that these values are characteristic of our method and independent of the identity of the genome that is used.
- FIG. 4 is a table summarizing the microRNA-precursor predictions for the genomes of C. elegans, D. melanogaster, M. musculus and H. sapiens.
- the method correctly identifies a very large percentage of the known microRNA precursors in these four genomes, for the used thresholds. Additionally, we also predict many novel microRNA precursors. Their numbers are significantly higher than what has previously been discussed in the literature. In light of the very low error rate estimates of our method, we believe that a substantial number of our microRNA precursor predictions are likely correct.
- FIG. 5 is a block diagram of a system 500 for determining whether a nucleotide sequence contains a microRNA precursor in accordance with one embodiment of the present invention.
- System 500 comprises a computer system 510 that interacts with a media 550 .
- Computer system 510 comprises a processor 520 , a network interface 525 , a memory 530 , a media interface 535 and an optional display 540 .
- Network interface 525 allows computer system 510 to connect to a network
- media interface 535 allows computer system 510 to interact with media 550 , such as Digital Versatile Disk (DVD) or a hard drive.
- DVD Digital Versatile Disk
- the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer-readable medium having computer-readable code means embodied thereon.
- the computer-readable program code means is operable, in conjunction with a computer system such as computer system 510 , to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein.
- the computer-readable code is configured to generate patterns by processing a collection of already known mature microRNA sequences; assign one or more attributes to the generated patterns; subselect only the patterns whose attributes satisfy certain criteria; generate the reverse complement of the subselected patterns; and use the reverse complement of the subselected patterns to analyze the nucleotide sequence.
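Taking the reverse complement of a pattern, rather than of a plain sequence, means reversing the order of its positions and complementing each literal or bracketed set while leaving wildcards untouched. A minimal sketch over the DNA alphabet {A,C,G,T} follows; the function name and the choice to sort the members of a complemented set are assumptions made for the example.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# One token per pattern position: a bracketed set, a literal, or a wildcard.
TOKEN = re.compile(r"\[[ACGT]+\]|[ACGT.]")

def reverse_complement_pattern(pattern):
    """Reverse the positions of a rigid pattern and complement each one."""
    tokens = TOKEN.findall(pattern)
    if "".join(tokens) != pattern:
        raise ValueError("unsupported pattern syntax: " + pattern)
    out = []
    for tok in reversed(tokens):
        if tok == ".":
            out.append(".")  # a wildcard stays a wildcard
        elif tok.startswith("["):
            members = sorted(COMPLEMENT[c] for c in tok[1:-1])
            out.append("[" + "".join(members) + "]")
        else:
            out.append(COMPLEMENT[tok])
    return "".join(out)

print(reverse_complement_pattern("A[CT].G"))
```

Applying the function twice returns the original pattern whenever the members of each bracketed set are already in sorted order.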
- the computer-readable medium may be a recordable medium (e.g., floppy disks, hard drive, optical disks such as a DVD, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used.
- the computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.
- Memory 530 configures the processor 520 to implement the methods, steps, and functions disclosed herein.
- the memory 530 could be distributed or local and the processor 520 could be distributed or singular.
- the memory 530 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices.
- the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by processor 520 . With this definition, information on a network, accessible through network interface 525 , is still within memory 530 because the processor 520 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor 520 generally contains its own addressable memory space. It should also be noted that some or all of computer system 510 can be incorporated into an application-specific or general-use integrated circuit.
- Optional video display 540 is any type of video display suitable for interacting with a human user of system 500 .
- video display 540 is a computer monitor or other similar video display.
- the invention may be implemented in a network-based implementation, such as, for example, the Internet.
- the network could alternatively be a private network and/or local network.
- the server may include more than one computer system. That is, one or more of the elements of FIG. 5 may reside on and be executed by their own computer system, e.g., with its own processor and memory.
- the methodologies of the invention may be performed on a personal computer and output data transmitted directly to a receiving module, such as another personal computer, via a network without any server intervention.
- the output data can also be transferred without a network.
- the output data can be transferred by simply downloading the data onto, e.g., a floppy disk, and uploading the data on a receiving module.
- a) the inventive approach obviates the need to enforce a cross-species conservation filtering before reporting results, thus allowing the discovery of microRNA precursors that may not be shared even by closely related species; b) the inventive approach can be applied to the analysis of any genome that potentially harbors endogenous microRNAs without the need to be retrained each time.
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Chemical & Material Sciences (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Zoology (AREA)
- General Engineering & Computer Science (AREA)
- Organic Chemistry (AREA)
- Wood Science & Technology (AREA)
- Biochemistry (AREA)
- Microbiology (AREA)
- Plant Pathology (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
- This application claims the benefit of U.S. Provisional Application No. 60/652,499, filed Feb. 11, 2005, the disclosure of which is incorporated by reference herein.
- This application is related to U.S. patent application entitled “System and Method for Identification of MicroRNA Target Sites and Corresponding Targeting MicroRNA Sequences,” Attorney Docket Number YOR920060077US1, filed concurrently herewith, the disclosure of which is incorporated by reference herein. Also, this application is related to U.S. patent application entitled “Ribonucleic Acid Interference Molecules,” Attorney Docket Number YOR920040675US2, filed concurrently herewith, the disclosure of which is incorporated by reference herein.
- The present invention relates to genes and, more particularly, to ribonucleic acid interference molecules and their role in gene expression.
- The ability of an organism to regulate the expression of its genes is of central importance to life. A breakdown in this homeostasis leads to disease states, such as cancer, where a cell multiplies uncontrollably, to the detriment of the organism. The general mechanisms utilized by organisms to maintain this gene expression homeostasis are the focus of intense scientific study.
- It recently has been discovered that some cells are able to down-regulate their gene expression through certain ribonucleic acid (RNA) molecules. Namely, RNA molecules can act as potent gene expression regulators either by inducing messenger-RNA (mRNA) degradation or by inhibiting translation. This activity is summarily referred to as post-transcriptional gene silencing, or PTGS for short. An alternative name by which it is also known is RNA interference, or RNAi. PTGS/RNAi has been found to function as a mediator of resistance to endogenous and exogenous pathogenic nucleic acids, and also as a regulator of the expression of genes inside cells.
- The term ‘gene expression,’ as used herein, refers generally to the transcription of messenger-RNA (mRNA) from a gene, and, e.g., its subsequent translation into a functional protein. One class of RNA molecules involved in gene expression regulation comprises microRNAs, which are endogenously encoded and regulate gene expression by either disrupting the translation processes or by degrading mRNA transcripts, e.g., inducing post-transcriptional repression of one or more target sequences.
- The RNAi/post-transcriptional gene silencing mechanism allows an organism to employ short RNA sequences to either degrade or disrupt translation of complementary mRNA transcripts. Early studies suggested only a limited role for RNAi, that of a defense mechanism against foreign-born pathogens. However, the subsequent discovery of many endogenously-encoded microRNAs pointed towards the possibility of this being a control mechanism that is more general in nature. Recent evidence has led the community to hypothesize that a wider spectrum of biological processes is affected by RNAi, thus extending the range of this presumed control layer.
- To date, there have been relatively few attempts to devise new methods for finding novel microRNA precursors and their associated mature microRNAs. This is likely connected to a belief that is held by the research community at large according to which all of the relevant mature microRNAs and their precursors for the most important model organisms have already been identified using biochemical methods. The existing methods can be categorized into two basic approaches.
- In the first approach, the methods begin by predicting the RNA secondary structure of candidate sequences using any of the available prediction programs (e.g., “RNAfold” or “mfold”). The methods then focus on only those sequences that are predicted to fold into the familiar hairpin-like structure of microRNA precursors, subselecting those that satisfy additional sequence or other properties (Lai E C, Tomancak P, Williams R W, Rubin G M. (2003) Computational identification of Drosophila microRNA genes. Genome Biol 4(7): R42; Lim L P, Glasner M E, Yekta S, Burge C B, Bartel D P (2003b) Vertebrate microRNA genes. Science 299: 1540; Lim L P, Lau N C, Weinstein E G, Abdelhakim A, Yekta S, Rhoades M W, Burge C B, Bartel D P (2003a) The microRNAs of Caenorhabditis elegans. Genes and Development 17: 991-1008; I. Bentwich et al., “Identification of hundreds of conserved and nonconserved human microRNAs,” Nature Genetics, published online Jun. 19, 2005. DOI: 10.1038/ng1590).
- The second type of approach uses the observation that the two arms of the hairpin of a precursor exhibit a much higher degree of sequence conservation than the regions outside the precursor and also the region in the loop of the precursor. This observation was combined with additional, known properties of microRNAs and led to the successful discovery of many novel mature microRNA and microRNA precursors (Berezikov, E., Guryev, V., van de Belt, J., Wienholds, E., Plasterk, R. H. A., Cuppen, E. (2005) Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120: 21-24).
- The inventive approach that we present in the discussion below represents a departure from the above two schools of thought. Even though the inventive approach exploits sequence conservation to discover microRNA precursors, the inventive approach does so locally, i.e. the approach seeks to leverage the existence of locally conserved sequence fragments that are shared by known precursors that could potentially be distant from a phylogenetic standpoint.
- A better understanding of the mechanism of the RNA interference process would benefit the fight against disease, drug design and host defense mechanisms.
- A method for identifying microRNA precursor sequences and corresponding mature microRNA sequences from genomic sequences is provided. For example, in one aspect of the invention, a method for determining whether a nucleotide sequence contains a microRNA precursor comprises the following steps. One or more patterns are generated by processing a collection of known microRNA precursor sequences. One or more attributes are assigned to the one or more generated patterns. Only the one or more patterns whose one or more attributes satisfy at least one criterion are subselected, and then the one or more subselected patterns are used to analyze the nucleotide sequence.
- In another aspect of the invention, a method for identifying a mature microRNA sequence in a microRNA precursor sequence comprises the following steps. One or more patterns are generated by processing a collection of known mature microRNA sequences. The one or more patterns are filtered, and then used to locate instances of the one or more filtered patterns in one or more candidate precursor sequences.
- A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description.
- FIG. 1A is a flow diagram illustrating a method for identifying a microRNA precursor sequence, according to one embodiment of the invention;
- FIG. 1B is a flow diagram illustrating a method for identifying a mature microRNA sequence in a microRNA precursor sequence, according to one embodiment of the invention;
- FIG. 2A is a graph illustrating a genomic sequence hit with a microRNA-precursor-pattern-set, the graph further illustrating the number of pattern hits with instances in a particular genomic neighborhood as a function of position;
- FIG. 2B is a graph illustrating detail of the region shown in FIG. 2A;
- FIG. 2C is a graph illustrating detail of the region shown in FIG. 2B;
- FIG. 2D is an illustration of the predicted secondary structure of cel-mir-273 as determined by RNAfold;
- FIG. 3A is a graph illustrating the distribution of pattern-hit-scores for all C. elegans microRNAs within RFAM (solid line) versus generic hairpins (dashed line);
- FIG. 3B is a graph illustrating the distribution of predicted folding energies for all C. elegans microRNAs (solid line) and generic hairpins (dashed line);
- FIG. 3C is an X-Y scatter plot illustrating pattern hits versus folding energy for C. elegans microRNAs (light-grey-colored dots) and generic hairpins (dark-grey-colored dots);
- FIG. 4 is a table summarizing the microRNA-precursor predictions for the genomes of C. elegans, D. melanogaster, M. musculus and H. sapiens; and
- FIG. 5 is a block diagram illustrating a system for determining whether a nucleotide sequence contains a microRNA precursor, in accordance with one embodiment of the invention.
- The teachings of the present invention relate to ribonucleic acid (RNA) molecules and their role in gene expression regulation. As mentioned above, a novel and robust pattern-based approach for the discovery of microRNA precursors and their corresponding mature microRNAs from genomic sequence is provided. Advantageously, the inventive approach obviates the need for cross-species sequence conservation, and is thus readily applicable to any genomic sequence independent of whether it has orthologues in other species. The capabilities of the inventive approach are demonstrated herein by first showing that the inventive approach correctly identifies many of the currently known microRNA precursors and mature microRNAs. We describe an implemented prototype system and use the system to analyze computationally the C. elegans, D. melanogaster, M. musculus and H. sapiens genomes. By way of example, such sequences are described in detail in Application No. 60/652,499, the disclosure of which is incorporated by reference herein. Also, such sequences are described in detail in the above-mentioned related U.S. patent application (YOR920040675US2), the disclosure of which is incorporated herein.
- We estimate that the number of endogenously-encoded microRNA precursors is substantially higher than currently hypothesized. The inventive approach readily extends to the discovery of microRNA target sites directly from genomic sequences. A method for identifying microRNA target sites is described in detail in the above-mentioned related U.S. patent application (YOR920060077US1), the disclosure of which is incorporated herein.
- FIG. 1A is a flow diagram illustrating a method for identifying a microRNA precursor sequence, according to one embodiment of the invention. Underlying the inventive approach is a pattern-based methodology which discovers variable-length sequence fragments (‘patterns’) that recur in an input database a user-specified, minimum number of times. The number of discovered patterns, the exact locations of each instance of the discovered pattern, the actual extent of each pattern, and finally the number of instances that a pattern has in the input database are, of course, not known ahead of time. Computationally, the pattern discovery problem is a much ‘harder’ problem than database searching, a task with which most biologists are familiar and which has been in mainstream use for more than 20 years. Indeed, pattern discovery is an NP-hard problem whereas database searching can be solved in polynomial time.
- We will first describe step 110, the generation of patterns. The generation of patterns (step 110) is comprised of steps 112 and 114, as shown in FIG. 1A.
- Step 112 is the step of processing known microRNA precursors to discover intra- and inter-species patterns of conserved sequence.
- The recurrent instances of conserved sequence segments can be represented with the help of regular expressions each with a differing degree of descriptive power. The expressions used in this disclosure are composed of literals (solid characters from the alphabet of permitted symbols), wildcards (each denoted by ‘.’ and representing any character), and sets of equivalent literals (each set being a small number of symbols, any one of which can occupy the corresponding position). The distance between two consecutive occupied positions is assumed to be unchanged across all instances of the pattern (i.e., ‘rigid patterns’). The pattern [LIV].[LIV].D.ND[NH].P is an example from the domain of amino acid sequences and describes the calcium binding motif of cadherin proteins. The motif in question comprises exactly one of the amino acids {leucine, isoleucine, valine}, followed by any amino acid, followed again by exactly one of the amino acids {leucine, isoleucine, valine}, followed by any amino acid, followed by the negatively charged aspartate, etc. Typically, the presence of a statistically significant pattern in an unannotated amino acid sequence is taken as a sufficient condition to suggest the presence of the feature captured by the pattern.
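- As an illustration only, the rigid-pattern notation described above happens to coincide with ordinary regular-expression syntax, so instances can be located with a general-purpose regex engine. The sketch below is not the implementation used in this disclosure; the function name `find_instances` and the toy sequence are assumptions introduced for the example.

```python
import re


def find_instances(pattern: str, sequence: str) -> list[int]:
    """Return the 0-based start offset of every instance of a rigid pattern.

    The pattern uses literals, '.' wildcards and bracketed sets such as
    [LIV], which is already valid regular-expression syntax.
    """
    # A zero-width lookahead reports overlapping instances as well.
    matcher = re.compile(f"(?={pattern})")
    return [m.start() for m in matcher.finditer(sequence)]


# The cadherin calcium-binding pattern quoted in the text.
CADHERIN = "[LIV].[LIV].D.ND[NH].P"

# Hypothetical toy sequence that embeds exactly one instance at offset 2.
toy = "MKLAVQDGNDNAPQ"
print(find_instances(CADHERIN, toy))

# The same function works on the nucleotide alphabet {A,C,G,T}.
print(find_instances("A.A", "AAAAA"))
```

  Because the distance between occupied positions is fixed, every instance of a given pattern has the same span, which is what later allows each instance to cast one vote per covered position.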
- In the context of the invention described herein, the symbol set that is used comprises the four nucleotides {A,C,G,T} found in a deoxyribonucleic acid (DNA) sequence. The input set which we processed in order to discover patterns is Release 3.0 of the RFAM database, from January 2004 (Griffiths-Jones, S. et al. Rfam: an RNA family database. Nucleic Acids Res., 31 439-441 (2003)). The use of a more-than-18-month-old release of the database as our training set was intentional. We wanted to gauge how well our method would perform if presented only with the knowledge that was available in the literature in January 2004. The analysis has since been repeated using subsequent releases of the RFAM database.
- Unlike previously published computational methods for microRNA precursor prediction, the present invention makes use of the sequence information from all the microRNAs which are contained in the RFAM release, independent of the organism in which they originate. The release in question contains microRNAs from the human, mouse, rat, worm, fly and several plant genomes. The simultaneous processing of microRNA sequences from distinct organisms permits the discovery of conserved sequences both within and across species and makes the method suitable for the analysis of more than one organism. Release 3.0 of RFAM (January 2004), which was used as our input, contained 719 microRNA precursor sequences.
- We used a scheme based on BLASTN (Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. Basic local alignment search tool. J Mol Biol. 215 403-410 (1990)) to remove duplicate and near-duplicate entries from the initial collection. The final set comprised 530 microRNA precursor sequences. In this cleaned-up set, no two sequences agreed on more than 90% of their positions. We next describe in detail the BLASTN-based cleanup scheme.
- We assume that we are given N sequences of variable length and a user-defined threshold X for the permitted, maximum remaining pair-wise sequence similarity. The sequence-based clustering scheme that we employed is shown below. Upon termination, the set CLEAN contains sequences no pair of which agrees on more than X % of the positions in the shorter of the two sequences. For our analysis, we set X=90%.
- sort the N sequences in order of decreasing length; let Si denote the i-th sequence of the sorted set (i=1, . . . , N)
- CLEAN ← {S1}
- for i=2 through N do
-   use Si as query to run BLAST against the current contents of CLEAN
-   if the top BLAST hit T agrees with Si at more than X % of Si's positions
-   then
-     make Si a member of the cluster represented by T; discard Si;
-   else
-     CLEAN ← CLEAN ∪ {Si};
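- The clustering scheme above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this disclosure: the function `positional_identity` is a crude, ungapped stand-in for a BLASTN comparison, and all names are hypothetical. Any similarity function that returns the fraction of the query's positions agreeing with the best hit could be substituted.

```python
def positional_identity(query: str, target: str) -> float:
    """Stand-in for BLASTN (an assumption of this sketch): best ungapped
    identity over all relative offsets, as a fraction of len(query)."""
    best = 0
    for offset in range(-(len(query) - 1), len(target)):
        matches = sum(
            1
            for i, ch in enumerate(query)
            if 0 <= offset + i < len(target) and target[offset + i] == ch
        )
        best = max(best, matches)
    return best / len(query)


def remove_near_duplicates(seqs, threshold=0.90, similarity=positional_identity):
    """Greedy cleanup: longest sequences first, keep a sequence only if its
    best hit in CLEAN agrees with it on at most `threshold` of its positions."""
    clean = []      # the CLEAN set, in insertion order
    clusters = {}   # representative -> discarded members
    for s in sorted(seqs, key=len, reverse=True):
        top_hit = max(clean, key=lambda t: similarity(s, t), default=None)
        if top_hit is not None and similarity(s, top_hit) > threshold:
            clusters[top_hit].append(s)   # Si joins the cluster of T
        else:
            clean.append(s)               # CLEAN <- CLEAN U {Si}
            clusters[s] = []
    return clean, clusters


# Hypothetical toy input (non-empty sequences are assumed).
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGT"]
clean, clusters = remove_near_duplicates(toy)
print(clean)
```

  Processing the sequences in order of decreasing length means each cluster is represented by its longest member, and agreement is always measured over the positions of the shorter sequence, as stated in the text.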
- This non-redundant input was then processed using the Teiresias algorithm (Rigoutsos, I. and Floratos, A. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14 55-67 (1998)) in order to discover intra- and inter-species patterns of sequence conservation. The combinatorial nature of the algorithm and the guaranteed discovery of all patterns contained in the processed input makes Teiresias a good choice for addressing this task. The nature of the patterns that can be discovered is controlled by three parameters: L, the minimum number of symbols participating in a pattern; W, the maximum permitted span of any L consecutive (not contiguous) symbols in a pattern; and K, the minimum number of instances required of a pattern before it can be reported. We also enforced a statistical significance requirement. The significance of each pattern was estimated with the help of a second-order Markov chain which was built from actual genomic data. Application of the significance filter reduced the number of patterns that were used in the subsequent phases of the algorithm. Details on the Teiresias algorithm and its properties, the three parameters L/W/K, and how to estimate log-probabilities are given below.
- The Teiresias algorithm requires that the three parameters L, W and K be set. The three parameters that control the discovery process were set to L=7, W=10 and K=2. 120,789,247 variable length patterns were discovered in the processed input set. Patterns with log-probability >−34.0 were removed resulting in a final set of 192,240 statistically-significant, microRNA precursor specific patterns. We next describe in detail how these parameters control the number and character of the discovered patterns.
- The parameter L controls the minimum possible size of the discovered patterns. The parameter W satisfies the inequality W≧L and controls the ‘degree of conservation’ across the various instances of the reported patterns. Setting W to smaller (respectively larger) values permits fewer (respectively more) mismatches across the instances of each of the discovered patterns. Finally, the parameter K controls the minimum number of instances that a pattern must have before it can be reported.
- For a given choice of L, W and K Teiresias guarantees that it will report all patterns that have K or more appearances in the processed input and are such that any L consecutive (but not necessarily contiguous) positions span at most W positions. It is important to stress that even though no pattern can have fewer than L literals, the patterns' maximum length is unconstrained and limited only by the size of the database.
- Setting L to small values permits the identification of shorter conserved motifs that may be present in the processed input. As mentioned above, even if L is set to small values, patterns that are longer than L will be discovered and reported. Generally speaking, in order for a short motif to be considered statistically significant it will need to have a large number of copies in the processed input. Setting L to large values will generally permit the identification of statistically significant motifs even if these motifs repeat only a small number of times. This increase in specificity will happen at the expense of a potentially significant decrease in sensitivity.
- For the work described herein, we selected L=7. This choice is dictated by the desire to capture potential commonalities among the seed regions of diverse microRNAs; setting L to a value that is smaller than the 6 nucleotides typically associated with the seed regions gives us added flexibility. We also set W=10, a choice that is dictated by the desire to capture sequence commonalities where the local conservation is at least 70%. In other words, any reported pattern will have more than ⅔ of its positions occupied by literals. Finally, we set K=2. This is a natural consequence of the fact that we generate conserved sequence motifs through an unsupervised pattern discovery scheme. The value of 2 is the smallest possible one (a pattern or motif, by definition, must appear at least two times in the processed input) and guarantees that all patterns will be discovered.
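- The density constraint imposed by L and W can be made concrete with a small check. The sketch below is illustrative only and assumes the pattern notation used in this disclosure (literals, bracketed sets and '.' wildcards); the helper names `literal_positions` and `is_lw_pattern` are hypothetical.

```python
import re

# One token per pattern position: a bracketed set, a single literal, or '.'.
TOKEN = re.compile(r"\[[A-Z]+\]|[A-Z]|\.")


def literal_positions(pattern: str) -> list[int]:
    """Offsets of the positions occupied by a literal or a bracketed set."""
    tokens = TOKEN.findall(pattern)
    if "".join(tokens) != pattern:
        raise ValueError(f"unsupported pattern syntax: {pattern!r}")
    return [i for i, tok in enumerate(tokens) if tok != "."]


def is_lw_pattern(pattern: str, L: int, W: int) -> bool:
    """True if the pattern has at least L literals and every run of L
    consecutive (not necessarily contiguous) literals spans at most W."""
    pos = literal_positions(pattern)
    if len(pos) < L:
        return False
    return all(pos[i + L - 1] - pos[i] + 1 <= W for i in range(len(pos) - L + 1))


# With L=7 and W=10, any window of 7 literals covers at most 10 positions,
# i.e. at least 7/10 = 70% of the window is occupied by literals.
print(is_lw_pattern("ACGT.ACG.T", L=7, W=10))
print(is_lw_pattern("ACG.T", L=7, W=10))
```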
- Step 114 is the step of statistically filtering the patterns that were generated in step 112. The step of filtering is done by estimating the log-probability of each pattern with the help of a Markov-chain. We next describe in detail how to use Markov chains to estimate the log-probabilities of patterns. The computation is carried out in the same manner for all of the patterns.
- Real genomic data was used to estimate the frequency of trinucleotides that could span as many as 23 positions; that is, there are at most 20 wildcards between the first and last nucleotide of the triplet. In other words, we computed the frequencies of all trinucleotides of the form:
- AAA
- AA.A
- AA..A
- ...
- AA....................A
- A.AA
- A.A.A
- A.A..A
- ...
- T....................TT
- With these counts at hand, we used Bayes' theorem to estimate the probability that a given pattern could be generated from a random database. Let us use the pattern A..[AT].C..T...G to describe the approach. Observe that we can write:
- Pr(A..[AT].C..T...G)=
- Pr(C..T...G/A..[AT].C..T)*Pr(A..[AT].C..T)=
- Pr(C..T...G/C..T)*Pr(A..[AT].C..T)=
- Pr(C..T...G/C..T)*Pr([AT].C..T/A..[AT].C)*Pr(A..[AT].C)=
- Pr(C..T...G/C..T)*Pr([AT].C..T/[AT].C)*Pr(A..[AT].C)=
- Pr(C..T...G/C..T)*Pr([AT].C..T/[AT].C)*Pr(A..[AT].C/A..[AT])=
- #(C..T...G)/(#(C..T...A)+#(C..T...C)+#(C..T...G)+#(C..T...T))*
- #([AT].C..T)/(#([AT].C..A)+#([AT].C..C)+#([AT].C..G)+#([AT].C..T))*
- #(A..[AT].C)/(#(A..[AT].A)+#(A..[AT].C)+#(A..[AT].G)+#(A..[AT].T))
- Note that all of the counts #(.) are available directly from the Markov chain and thus can be substituted for in the last equation. This in turn allows us to estimate the Pr(A..[AT].C..T...G) as well as the log(Pr(A..[AT].C..T...G)).
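- The final expression above is a product of one count ratio per triplet of consecutive literals. The sketch below mirrors that product under several assumptions: the counts are taken on the fly from a background string rather than from a precomputed table, the background is a hypothetical toy sequence rather than genomic data, and the names `parse`, `count_triplet` and `log_probability` are introduced only for illustration.

```python
import math
import re

TOKEN = re.compile(r"\[[ACGT]+\]|[ACGT]|\.")
ANY = set("ACGT")


def parse(pattern):
    """Return (offset, allowed nucleotides) for each literal position."""
    return [
        (i, set(tok.strip("[]")))
        for i, tok in enumerate(TOKEN.findall(pattern))
        if tok != "."
    ]


def count_triplet(background, a, b, c, d1, d2):
    """Number of background windows with a, b, c at offsets 0, d1, d1+d2."""
    return sum(
        1
        for i in range(len(background) - d1 - d2)
        if background[i] in a
        and background[i + d1] in b
        and background[i + d1 + d2] in c
    )


def log_probability(pattern, background):
    """Sum of log(#(x y z) / #(x y any)) over consecutive literal triplets."""
    literals = parse(pattern)
    logp = 0.0
    for (p1, a), (p2, b), (p3, c) in zip(literals, literals[1:], literals[2:]):
        d1, d2 = p2 - p1, p3 - p2
        numerator = count_triplet(background, a, b, c, d1, d2)
        if numerator == 0:
            return float("-inf")  # pattern never supported by the background
        denominator = count_triplet(background, a, b, ANY, d1, d2)
        logp += math.log(numerator / denominator)
    return logp


# Hypothetical background: "AC" is followed by G half the time and T otherwise.
background = "ACGACT" * 10
print(log_probability("ACG", background))
```

  In practice the counts would be tabulated once from genomic data and reused for every pattern; a pattern would then be kept or removed by comparing its log-probability against the chosen cutoff.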
- We next describe step 120, the identification of candidate regions. Step 120 is comprised of step 122 and step 124, as shown in FIG. 1A.
- Step 122 is the step of locating instances of patterns in the genomic sequences of interest. We use the 192,240 microRNA precursor patterns to locate instances in genomic sequences of interest. Typically, these sequences correspond to the intergenic and intronic regions of the genome at hand.
- We first remove all low-complexity regions from the genomic sequences to be processed using the publicly available NSEG program (Wootton, J. C. and S. Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry. 1993; 17:149-163) with default parameter settings. In the filtered sequences, we sought instances of the patterns from the microRNA-precursor-pattern-set.
- Step 124 is the step of identifying regions in the genomic sequences of minimum length and supported by a minimum number of pattern hits. An instance of a microRNA precursor pattern generates a “pattern hit” which covers as many nucleotides as the span of the corresponding pattern; this is repeated for all patterns. Each pattern contributes a support of +1 to all of the genomic sequence locations spanned by its instance. Clearly, a given nucleotide position may be hit by more than one pattern. We make use of precisely this observation to associate genomic regions which receive multiple pattern hits with putative microRNA precursors. Conversely, regions which do not correspond to microRNA precursors are expected to receive a much smaller number of hits, if any, which of course permits us to differentiate between background and microRNA precursors.
- Segments of contiguous sequence locations that received more than 60 patterns and spanned at least 60 positions were excised together with a 30-nucleotide-long flanking sequence at each end.
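- The voting and excision steps can be sketched as follows. This is an illustrative sketch rather than the prototype described herein: instances are located with a general-purpose regex engine, the function names are hypothetical, and the default thresholds simply restate the values given in the text (more than 60 hits, at least 60 positions, 30-nucleotide flanks).

```python
import re


def pattern_span(pattern: str) -> int:
    """Number of sequence positions covered by one instance of the pattern."""
    return len(re.sub(r"\[[^\]]+\]", "X", pattern))


def support_profile(sequence: str, patterns) -> list[int]:
    """Each pattern instance adds +1 to every position that it spans."""
    support = [0] * len(sequence)
    for pattern in patterns:
        span = pattern_span(pattern)
        for m in re.finditer(f"(?={pattern})", sequence):
            for pos in range(m.start(), m.start() + span):
                support[pos] += 1
    return support


def candidate_regions(support, min_hits=60, min_length=60, flank=30):
    """Half-open (start, end) intervals of contiguous positions with more than
    min_hits votes and at least min_length positions, extended by the flanks."""
    regions, start = [], None
    # The trailing sentinel closes a run that reaches the end of the sequence.
    for pos, hits in enumerate(list(support) + [float("-inf")]):
        if hits > min_hits:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_length:
                regions.append((max(0, start - flank), min(len(support), pos + flank)))
            start = None
    return regions


# Toy example with deliberately small thresholds.
print(support_profile("AAAAA", ["A.A"]))
print(candidate_regions([0, 0, 5, 5, 5, 0, 0, 0], min_hits=2, min_length=3, flank=1))
```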
- We next describe step 130, the step of subselecting among candidate regions and reporting the subselected regions. Step 130 is comprised of step 132, step 134, step 136 and step 138, as shown in FIG. 1A.
- Step 132 is the step of predicting the RNA secondary structure of the candidate sequences. With the help of the Vienna package software (Hofacker, I. L. et al. Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. 125 167-188 (1994)), we predicted the RNA secondary structure of each excised sequence. Instead of the Vienna package, we could have used the ‘mfold’ algorithm to predict the RNA secondary structure (Mathews, D. H., Sabina, J., Zuker, M. and Turner, D. H. Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure. J. Mol. Biol. 288, 911-940 (1999)).
- Step 134 is the step of filtering candidate sequences based on the energy of the structure. Only those sequences whose predicted Gibbs free energy was ≦−18 Kcal/mol were kept and reported.
- Step 136 is the step of further filtering candidate sequences based on number of bulges.
- Step 138 is the step of reporting candidate sequences as microRNA precursors.
- Lastly, as shown in step 139 of FIG. 1A, the results (e.g., predictions) of the above processes can be optionally evaluated through experiments.
- FIG. 1B is a flow diagram illustrating a method for identifying a mature microRNA sequence in a microRNA precursor sequence, according to one embodiment of the invention. In each of the candidate microRNA precursors that were identified in step 130, we sought to determine the location of the corresponding mature microRNA. To this end, we used the same method as described above, only this time we generated patterns from the set of known microRNA sequences.
- We next describe step 140, the step of generating patterns. Step 140 is comprised of step 142 and step 144, as shown in FIG. 1B.
- Step 142 is the step of processing known microRNAs to discover intra- and inter-species patterns of conserved sequence. Similar to step 112, we downloaded 644 mature microRNAs from RFAM, Release 3.0 (January, 2004). Subsequent implementations of our method described herein have used more recent versions of the RFAM database.
- Step 144 is the step of filtering discovered patterns, keeping only statistically significant patterns. As in step 114, we used a scheme based on BLASTN to remove duplicate and near-duplicate entries from the initial collection. The final set comprised 354 sequences of mature microRNAs such that no two remaining sequences agreed on more than 90% of their positions.
- The three parameters that control the discovery process were set to L=4, W=12 and K=2. 120,789,247 variable length patterns were discovered in the processed input set, typically spanning fewer than 22 positions. Patterns with log-probability >−32.0 were removed resulting in a final set of 233,554 statistically-significant, mature-microRNA patterns.
- We next describe step 150, the step of identifying mature regions. Step 150 is comprised of step 152, step 154 and step 156, as shown in FIG. 1B.
- Step 152 is the step of locating instances of patterns in the candidate precursor sequences. We sought instances of the 233,554 mature microRNA patterns, derived from the processed mature microRNA sequences, in the sequences of the microRNA precursors that were identified above. Similar methods as described above in step 122 are incorporated herein.
- Step 154 is the step of identifying regions in the candidate precursor sequences of a minimum length and supported by a minimum number of pattern hits. As before, a pattern's instance contributes a vote of “+1” to all the precursor sequence locations that the instance spans. All regions that did not overlap with the putative loop of the precursor and comprised contiguous blocks of locations that were hit by ≧60 patterns and were at least 18 nucleotides long were reported as the mature microRNAs corresponding to this precursor. Similar methods as described above in step 124 are incorporated herein.
- Step 156 is the step of reporting regions as mature microRNAs.
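- The selection rule of step 154 can be illustrated with a short sketch. It assumes that a per-position vote count for the candidate precursor is already available and that the putative loop is given as a half-open interval taken from the predicted secondary structure; both inputs, and the name `mature_regions`, are assumptions of this example rather than part of the disclosure.

```python
def mature_regions(support, loop, min_hits=60, min_length=18):
    """Report half-open (start, end) blocks hit by at least min_hits patterns,
    at least min_length long, and not overlapping the loop interval.

    min_hits is assumed to be non-negative so that the -1 sentinel below
    always terminates a block that reaches the end of the precursor.
    """
    loop_start, loop_end = loop
    regions, start = [], None
    for pos, hits in enumerate(list(support) + [-1]):
        if hits >= min_hits:
            if start is None:
                start = pos
        elif start is not None:
            overlaps_loop = start < loop_end and loop_start < pos
            if pos - start >= min_length and not overlaps_loop:
                regions.append((start, pos))
            start = None
    return regions


# Toy vote profile with deliberately small thresholds: two arms and a loop.
votes = [3] * 5 + [0] * 2 + [3] * 4
print(mature_regions(votes, loop=(5, 7), min_hits=2, min_length=4))
print(mature_regions(votes, loop=(4, 7), min_hits=2, min_length=4))
```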
- Lastly, as shown in step 159 of FIG. 1B, the results (e.g., predictions) of the above processes can be optionally evaluated through experiments.
- We next illustrate the above-described stages (‘discovery of a microRNA precursor’/‘discovery of a mature microRNA’) with the help of the C. elegans genome. In particular, we use the genomic region in the vicinity of the known microRNA precursor cel-miR-273.
- FIGS. 2A-D illustrate how, for the genomic sequence under consideration, the microRNA-precursor-patterns accumulate in the region of the precursor whereas the microRNA-precursor-patterns are absent in the other areas. For the shown example sequence, approximately 500 patterns end up contributing to genomic location 14,946,975. In fact, the contiguous genomic locations that receive support from the microRNA-precursor-patterns correspond to the known span of cel-miR-273, which is indicated by the light-grey rectangle in FIG. 2B. The region that received the substantial non-zero precursor support was examined for instances of the mature-microRNA-pattern-set. In FIG. 2C, we show how well the inventive approach localized the mature microRNA section within the cel-miR-273 precursor. The actual span of the known mature microRNA is indicated by the light-grey background.
- FIG. 3A is a graph illustrating the distribution of pattern-hit-scores for all C. elegans microRNAs within RFAM (solid line) versus generic hairpins (dashed line).
- FIG. 3B is a graph illustrating the distribution of predicted folding energies for all C. elegans microRNAs (solid line) and generic hairpins (dashed line).
- FIG. 3C is an X-Y scatter plot illustrating pattern hits versus folding energy for C. elegans microRNAs (light-grey-colored dots) and generic hairpins (dark-grey-colored dots).
- We used the 192,240 members of the microRNA-precursor-pattern-set to determine how well they covered those of the training sequences which originated in C. elegans. Almost all of the known C. elegans precursors contained ≧100 instances of the precursor patterns. The solid-line curve in FIG. 3A shows the probability density function for the number of precursors which contained a given number of pattern instances in them.
- We next generated randomly what we refer to as a generic hairpin set. This hairpin set was designed so as to comprise sequences whose geometric features were characteristic of all known microRNA precursors, namely, a hairpin-shaped secondary structure and lengths in the interval [60,120] nucleotides. First, we randomly selected numerous regions with lengths uniformly distributed between 60 and 120 nucleotides. There was no restriction as to where in the C. elegans genome these regions were located.
- Then, we inspected the predicted RNA secondary structure of these regions and kept only those which formed hairpins and did not include any low-complexity regions. Starting with an initial set of 120,000 randomly selected regions (=10,000×2 strands×6 chromosomes), and discarding as described above, we were left with a total of 20,560 generic hairpins. These hairpins are used to sample the “background” distribution of hairpins and to estimate its properties.
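- The random selection of background regions can be sketched as follows. Only the sampling step is shown; the subsequent hairpin and low-complexity filters depend on external tools and are omitted. The function names, the toy chromosome and the use of a seeded generator are assumptions of this example.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_regions(chromosome, n_per_strand, rng, min_len=60, max_len=120):
    """Draw n_per_strand regions from each strand, with lengths uniform in
    [min_len, max_len] and start positions unrestricted.

    The chromosome is assumed to be at least max_len nucleotides long.
    """
    regions = []
    for strand in "+-":
        for _ in range(n_per_strand):
            length = rng.randint(min_len, max_len)   # both ends inclusive
            start = rng.randrange(len(chromosome) - length + 1)
            seq = chromosome[start:start + length]
            if strand == "-":
                seq = reverse_complement(seq)
            regions.append((strand, start, seq))
    return regions


# Hypothetical toy chromosome; the text uses 10,000 regions per strand
# for each of the 6 C. elegans chromosomes (10,000 x 2 x 6 = 120,000).
toy_chromosome = "ACGT" * 100
sampled = sample_regions(toy_chromosome, 5, random.Random(0))
print(len(sampled))
```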
- We examined these generic hairpins for instances of the microRNA precursor patterns. The dashed-line curve in FIG. 3A shows the probability density function for the percentage of the generic hairpins that contained a certain number of pattern instances. Setting the support threshold to 60 pattern-instances captures 104 of the 114 known C. elegans microRNAs or 91%. On the other hand, less than 1% of the members of the generic hairpin set exceed this threshold. This is an important result that demonstrates that the microRNA precursor patterns capture sequence properties which are specific to microRNA precursors and can effectively distinguish them from randomly selected regions that simply happen to fold into “stem-loop-stem” structures.
- In addition to the distribution of pattern instances, we also examined the distribution of the Gibbs free energy values that are computed from the generic hairpin set (dashed-line curve) and the known C. elegans precursors (solid-line curve) and show the results in FIG. 3B. Setting the energy threshold to −25 Kcal/mol captures 107 of the 114 known C. elegans microRNA precursors or 94%, but only 7% of the sequences in the generic hairpin set pass this threshold.
- Finally, we examined how well a combination of the “energy” and the “pattern-instances” filters separates the known microRNA precursors (light-grey colored dots) from the generic hairpin set (dark-grey colored dots). The results are presented in FIG. 3C. As can be seen in FIG. 3C, there is very little correlation between these two criteria and their combined application provides a simple yet powerful discriminator. The combined threshold of ≧60 pattern instances and a predicted Gibbs energy ≦−25 Kcal/mol allows us to identify 78 of the 114 known C. elegans precursors whereas less than 1% of the generic hairpins exceed this double threshold. This translates into an estimated sensitivity of 67% for our precursor prediction method and an estimated false-positive ratio that is ≦1%.
- We repeated the above generic-hairpin analysis for the remaining three genomes of our collection. The remaining three genomes were D. melanogaster, M. musculus and H. sapiens. By way of example, such sequences are described in detail in Application No. 60/652,499, the disclosure of which is incorporated by reference herein. Also, such sequences are described in detail in the above-mentioned related U.S. patent application (YOR920040675US2), the disclosure of which is incorporated herein. The estimated false-positive ratios remained very low, and similar in magnitude to the case of C. elegans above. In particular, the estimates we generated for the false-positive ratio when predicting microRNA precursors in the other three genomes ranged from ≦1% (for hairpins with Gibbs energies of −25 Kcal/mol or less) to ≦2% (for hairpins with Gibbs energies of −18 Kcal/mol or less). Given that the four genomes span a very wide evolutionary spectrum, it is reasonable to assume that these values are characteristic of our method and independent of the identity of the genome that is used.
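- The combined “pattern-instances” and “energy” filter amounts to a conjunction of two thresholds, and the sensitivity and false-positive ratio are simply the fractions of known precursors and generic hairpins that pass it. The sketch below illustrates the bookkeeping only; the candidate values are invented toy numbers, not data from this disclosure, and the function names are hypothetical.

```python
def passes(hits: int, energy: float, min_hits: int = 60, max_energy: float = -25.0) -> bool:
    """True if a candidate has enough pattern instances and a low enough
    predicted Gibbs free energy (in Kcal/mol)."""
    return hits >= min_hits and energy <= max_energy


def fraction_passing(candidates, **thresholds) -> float:
    """Fraction of (hits, energy) pairs that satisfy both thresholds."""
    if not candidates:
        return 0.0
    return sum(passes(h, e, **thresholds) for h, e in candidates) / len(candidates)


# Invented toy values, for illustration only.
known_precursors = [(120, -30.1), (75, -26.0), (40, -28.0), (90, -20.0)]
generic_hairpins = [(5, -27.0), (61, -10.0), (2, -5.0), (70, -25.0)]

sensitivity = fraction_passing(known_precursors)
false_positive_ratio = fraction_passing(generic_hairpins)
print(sensitivity, false_positive_ratio)

# The more permissive energy cutoff that is also used for reporting results.
print(fraction_passing(known_precursors, max_energy=-18.0))
```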
FIG. 4 is a table summarizing the microRNA-precursor predictions for the genomes of C. elegans, D. melanogaster, M. musculus and H. sapiens.
We have analyzed the intergenic and intronic regions of four complete genomes, as illustrated in FIG. 4. Results are reported for two values of the Gibbs energy threshold, namely −18 Kcal/mol and −25 Kcal/mol.
As can be seen from FIG. 4, the method correctly identifies a very large percentage of the known microRNA precursors in these four genomes at the thresholds used. We also predict many novel microRNA precursors, in numbers significantly higher than what has previously been discussed in the literature. In light of the very low error-rate estimates of our method, we believe that a substantial number of our microRNA precursor predictions are likely correct.
FIG. 5 is a block diagram of a system 500 for determining whether a nucleotide sequence contains a microRNA precursor in accordance with one embodiment of the present invention. System 500 comprises a computer system 510 that interacts with media 550. Computer system 510 comprises a processor 520, a network interface 525, a memory 530, a media interface 535 and an optional display 540. Network interface 525 allows computer system 510 to connect to a network, while media interface 535 allows computer system 510 to interact with media 550, such as a Digital Versatile Disk (DVD) or a hard drive.
As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer-readable medium having computer-readable code means embodied thereon. The computer-readable program code means is operable, in conjunction with a computer system such as computer system 510, to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein. The computer-readable code is configured to generate patterns by processing a collection of already known mature microRNA sequences; assign one or more attributes to the generated patterns; subselect only the patterns whose attributes satisfy certain criteria; generate the reverse complement of the subselected patterns; and use the reverse complement of the subselected patterns to analyze the nucleotide sequence. The computer-readable medium may be a recordable medium (e.g., floppy disks, hard drive, optical disks such as a DVD, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.
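The sequence of steps listed above (generate patterns, assign attributes, subselect, reverse-complement, analyze) can be sketched in Python as follows. This is only an illustration under stated assumptions: the pattern generator here is a deliberately simple stand-in that enumerates fixed-length substrings (k-mers), the single attribute is a support count, and sequences are assumed to use the RNA alphabet A, C, G, U. All function names and the default values of `k` and `min_support` are hypothetical and are not taken from the source.

```python
from collections import Counter

# Watson-Crick complement over the RNA alphabet (assumed input alphabet).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Complement every base, then reverse the string."""
    return rna.translate(COMPLEMENT)[::-1]


def generate_patterns(sequences, k=6):
    """Stand-in pattern generator: every k-mer mapped to its support,
    i.e. the number of input sequences that contain it (the attribute)."""
    support = Counter()
    for seq in sequences:
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def subselect(patterns, min_support=2):
    """Keep only the patterns whose support meets the criterion."""
    return [p for p, s in patterns.items() if s >= min_support]


def count_instances(sequence, patterns):
    """Count all (possibly overlapping) occurrences of the patterns."""
    total = 0
    for p in patterns:
        start = sequence.find(p)
        while start != -1:
            total += 1
            start = sequence.find(p, start + 1)
    return total


def analyze(sequence, known_mature, k=6, min_support=2):
    """Run the five steps and return the number of pattern instances
    that the reverse-complemented patterns have in `sequence`."""
    patterns = subselect(generate_patterns(known_mature, k), min_support)
    probes = [reverse_complement(p) for p in patterns]
    return count_instances(sequence, probes)
```

A candidate region would then be scored by calling `analyze(region, known_mature_sequences)` and comparing the returned count with a chosen threshold.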
Memory 530 configures the processor 520 to implement the methods, steps, and functions disclosed herein. The memory 530 could be distributed or local and the processor 520 could be distributed or singular. The memory 530 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by processor 520. With this definition, information on a network, accessible through network interface 525, is still within memory 530 because the processor 520 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor 520 generally contains its own addressable memory space. It should also be noted that some or all of computer system 510 can be incorporated into an application-specific or general-use integrated circuit.
Optional video display 540 is any type of video display suitable for interacting with a human user of system 500. Generally, video display 540 is a computer monitor or other similar video display.
It is to be appreciated that, in an alternative embodiment, the invention may be implemented in a network-based implementation, such as, for example, the Internet. The network could alternatively be a private network and/or local network. It is to be understood that the server may include more than one computer system. That is, one or more of the elements of FIG. 5 may reside on and be executed by their own computer system, e.g., with its own processor and memory. In an alternative configuration, the methodologies of the invention may be performed on a personal computer and the output data transmitted directly to a receiving module, such as another personal computer, via a network without any server intervention. The output data can also be transferred without a network. For example, the output data can be transferred by simply downloading the data onto, e.g., a floppy disk, and uploading the data on a receiving module.
Presented herein is a novel and robust pattern-based methodology for the identification of microRNA precursors and their corresponding mature microRNAs directly from genomic sequence. With the help of patterns derived by processing the sequences of known microRNA precursors, our method identifies genomic regions where numerous instances of these patterns aggregate and subselects among them following energy-based filtering.
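The idea of locating regions where many pattern instances aggregate can be illustrated with a simple sliding-window count, sketched below in Python. The window length, step size, and `min_hits` default are hypothetical values chosen for the example, as are the function names; the source does not specify how regions are delimited. The subsequent energy-based filtering is not shown here.

```python
def hit_positions(sequence, patterns):
    """Start offsets of every (possibly overlapping) pattern instance."""
    positions = []
    for p in patterns:
        start = sequence.find(p)
        while start != -1:
            positions.append(start)
            start = sequence.find(p, start + 1)
    return sorted(positions)


def dense_windows(sequence, patterns, window=100, step=10, min_hits=60):
    """Return (start, end, hits) for each window in which at least
    `min_hits` pattern instances begin."""
    positions = hit_positions(sequence, patterns)
    regions = []
    last_start = max(len(sequence) - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        hits = sum(start <= pos < end for pos in positions)
        if hits >= min_hits:
            regions.append((start, end, hits))
    return regions
```

Overlapping windows that pass the count threshold would typically be merged into a single candidate region before any folding-energy check; that merging step is left out to keep the sketch short.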
The following are examples of advantages that characterize the inventive approach provided herein: a) the inventive approach obviates the need to enforce cross-species conservation filtering before reporting results, thus allowing the discovery of microRNA precursors that may not be shared even by closely related species; b) the inventive approach can be applied to the analysis of any genome that potentially harbors endogenous microRNAs without the need to be retrained each time.
Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.
Claims (58)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/351,951 US20060263798A1 (en) | 2005-02-11 | 2006-02-10 | System and method for identification of MicroRNA precursor sequences and corresponding mature MicroRNA sequences from genomic sequences |
EP06720672A EP1846433A4 (en) | 2005-02-11 | 2006-02-13 | Ribonucleic acid interference molecules and methods for generating precursor/mature sequences and determining target sites |
PCT/US2006/004949 WO2006086739A2 (en) | 2005-02-11 | 2006-02-13 | Ribonucleic acid interference molecules and methods for generating precursor/mature sequences and determining target sites |
CA002588023A CA2588023A1 (en) | 2005-02-11 | 2006-02-13 | Ribonucleic acid interference molecules and methods for generating precursor/mature sequences and determining target sites |
US12/183,204 US8795987B2 (en) | 2005-02-11 | 2008-07-31 | Ribonucleic acid interference molecules of Oryza sativa |
US12/183,166 US8912317B2 (en) | 2005-02-11 | 2008-07-31 | Ribonucleic acid interference molecules of Arabidopsis thaliana |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US65249905P | 2005-02-11 | 2005-02-11 | |
US11/351,951 US20060263798A1 (en) | 2005-02-11 | 2006-02-10 | System and method for identification of MicroRNA precursor sequences and corresponding mature MicroRNA sequences from genomic sequences |
Related Child Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/183,166 Continuation-In-Part US8912317B2 (en) | 2005-02-11 | 2008-07-31 | Ribonucleic acid interference molecules of Arabidopsis thaliana |
US12/183,204 Continuation-In-Part US8795987B2 (en) | 2005-02-11 | 2008-07-31 | Ribonucleic acid interference molecules of Oryza sativa |
Publications (1)
Publication Number | Publication Date |
---|---|
US20060263798A1 (en) | 2006-11-23 |
Family
ID=36793801
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/351,821 Abandoned US20070154896A1 (en) | 2005-02-11 | 2006-02-10 | System and method for identification of MicroRNA target sites and corresponding targeting MicroRNA sequences |
US11/351,951 Abandoned US20060263798A1 (en) | 2005-02-11 | 2006-02-10 | System and method for identification of MicroRNA precursor sequences and corresponding mature MicroRNA sequences from genomic sequences |
US11/352,152 Abandoned US20080125583A1 (en) | 2005-02-11 | 2006-02-10 | Ribonucleic acid interference molecules |
US12/135,551 Expired - Fee Related US8494784B2 (en) | 2005-02-11 | 2008-06-09 | System and method for identification of microRNA target sites and corresponding targeting microRNA sequences |
US13/283,103 Expired - Fee Related US8445666B2 (en) | 2005-02-11 | 2011-10-27 | Ribonucleic acid interference molecules |
Country Status (4)
Country | Link |
---|---|
US (5) | US20070154896A1 (en) |
EP (1) | EP1846433A4 (en) |
CA (1) | CA2588023A1 (en) |
WO (1) | WO2006086739A2 (en) |
Families Citing this family (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2566519C (en) | 2004-05-14 | 2020-04-21 | Rosetta Genomics Ltd. | Micrornas and uses thereof |
US20110021600A1 (en) * | 2006-09-04 | 2011-01-27 | Kyowa Hakko Kirin Co., Ltd. | Novel nucleic acid |
JPWO2008084319A1 (en) * | 2006-12-18 | 2010-04-30 | 協和発酵キリン株式会社 | New nucleic acid |
JP2011502515A (en) * | 2007-11-09 | 2011-01-27 | アイシス ファーマシューティカルズ インコーポレイティッド | Regulation of factor 9 expression |
WO2009148137A1 (en) * | 2008-06-04 | 2009-12-10 | 協和発酵キリン株式会社 | Nucleic acid capable of controlling degranulation of mast cell |
AU2010259295B2 (en) | 2009-06-10 | 2015-05-07 | Temasek Life Sciences Laboratory Limited | Virus induced gene silencing (VIGS) for functional analysis of genes in cotton. |
US20110269119A1 (en) * | 2009-10-30 | 2011-11-03 | Synthetic Genomics, Inc. | Encoding text into nucleic acid sequences |
US8768630B2 (en) * | 2010-02-19 | 2014-07-01 | The Regents Of The University Of Michigan | miRNA target prediction |
WO2012027467A1 (en) * | 2010-08-26 | 2012-03-01 | Merck Sharp & Dohme Corp. | RNA INTERFERENCE MEDIATED INHIBITION OF PROLYL HYDROXYLASE DOMAIN 2 (PHD2) GENE EXPRESSION USING SHORT INTERFERING NUCLEIC ACID (siNA) |
US9920317B2 (en) | 2010-11-12 | 2018-03-20 | The General Hospital Corporation | Polycomb-associated non-coding RNAs |
DK2638163T3 (en) | 2010-11-12 | 2017-07-24 | Massachusetts Gen Hospital | POLYCOMB-ASSOCIATED NON-CODING RNAs |
IN2014CN03749A (en) | 2011-10-25 | 2015-09-25 | Isis Pharmaceuticals Inc | |
AR092982A1 (en) | 2012-10-11 | 2015-05-13 | Isis Pharmaceuticals Inc | MODULATION OF THE EXPRESSION OF ANDROGEN RECEIVERS |
CN105431153A (en) * | 2013-07-11 | 2016-03-23 | 纽约市哥伦比亚大学理事会 | MicroRNAs that silence tau expression |
JP2015093226A (en) * | 2013-11-11 | 2015-05-18 | 栗田工業株式会社 | Method and apparatus for manufacturing pure water |
US10172916B2 (en) | 2013-11-15 | 2019-01-08 | The Board Of Trustees Of The Leland Stanford Junior University | Methods of treating heart failure with agonists of hypocretin receptor 2 |
EP3099795A4 (en) * | 2014-01-27 | 2018-01-17 | The Board of Trustees of the Leland Stanford Junior University | Oligonucleotides and methods for treatment of cardiomyopathy using rna interference |
US9790495B2 (en) * | 2014-05-16 | 2017-10-17 | Oregon State University | Antisense antibacterial compounds and methods |
CA2948568A1 (en) | 2014-05-19 | 2015-11-26 | David Greenberg | Antisense antibacterial compounds and methods |
WO2016054296A2 (en) * | 2014-09-30 | 2016-04-07 | California Institute Of Technology | Crosslinked anti-hiv-1 compositions for potent and broad neutralization |
WO2016070060A1 (en) | 2014-10-30 | 2016-05-06 | The General Hospital Corporation | Methods for modulating atrx-dependent gene repression |
US10604757B2 (en) | 2014-12-23 | 2020-03-31 | Syngenta Participations Ag | Biological control of coleopteran pests |
AU2015372560B2 (en) | 2014-12-31 | 2021-12-02 | Board Of Regents, The University Of Texas System | Antisense antibacterial compounds and methods |
WO2016115490A1 (en) | 2015-01-16 | 2016-07-21 | Ionis Pharmaceuticals, Inc. | Compounds and methods for modulation of dux4 |
WO2016149455A2 (en) | 2015-03-17 | 2016-09-22 | The General Hospital Corporation | The rna interactome of polycomb repressive complex 1 (prc1) |
AU2016379399B2 (en) | 2015-12-23 | 2022-12-08 | Board Of Regents, The University Of Texas System | Antisense antibacterial compounds and methods |
MA45496A (en) * | 2016-06-17 | 2019-04-24 | Hoffmann La Roche | NUCLEIC ACID MOLECULES FOR PADD5 OR PAD7 MRNA REDUCTION FOR TREATMENT OF HEPATITIS B INFECTION |
KR102585898B1 (en) | 2017-10-16 | 2023-10-10 | 에프. 호프만-라 로슈 아게 | NUCLEIC ACID MOLECULE FOR REDUCTION OF PAPD5 AND PAPD7 mRNA FOR TREATING HEPATITIS B INFECTION |
JP2021505175A (en) * | 2017-12-11 | 2021-02-18 | ロシュ イノベーション センター コペンハーゲン エーエス | Oligonucleotides for regulating the expression of FNDC3B |
BR102018003245A2 (en) * | 2018-02-20 | 2019-09-10 | Fundação Oswaldo Cruz | oligonucleotide, oligonucleotide pool, method for simultaneous detection of neisseria meningitidis, streptococcus pneumoniae and haemophilus influenzae, and, kit. |
CN112567033A (en) | 2018-07-03 | 2021-03-26 | 豪夫迈·罗氏有限公司 | Oligonucleotides for modulating Tau expression |
JP2022511077A (en) * | 2018-12-07 | 2022-01-28 | セルテオン コーポレイション | Use in matrix attachment regions and promotion of gene expression |
WO2020148349A1 (en) * | 2019-01-16 | 2020-07-23 | INSERM (Institut National de la Santé et de la Recherche Médicale) | Variants of erythroferrone and their use |
US20230193259A1 (en) * | 2019-08-13 | 2023-06-22 | Universidade De Santiago De Compostela | Compounds and methods for the treatment of cancer |
WO2024102688A1 (en) * | 2022-11-07 | 2024-05-16 | New York Society For The Relief Of The Ruptured And Crippled, Maintaining The Hospital For Special Surgery | Compositions for treating osteoclastogenesis disorders and/or rheumatoid arthritis |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US351951A (en) * | 1886-11-02 | Check-hook attachment for harness | ||
US5474796A (en) * | 1991-09-04 | 1995-12-12 | Protogene Laboratories, Inc. | Method and apparatus for conducting an array of chemical reactions on a support surface |
US6108666A (en) * | 1997-06-12 | 2000-08-22 | International Business Machines Corporation | Method and apparatus for pattern discovery in 1-dimensional event streams |
US6812339B1 (en) * | 2000-09-08 | 2004-11-02 | Applera Corporation | Polymorphisms in known genes associated with human disease, methods of detection and uses thereof |
US20070154896A1 (en) | 2005-02-11 | 2007-07-05 | International Business Machines Corporation | System and method for identification of MicroRNA target sites and corresponding targeting MicroRNA sequences |
2006
- 2006-02-10: US11/351,821 (US20070154896A1), Abandoned
- 2006-02-10: US11/351,951 (US20060263798A1), Abandoned
- 2006-02-10: US11/352,152 (US20080125583A1), Abandoned
- 2006-02-13: CA002588023A (CA2588023A1), Abandoned
- 2006-02-13: EP06720672A (EP1846433A4), Withdrawn
- 2006-02-13: PCT/US2006/004949 (WO2006086739A2), Application Filing
2008
- 2008-06-09: US12/135,551 (US8494784B2), Expired - Fee Related
2011
- 2011-10-27: US13/283,103 (US8445666B2), Expired - Fee Related
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8445666B2 (en) | 2005-02-11 | 2013-05-21 | International Business Machines Corporation | Ribonucleic acid interference molecules |
US20110125681A1 (en) * | 2008-07-11 | 2011-05-26 | Nec Soft, Ltd. | Feature extraction method, feature extraction apparatus, and feature extraction program |
US20100015669A1 (en) * | 2008-07-18 | 2010-01-21 | Ning Qin | Method for constructing microrna adenovirus expression plasmids of severe hepatitis related hfgl2, hfas and htnfr1 genes and pharmaceutical use thereof |
US9536042B2 (en) | 2013-03-15 | 2017-01-03 | International Business Machines Corporation | Using RNAi imaging data for gene interaction network construction |
US9536043B2 (en) | 2013-03-15 | 2017-01-03 | International Business Machines Corporation | Using RNAi imaging data for gene interaction network construction |
US9569584B2 (en) | 2013-03-15 | 2017-02-14 | International Business Machines Corporation | Combining RNAi imaging data with genomic data for gene interaction network construction |
US9569585B2 (en) | 2013-03-15 | 2017-02-14 | International Business Machines Corporation | Combining RNAi imaging data with genomic data for gene interaction network construction |
WO2018136936A1 (en) * | 2017-01-23 | 2018-07-26 | Srnalytics, Inc. | Methods for identifying and using small rna predictors |
CN110418850A (en) * | 2017-01-23 | 2019-11-05 | 小分子Rna分析股份有限公司 | Identification and the method for using tiny RNA predictive factor |
US10889862B2 (en) | 2017-01-23 | 2021-01-12 | Srnalytics, Llc. | Methods for identifying and using small RNA predictors |
US11028440B2 (en) | 2017-01-23 | 2021-06-08 | Gatehouse Bio, Inc. | Methods for identifying and using small RNA predictors |
Also Published As
Publication number | Publication date |
---|---|
US8494784B2 (en) | 2013-07-23 |
EP1846433A4 (en) | 2009-09-16 |
EP1846433A2 (en) | 2007-10-24 |
US8445666B2 (en) | 2013-05-21 |
CA2588023A1 (en) | 2006-08-17 |
US20070154896A1 (en) | 2007-07-05 |
US20120040460A1 (en) | 2012-02-16 |
WO2006086739A2 (en) | 2006-08-17 |
US20080125583A1 (en) | 2008-05-29 |
US20090012720A1 (en) | 2009-01-08 |
WO2006086739A8 (en) | 2008-12-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HUYNH, TIEN;MIRANDA, KEVIN CHARLES;RIGOUTSOS, ISIDORE;REEL/FRAME:018000/0332;SIGNING DATES FROM 20060619 TO 20060702 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
 | AS | Assignment | Owner name: GLOBALFOUNDRIES U.S. 2 LLC, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:036550/0001 Effective date: 20150629 |
 | AS | Assignment | Owner name: GLOBALFOUNDRIES INC., CAYMAN ISLANDS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GLOBALFOUNDRIES U.S. 2 LLC;GLOBALFOUNDRIES U.S. INC.;REEL/FRAME:036779/0001 Effective date: 20150910 |
 | AS | Assignment | Owner name: GLOBALFOUNDRIES U.S. INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GLOBALFOUNDRIES INC.;REEL/FRAME:054633/0001 Effective date: 20201022 |
 | AS | Assignment | Owner name: GLOBALFOUNDRIES U.S. INC., NEW YORK Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:WILMINGTON TRUST, NATIONAL ASSOCIATION;REEL/FRAME:056987/0001 Effective date: 20201117 |